Supervised training on a model¶
Not everything can be learned in a self-supervised manner. As a trivial example, if there are two concepts known only by the same name, the self-supervised training approach won't be able to distinguish between them.
For instance, abscess could refer either to a body structure (44132006) or a clinical finding (128477000) depending on context.
And if there's no unique name that refers to either term, they cannot be disambiguated using only self-supervised methods.
In reality, these terms can also be known by names that are distinct from one another (e.g. abscess morphology for 44132006 and abscess disorder for 128477000).
But even this would require training data that actually uses these distinct terms.
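The ambiguity can be pictured as a simple name-to-concepts mapping. As a minimal sketch (with illustrative data, not an actual MedCAT CDB structure), a name is ambiguous exactly when it maps to more than one CUI:

```python
# Sketch: spotting ambiguous names in a name -> CUIs mapping (illustrative data).
name2cuis = {
    "abscess": ["44132006", "128477000"],  # one name, two concepts -> ambiguous
    "abscess morphology": ["44132006"],    # unique name -> unambiguous
    "abscess disorder": ["128477000"],     # unique name -> unambiguous
}

# Names with more than one candidate concept need context to disambiguate.
ambiguous = [name for name, cuis in name2cuis.items() if len(cuis) > 1]
print(ambiguous)  # ['abscess']
```

Only the ambiguous names require supervised examples; the unique ones can be linked directly.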
In any case, supervised training also has the added benefit of having humans curate the names and linked concepts. While this may be achieved in many different ways, when using MedCAT we like to use a tool called MedCATtrainer.
First things first - we load the existing model.
import os
from medcat.cat import CAT
model_path = os.path.join("models", "unsup_trained_model.zip")
cat = CAT.load_model_pack(model_path)
Now that we have the model, we add the two new concepts.
We again employ the help of the CDBMaker.
Strictly speaking, this isn't necessary this time around, since the CAT instance has already created its own tokenizer.
import pandas as pd
from medcat.model_creation.cdb_maker import CDBMaker
cdb_maker = CDBMaker(cat.config, cat.cdb)
df = pd.DataFrame({"name": ["abscess", "abscess"], "cui": ["44132006", "128477000"]})
print("DF:\n", df)
cdb_maker.prepare_csvs([df])
print("CUIs:", [(cui, cat.cdb.get_name(cui)) for cui in cat.cdb.cui2info.keys()])
DF:
name cui
0 abscess 44132006
1 abscess 128477000
CUIs: [('73211009', 'Diabetes Mellitus Diagnosed'), ('396230008', 'Wagner Unverricht Syndrome'), ('44132006', 'Abscess'), ('128477000', 'Abscess')]
Now we have the concepts that we're trying to disambiguate. All we need is a dataset in which a human has gone through the mentions and annotated each with the correct concept; we can then use that for supervised training.
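Such a dataset is what MedCATtrainer exports as JSON. As a rough sketch of its shape (field names are illustrative of a typical MCT export; a real export carries additional metadata per project, document, and annotation):

```python
import json

# Sketch of a MedCATtrainer-style export: projects contain documents,
# documents contain human-verified annotations. Field names are illustrative.
mct_export_sketch = {
    "projects": [
        {
            "name": "abscess_disambiguation",
            "documents": [
                {
                    "name": "doc_0",
                    "text": "An abscess is a disorder ...",
                    "annotations": [
                        {
                            "cui": "128477000",  # concept chosen by the annotator
                            "value": "abscess",  # the annotated span in the text
                            "start": 3,
                            "end": 10,
                            "correct": True,
                        }
                    ],
                }
            ],
        }
    ]
}

# The export is plain JSON, so it round-trips cleanly.
first_ann = json.loads(json.dumps(mct_export_sketch))["projects"][0]["documents"][0]["annotations"][0]
print(first_ann["cui"])  # 128477000
```

The key point is that each annotation pins a specific span of text to a specific CUI, which is exactly the signal that self-supervised training lacked.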
First, we will verify that we cannot detect abscess before training.
abscess_text_morph = """Histopathology reveals a well-encapsulated abscess with central necrosis and neutrophilic infiltration."""
abscess_text_disorder = """An abscess is a disorder, which is a clinical condition characterized by the formation of a painful and inflamed mass containing purulent material"""
# for reuse later
def find_texts():
    for text_num, text in enumerate([abscess_text_morph, abscess_text_disorder]):
        ents = cat.get_entities(text)['entities']
        print(text_num, ":", ents)
find_texts()
0 : {}
1 : {}
Now that it's clear we can't, we definitely need some supervised training. Let's do that.
import json
# NOTE: The instances within the text were annotated by a layman and may not be clinically accurate.
# The dataset in question serves only as an example for the sake of the tutorial.
mct_export_path = os.path.join("in_data", "MCT_export_abscess.json")
with open(mct_export_path) as f:
    mct_export = json.load(f)
cat.trainer.train_supervised_raw(mct_export, use_filters=True)
print("Trained concepts:",
      [(ci['cui'], cat.cdb.get_name(ci['cui']), ci['count_train']) for ci in cat.cdb.cui2info.values() if ci['count_train']])
Epoch: 0%| | 0/1 [00:00<?, ?it/s]
Trained concepts: [('73211009', 'Diabetes Mellitus Diagnosed', 2), ('396230008', 'Wagner Unverricht Syndrome', 1), ('44132006', 'Abscess', 2), ('128477000', 'Abscesses', 2)]
What about now? Can we differentiate between the two concepts in text?
find_texts()
0 : {0: {'pretty_name': 'Abscess', 'cui': '44132006', 'type_ids': [], 'source_value': 'abscess', 'detected_name': 'abscess', 'acc': 0.99, 'context_similarity': 0.99, 'start': 43, 'end': 50, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
1 : {0: {'pretty_name': 'Abscesses', 'cui': '128477000', 'type_ids': [], 'source_value': 'abscess', 'detected_name': 'abscess', 'acc': np.float64(0.4624653188952008), 'context_similarity': np.float64(0.4624653188952008), 'start': 3, 'end': 10, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
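Both mentions are now linked to the correct concept, though with different confidence (acc of 0.99 vs. roughly 0.46). When post-processing such output, one might want to keep only confident detections. A minimal sketch, using the entity-dict shape shown above (with a hypothetical threshold; the field subset here is illustrative):

```python
# Sketch: filter MedCAT-style entity output by accuracy.
# The dict shape mirrors the printed output above; extra fields are omitted.
sample_entities = {
    0: {"pretty_name": "Abscess", "cui": "44132006", "acc": 0.99},
    1: {"pretty_name": "Abscesses", "cui": "128477000", "acc": 0.46},
}

def confident_entities(entities: dict, threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (cui, acc) pairs for detections at or above the threshold."""
    return [(ent["cui"], ent["acc"]) for ent in entities.values() if ent["acc"] >= threshold]

print(confident_entities(sample_entities))  # [('44132006', 0.99)]
```

The right threshold depends on the downstream use case; low-confidence detections like the second one above may still be worth keeping for manual review.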
Save the changed model¶
We have now fine-tuned the model, so we may want to save its state again.
new_model_folder, new_model_name = "models", "sup_trained_model"
cat.save_model_pack(new_model_folder, pack_name=new_model_name, add_hash_to_pack_name=False)
'models/sup_trained_model'