Unsupervised training for the core NER+L
We've now got a model. But it's really not a great one since it can't differentiate between concepts that share a name. We should be able to rectify that somewhat by doing some training.
First we need to load the model pack we created.
import os
from medcat.cat import CAT
# NOTE: can refer to the .zip or the folder - both will work just fine
model_path = os.path.join("models", "base_model.zip")
cat = CAT.load_model_pack(model_path)
Now we want to provide some data to teach the model the difference between 73211009 (Diabetes mellitus) and 396230008 (Dermatomyositis).
We should be able to do so by providing some data for unsupervised training that contains unambiguous names for either concept.
So let's try that.
unsup_train_texts = [
    # text regarding diabetes mellitus (73211009)
    "Diabetes mellitus is a metabolic disorder characterized by "
    "chronic hyperglycemia due to impaired insulin secretion, "
    "insulin resistance, or both. It can lead to complications "
    "such as neuropathy, nephropathy, and retinopathy if not well "
    "managed. Treatment typically involves lifestyle modifications, "
    "blood glucose monitoring, and pharmacologic interventions like "
    "insulin or oral hypoglycemics.",
    # text regarding dermatomyositis (396230008)
    "A renowned painter, once known for his intricate brushwork, "
    "found his art hindered by progressive muscle weakness and "
    "a distinctive rash on his hands. Doctors diagnosed him with "
    "dermatomyositis, an inflammatory condition affecting muscles "
    "and skin. Though his strength waned, he adapted his technique, "
    "creating expressive works that reflected his resilience in the "
    "face of illness."
]
cat.trainer.train_unsupervised(unsup_train_texts)
print("Trained concepts:",
      [(ci['cui'], cat.cdb.get_name(ci['cui']), ci['count_train'])
       for ci in cat.cdb.cui2info.values() if ci['count_train']])
print("Trained names:",
      [(ni["name"], ni["count_train"])
       for ni in cat.cdb.name2info.values() if ni["count_train"]])
Trained concepts: [('73211009', 'Diabetes Mellitus Diagnosed', 2), ('396230008', 'Wagner Unverricht Syndrome', 1)]
Trained names: [('diabetes', 1), ('diabetes~mellitus', 1), ('dermatomyositis', 1)]
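Notice that the trained names above are stored in a normalized form, with words lower-cased and joined by '~' (e.g. 'diabetes~mellitus'). A toy normalizer illustrating that convention (purely illustrative, not MedCAT's actual preprocessing code):

```python
def normalize_name(raw: str) -> str:
    """Lower-case a concept name and join its words with '~',
    matching the normalized form seen in name2info above."""
    return "~".join(raw.lower().split())

print(normalize_name("Diabetes Mellitus"))  # diabetes~mellitus
print(normalize_name("dermatomyositis"))    # dermatomyositis
```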
Note that normally one would load a larger dataset - e.g. from a CSV file - and train on that data rather than specifying the texts in code.
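For instance, reading the training texts from one column of a CSV file might look like the sketch below (the file name and column name are assumptions for illustration; here we also write a tiny CSV first so the example is self-contained):

```python
import csv

def load_texts(csv_path: str, column: str = "text") -> list[str]:
    """Read free-text documents from one column of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f)]

# Create a small example CSV to read back.
with open("train_texts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text"])
    writer.writeheader()
    writer.writerow({"text": "Diabetes mellitus is a metabolic disorder."})
    writer.writerow({"text": "Dermatomyositis affects muscles and skin."})

texts = load_texts("train_texts.csv")
print(len(texts))  # 2
# cat.trainer.train_unsupervised(texts)  # then train exactly as above
```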
example_text1 = """DM is a chronic disease caused by impaired insulin secretion."""  # definitely diabetes
example_text2 = """Patient diagnosed with DM now has chronic kidney disease."""  # probably diabetes
example_text3 = """Patient diagnosed with DM now has difficulty with their fine motor skills"""  # probably dermatomyositis
for text_num, cur_text in enumerate([example_text1, example_text2, example_text3]):
    print(text_num, ":", cat.get_entities(cur_text)['entities'])
0 : {0: {'pretty_name': 'Diabetes Mellitus Diagnosed', 'cui': '73211009', 'type_ids': [], 'source_value': 'DM', 'detected_name': 'dm', 'acc': np.float64(0.8805509317765855), 'context_similarity': np.float64(0.8805509317765855), 'start': 0, 'end': 2, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
1 : {0: {'pretty_name': 'Diabetes Mellitus Diagnosed', 'cui': '73211009', 'type_ids': [], 'source_value': 'DM', 'detected_name': 'dm', 'acc': np.float64(0.8631518546876605), 'context_similarity': np.float64(0.8631518546876605), 'start': 23, 'end': 25, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
2 : {0: {'pretty_name': 'Wagner Unverricht Syndrome', 'cui': '396230008', 'type_ids': [], 'source_value': 'DM', 'detected_name': 'dm', 'acc': np.float64(0.6594844055302075), 'context_similarity': np.float64(0.6594844055302075), 'start': 23, 'end': 25, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
Note that this only works if the context is somewhat similar to what was seen in the training data.
And because that similarity is based on our very limited Vocab in this example, and because each concept has only received one or two training examples, we won't be able to get the correct output for all texts.
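Conceptually, disambiguation picks the candidate concept whose learned context vector is most similar to the vector built from the words around the detected mention. A minimal sketch of that idea using cosine similarity (the vectors below are made up for illustration and are not real MedCAT internals):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical learned context vectors for the two candidate concepts.
concept_vectors = {
    "73211009": [0.9, 0.1, 0.2],   # diabetes mellitus
    "396230008": [0.1, 0.8, 0.5],  # dermatomyositis
}

# Hypothetical vector for the context around a detected "DM" mention.
context_vector = [0.8, 0.2, 0.1]

best_cui = max(concept_vectors,
               key=lambda cui: cosine(context_vector, concept_vectors[cui]))
print(best_cui)  # 73211009
```

With too little training data, the learned concept vectors are noisy, which is why the `fail_text` example below ends up linked to the wrong concept.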
fail_text = """Patient presented with classic signs of DM: they were thirstier than normal, felt tired and weak"""  # probably diabetes
print(cat.get_entities(fail_text)['entities'])
{0: {'pretty_name': 'Wagner Unverricht Syndrome', 'cui': '396230008', 'type_ids': [], 'source_value': 'DM', 'detected_name': 'dm', 'acc': np.float64(0.7090981523297809), 'context_similarity': np.float64(0.7090981523297809), 'start': 40, 'end': 42, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
Now that we've got a model that has received some (very limited!) training, we can save it again.
save_path = "models"
mpp = cat.save_model_pack(save_path, pack_name="unsup_trained_model", add_hash_to_pack_name=False)
print("Saved at", mpp)
Saved at models/unsup_trained_model