Evaluating semantic similarity methods for the comparison of text-derived phenotypes and clinical outcome prediction
Here we present an adaptable pipeline for deriving and comparing patient phenotypes directly from clinical text (discharge summaries, test results and clinical notes) in the MIMIC-III database. Using the semantic similarity of phenotypes, we demonstrate the potential of text-derived phenotype profiles for predicting diagnoses.
These notebooks require Python 3 (developed with Python 3.8.6) and Jupyter Notebook. Java is required to run Komenti and SML from these notebooks.
The SML toolkit (.jar), which can be downloaded from the SML website, should be placed in the working directory.
To access the MIMIC data files, an ethics course must first be completed. The files can then be downloaded from the MIMIC website, as detailed in the 'Install and setup' notebook.
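The requirements above can be checked before opening the notebooks. The following is a minimal sketch, not part of the repository: the function name `check_prerequisites` and the `*.jar` glob pattern are assumptions for illustration, since the exact file name of the SML toolkit is not given here.

```python
import shutil
import sys
from pathlib import Path


def check_prerequisites(workdir=".", jar_pattern="*.jar"):
    """Return a list of problems found; an empty list means nothing is missing.

    workdir and jar_pattern are illustrative defaults (assumptions), not
    names taken from the repository.
    """
    problems = []
    # The notebooks target Python 3.
    if sys.version_info < (3, 0):
        problems.append("Python 3 is required")
    # Komenti and SML are Java tools, so a java executable must be on PATH.
    if shutil.which("java") is None:
        problems.append("Java not found on PATH (needed for Komenti and SML)")
    # The SML toolkit .jar is expected in the working directory.
    if not any(Path(workdir).glob(jar_pattern)):
        problems.append(f"No file matching {jar_pattern} in {workdir}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Running the script from the working directory prints one warning per missing requirement and prints nothing when everything is found.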
- Install and setup: Relevant downloads, patient sampling and text annotation. Recreate our sample of 1000 patients from MIMIC or create a new sample, extract the text for each patient and apply Komenti to annotate phenotypes.
- Annotation preprocessing: Extract phenotypes from the annotations and build patient phenotype profiles, in preparation for using the Semantic Measures Library (SML).
- GenerateXML/Generate XML configuration: Produce custom XML files comprising all similarity measures available through SML (listed in 'measures_revised'), split into IC-based, non-IC-based and direct groupwise measures.
- Similarity with SML: Guidance for running SML from the command line to compare patient phenotypes.
- Results performance: Evaluate the similarity measures for predicting primary diagnosis, producing metrics files that can be used for plotting.
- Results figures: Plot ROC curves for individual measures, and histograms of the evaluation metrics across all similarity measures.
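To illustrate how a similarity measure could be scored against primary diagnosis, here is a minimal sketch under stated assumptions. It is one plausible framing and not necessarily the procedure used in the 'Results performance' notebook: each patient pair is treated as positive when both patients share a primary diagnosis, and the pair's similarity score is used as the predictor. The function names and the dictionary layouts are hypothetical.

```python
from itertools import combinations


def pairwise_labels_and_scores(diagnoses, similarity):
    """Build label and score lists over all unordered patient pairs.

    diagnoses:  dict patient_id -> primary diagnosis code (hypothetical layout)
    similarity: dict frozenset({id_a, id_b}) -> similarity score
    """
    labels, scores = [], []
    for a, b in combinations(sorted(diagnoses), 2):
        # A pair is positive when both patients share a primary diagnosis.
        labels.append(1 if diagnoses[a] == diagnoses[b] else 0)
        scores.append(similarity[frozenset((a, b))])
    return labels, scores


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a positive
    pair scores higher than a negative pair (ties count as one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
diagnoses = {"p1": "A", "p2": "A", "p3": "B"}
similarity = {
    frozenset(("p1", "p2")): 0.9,
    frozenset(("p1", "p3")): 0.2,
    frozenset(("p2", "p3")): 0.4,
}
labels, scores = pairwise_labels_and_scores(diagnoses, similarity)
auc = roc_auc(labels, scores)
```

The pairwise comparison is quadratic in the number of pairs, which is acceptable for a sketch; a rank-based computation or a library routine would be preferable for large samples.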
A copy of every notebook, with our outputs visible for reference, is available in 'Notebooks_with_output'.