Selected publication:
Oleynik, M.
Leveraging Word Embeddings for Biomedical Natural Language Processing.
PhD programme (Doctor of Philosophy); Human Medicine; [ Dissertation ] Medical University of Graz; 2020. 127 pp.
[OPEN ACCESS]
- Authors at Med Uni Graz:
  - Oleynik Michel
- Supervisors:
  - Berghold Andrea
  - Schulz Stefan
- Abstract:
- Driven by the decreasing cost of whole-genome sequencing, the field of Precision Medicine has gained traction, enabling targeted treatment choices for patients with specific biomarkers. To reach that goal, automated processing of the vast pool of unstructured data in electronic health records and online resources such as PubMed is necessary to aid time-constrained health professionals, not only in delivering precise treatments but also in building representative cohorts for new clinical trials. While Natural Language Processing (NLP) has shown great progress in several domains, where the public availability of large corpora enables Deep Learning (DL) approaches, the same has not been seen in the clinical field, owing to ethical concerns over data and model sharing. To overcome that disparity, recent advances in transfer learning methods, such as context-based Word Embeddings (WE), have facilitated partial reuse of these large models.
With the overall goal of improving precision medicine, this work leverages WE for biomedical NLP in three lines of research: (a) clinical text cleansing; (b) clinical text classification; and (c) biomedical information retrieval. In line of research (a), I demonstrated a novel method that combines WE with a minimal set of filtering rules to expand acronyms in a fully unsupervised way. This method outperformed traditional approaches based on both n-grams and a hand-crafted sense inventory. In line of research (b), I explored several methods for clinical phenotyping and cohort building. I verified that logistic regression combined with WE constituted a better model for clinical text classification than more complex DL architectures, and also determined that embeddings pre-trained on a larger corpus were not better than embeddings trained on the target dataset. Finally, in line of research (c), I proposed a method for query expansion that does not affect the precision of results in a biomedical information retrieval scenario. Using this method, I showed that WE could be effectively used to increase recall when structured resources were not available, and additionally found that the benefit of query expansion was larger on a small dataset.
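As a rough illustration of the kind of WE-based query expansion described in line of research (c), the sketch below finds the nearest neighbours of a query term by cosine similarity in embedding space. The vectors here are hypothetical toy values; the dissertation's actual embeddings (trained on biomedical text) and its precision-preserving expansion criteria are not reproduced.

```python
import math

# Hypothetical toy embedding table; real work would use vectors
# pre-trained on biomedical text such as PubMed abstracts.
embeddings = {
    "cancer":   [0.90, 0.10, 0.00],
    "tumor":    [0.85, 0.15, 0.05],
    "neoplasm": [0.80, 0.20, 0.10],
    "diabetes": [0.10, 0.90, 0.00],
    "insulin":  [0.20, 0.80, 0.10],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def expand_query(term, k=2):
    """Return the k terms closest to `term` in embedding space."""
    query_vec = embeddings[term]
    scores = {w: cosine(query_vec, v)
              for w, v in embeddings.items() if w != term}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(expand_query("cancer"))  # -> ['tumor', 'neoplasm']
```

Expanding a retrieval query with such neighbours can raise recall when curated terminologies are unavailable; keeping precision intact then depends on how aggressively candidates are filtered.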