Converting histological records into structured data by using a Dependency Grammar

  • Umwandlung histologischer Befunde in strukturierte Daten mittels einer Dependenzgrammatik

Dörenberg, Julian; van der Aalst, Wil M. P. (Thesis advisor); Dahl, Edgar (Thesis advisor)

Aachen : RWTH Aachen University (2022)
Master Thesis

Master's thesis, RWTH Aachen University, 2022


The availability of structured data is becoming an increasingly critical factor for today’s medical research. In cancer research, data from histological reports are of special interest. Still, pathologists in Germany often document their findings in free text. To make these high-quality data ready for processing by computers, it is critical to convert them to a structured form. This thesis aims to describe and implement a model which performs relation extraction in three steps. After a report is preprocessed, its sentences are parsed into a tree of grammatical relations by using a Dependency Grammar parser. As an alternative to Dependency Grammars, Link Grammars are presented and their disadvantages are substantiated. Finally, the grammatical relations returned by the Dependency Grammar parser are filtered by using regular expressions and the ontology database Unified Medical Language System (UMLS). The approach is then evaluated for the performance of the Dependency Grammar parser as well as for the performance of UMLS. The Dependency Grammar parser achieved scores of 94% for Unlabelled Attachment Score, 92% for Labelled Accuracy and 90% for Labelled Attachment Score on 200 sentences randomly selected from a corpus of 205 reports (3195 sentences in total) diagnosing breast biopsies. These scores show that Dependency Grammars can be used successfully for parsing histological reports into a structured form. The German UMLS instance is evaluated by classifying words of the corpus as either medical or non-medical. It reached a recall score of 0.22, which shows that 22% of the medical terms were correctly classified as medical. The precision score was 0.66, which indicates that 66% of the words classified as medical were in fact medical terms. The F1 score, the harmonic mean of the two previous scores, was 0.33.
These three scores show that UMLS currently does not perform well enough to extract structured data from German histological reports. Hence, alternatives are discussed in the outlook of this thesis. Finally, the whole approach, including the filtering with regular expressions, was evaluated. To do so, UMLS errors were corrected manually. On a corpus of ten histological reports from which ten different pieces of information were to be extracted, the approach extracted 98% of the information correctly.
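The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the usual token-level definitions: Unlabelled Attachment Score counts tokens with the correct head, Labelled Accuracy counts tokens with the correct relation label, and Labelled Attachment Score counts tokens where both are correct. The function names, the `Arc` type and the toy annotations are hypothetical and are not taken from the thesis; only the precision and recall values passed to `f1_score` come from the abstract.

```python
from typing import List, Tuple

# One token's annotation: (index of its head token, dependency label).
# This representation is an assumption made for illustration.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float, float]:
    """Return (UAS, LA, LAS) for two equally long token-level annotations."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    n = len(gold)
    pairs = list(zip(gold, predicted))
    uas = sum(g[0] == p[0] for g, p in pairs) / n  # correct head
    la = sum(g[1] == p[1] for g, p in pairs) / n   # correct label
    las = sum(g == p for g, p in pairs) / n        # correct head and label
    return uas, la, las


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy sentence with four tokens (hypothetical data, not from the thesis corpus).
gold = [(2, "det"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "det"), (0, "root"), (2, "nsubj"), (2, "amod")]
uas, la, las = attachment_scores(gold, pred)

# Precision and recall as reported in the abstract for the German UMLS instance.
f1 = f1_score(precision=0.66, recall=0.22)
```

With the reported precision of 0.66 and recall of 0.22, the harmonic mean is 2 · 0.66 · 0.22 / 0.88 = 0.33, which matches the F1 score stated above.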


  • Department of Computer Science [120000]
  • Chair of Computer Science 9 (Process and Data Science) [122510]