Transformer-Based Multilabel NER Using Wikipedia Corpora in Multiple Languages

wikidata 2025-05-20

Stud Health Technol Inform. 2025 May 15;327:878-879. doi: 10.3233/SHTI250488.

ABSTRACT

The high cost of manual data labeling and privacy concerns result in a considerable dearth of medical annotations in non-English texts. Recent work by Frank and Kramer [1] introduces an unsupervised approach for constructing ontology-annotated corpora from Wikipedia (https://www.wikidata.org) for German medical NER. We evaluate the proposed approach across English, German, Spanish, and French for medication and diagnosis entity recognition. Compared to the baseline, our multilabel corpora yield notable improvements in German medication detection under sparse annotations, with consistent performance across the other languages.
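
As an illustration of what "multilabel" means at the token level, the following minimal sketch assigns each token a multi-hot vector over the two entity types named above, so one token can belong to a medication mention and a diagnosis mention at once. It is not the authors' pipeline: the lexicon, the example sentence, and the function name `multi_hot_tags` are hypothetical, and simple phrase matching stands in for the ontology-based annotation.

```python
from typing import Dict, List

# The two entity types named in the abstract.
LABELS: List[str] = ["MEDICATION", "DIAGNOSIS"]

# Hypothetical phrase-to-label lexicon (an assumption for illustration,
# standing in for ontology-derived annotations).
LEXICON: Dict[str, str] = {
    "aspirin": "MEDICATION",
    "aspirin allergy": "DIAGNOSIS",
}


def multi_hot_tags(tokens: List[str],
                   lexicon: Dict[str, str],
                   labels: List[str]) -> List[List[int]]:
    """Return one multi-hot label vector per token.

    A token receives a 1 in a label's column whenever it lies inside
    any lexicon phrase carrying that label, so overlapping mentions
    of different types yield several 1s for the same token.
    """
    tags = [[0] * len(labels) for _ in tokens]
    lowered = [t.lower() for t in tokens]
    for phrase, label in lexicon.items():
        parts = phrase.split()
        col = labels.index(label)
        # Slide a window over the sentence and mark exact phrase matches.
        for start in range(len(lowered) - len(parts) + 1):
            if lowered[start:start + len(parts)] == parts:
                for i in range(start, start + len(parts)):
                    tags[i][col] = 1
    return tags


if __name__ == "__main__":
    sentence = "Patient has aspirin allergy".split()
    for token, tag in zip(sentence, multi_hot_tags(sentence, LEXICON, LABELS)):
        print(token, tag)
```

In this toy sentence, "aspirin" is tagged as both MEDICATION and DIAGNOSIS because it is a medication mention nested inside the diagnosis mention "aspirin allergy"; a single-label tagging scheme would have to drop one of the two.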

PMID:40380596 | DOI:10.3233/SHTI250488