Automatic Labeling of Diagnosis in Medical Reports in Serbian

Authors: A. R. Avdic, U. A. Marovac, D. S. Jankovic, S. S. Marovac

Keywords: Electronic health records, natural language processing, diagnosis labeling.

Abstract:

A large amount of patient health data is collected daily in medical information systems. Part of this data is unstructured text written in natural language, containing the physician’s notes on specific characteristics of the patient’s medical condition. This text may contain symptoms, diagnoses, therapies, specialties, Latin terms, and other words specific to the medical domain. Processing it could yield useful information suitable for various analyses. For the Serbian language, there are no electronic lexical resources suitable for normalizing medical texts and extracting knowledge from them, nor are there methods for marking terms in this domain. One reason is that the de-identification of patients and staff must be ensured before any method is applied. In addition, evaluating the results requires manually marked corpora of medical reports in Serbian. This paper proposes a method for identifying words belonging to diagnoses in medical texts written in Serbian using natural language processing (NLP) techniques. The proposed method is based on lexical resources, and two sets of 1000 medical reports were manually marked for research purposes. The experimental part presents the results of automatically labeling diagnoses in the marked corpus using the proposed method.
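As a rough illustration of what lexicon-based labeling of diagnosis words can look like, the following minimal Python sketch marks tokens found in a small term list and scores the result against manual labels. It is not the authors' method or resources: the toy lexicon, the sample sentence, and the names `DIAGNOSIS_LEXICON`, `normalize`, `label_diagnosis_tokens`, and `precision_recall` are all hypothetical, and the normalization step only lowercases tokens, whereas a real pipeline for Serbian would likely also apply stemming or lemmatization.

```python
import re

# Hypothetical toy lexicon of normalized diagnosis terms (illustration only,
# not the lexical resources described in the paper).
DIAGNOSIS_LEXICON = {"hypertensio", "arterialis", "diabetes", "mellitus"}

# \w is Unicode-aware for str patterns, so Serbian Latin letters such as "ž" match.
TOKEN_RE = re.compile(r"\w+")


def normalize(token):
    """Lowercase a token; a fuller pipeline would also stem or lemmatize here."""
    return token.lower()


def label_diagnosis_tokens(text, lexicon):
    """Return (token, is_diagnosis) pairs using exact lookup in the lexicon."""
    return [(tok, normalize(tok) in lexicon) for tok in TOKEN_RE.findall(text)]


def precision_recall(predicted, gold):
    """Token-level precision and recall for two equal-length lists of booleans."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up example sentence: a Latin diagnosis followed by a Serbian complaint
    # ("The patient complains of a headache").
    report = "Dg: Hypertensio arterialis. Pacijent se žali na glavobolju."
    for token, is_diagnosis in label_diagnosis_tokens(report, DIAGNOSIS_LEXICON):
        print(token, "DIAGNOSIS" if is_diagnosis else "O")
```

Comparing the predicted flags with manually assigned ones through `precision_recall` mirrors, in a very simplified way, how automatic labels can be evaluated on a manually marked corpus.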
