Creating a stop word dictionary in Serbian

Authors: Marovac U.A., Avdić A.M., Ljajić A.B.

Keywords: stop words; Serbian; text mining; natural language processing; normalization

Abstract:

Natural language processing techniques make it possible to obtain a great deal of information from documents, for example by extracting document topics, mapping document keywords, or classifying documents by content. An important step in obtaining this information is to separate the words that carry informative value in a sentence from those that do not affect its meaning. Words that do not carry meaning in a sentence are marked by using dictionaries of stop words specific to each natural language. This paper presents the creation of a stop word dictionary for Serbian. The influence of stop words on text processing is demonstrated on three different data sets. It is shown that using the proposed dictionary of Serbian stop words reduces the data set dimension by 15% to 39%, while the quality of the obtained n-gram language models is improved.
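
As an illustration of the filtering step described in the abstract, the following is a minimal Python sketch, not the authors' implementation. The stop list is a hypothetical handful of common Serbian function words rather than the dictionary built in the paper, the helper names are made up for this example, and measuring the "data set dimension" as a raw token count is an assumption.

```python
import re

# Hypothetical toy stop list: a few common Serbian function words.
# This is NOT the dictionary created in the paper.
STOP_WORDS = {"i", "u", "je", "da", "se", "na", "za", "su"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stop_words]


def token_reduction(documents, stop_words=STOP_WORDS):
    """Fraction of tokens removed by stop word filtering.

    Assumption: data set size is measured as the total token count.
    """
    total = 0
    kept = 0
    for doc in documents:
        tokens = tokenize(doc)
        total += len(tokens)
        kept += len(remove_stop_words(tokens, stop_words))
    return 0.0 if total == 0 else 1 - kept / total


def bigrams(tokens):
    """Adjacent token pairs, the units counted by a bigram language model."""
    return list(zip(tokens, tokens[1:]))


docs = ["Film je dobar i zanimljiv", "Glumci su odlični u filmu"]
filtered = remove_stop_words(tokenize(docs[0]))
print(filtered)            # ['film', 'dobar', 'zanimljiv']
print(bigrams(filtered))   # [('film', 'dobar'), ('dobar', 'zanimljiv')]
print(token_reduction(docs))
```

On real corpora the reduction depends on the stop list and on the text genre, so the toy figure printed here says nothing about the 15% to 39% range reported above.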
