Topic modelling with morphologically analyzed vocabularies

Authors: Marcus Spies

Keywords: computational morphologies; statistical topic models; latent semantic analysis; latent Dirichlet allocation; hierarchical Dirichlet processes; natural language processing

Abstract:

Probabilistic topic modeling is a text-mining technique that extracts sets of term probability distributions which can intuitively be interpreted as latent topics. Most techniques use only document-term frequency matrices as input data. In addition, topic models estimate posterior document-topic distributions that are useful for intelligent query processing in document retrieval. This paper discusses two approaches to topic modeling, one involving Dirichlet distributions and one involving Dirichlet processes. These and related approaches presume suitable text preprocessing in order to keep the parameter spaces estimated from training text corpora at manageable sizes. In the present paper, we discuss the influence of morphological preprocessing of training texts. Morphological analysis is a discipline of computational linguistics that decomposes observed terms into base lemmata. This is effected by a deep analysis of the observed terms, as opposed to the straightforward prefix or suffix elimination used in conventional stemming algorithms. Morphological preprocessing is especially effective in inflection-rich languages such as Finnish or German, and it substantially reduces the size of the training vocabulary. In addition, morphological preprocessing allows compound words to be decomposed. It is therefore of considerable interest to study the influence of morphological preprocessing on text mining and statistical topic models. In the experiments reported in the application section of this paper, morphological preprocessing significantly changed the frequency structure of the document-term matrices. Interestingly, these changes also led to substantial improvements in the model quality indicators of the topic models. Steps for further research are suggested in the concluding section.
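The pipeline described in the abstract can be sketched in a few dozen lines: a toy lemma map stands in for a full morphological analyzer (real systems such as SMOR/HFST perform a much deeper analysis), and a collapsed Gibbs sampler in the style of Griffiths and Steyvers [11] estimates the posterior document-topic distributions. The German tokens, the lemma entries, and the compound split below are illustrative assumptions, not data from the paper's experiments.

```python
import random

# Toy lemma map standing in for a full morphological analyzer such as
# SMOR/HFST; the entries (and the compound split) are illustrative only.
LEMMA = {
    "Haeuser": "Haus",                      # inflected plural -> base lemma
    "Hauses": "Haus",                       # genitive singular -> base lemma
    "Kirchen": "Kirche",
    "Bildungsangebote": "Bildung Angebot",  # compound decomposition
}

def lemmatize(text):
    """Map each token to its lemma(ta); unknown tokens pass through."""
    out = []
    for tok in text.split():
        out.extend(LEMMA.get(tok, tok).split())
    return out

docs_raw = ["Haus Haeuser Hauses Kirche", "Kirchen Bildungsangebote Haus"]
docs = [lemmatize(d) for d in docs_raw]

raw_vocab = {t for d in docs_raw for t in d.split()}
lem_vocab = {t for d in docs for t in d}
print(len(raw_vocab), len(lem_vocab))  # → 6 4: the vocabulary shrinks

def lda_gibbs(docs, K=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA on lemmatized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    widx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    z = [[rng.randrange(K) for _ in d] for d in docs]  # topic assignments
    ndk = [[0] * K for _ in docs]                      # doc-topic counts
    nkw = [[0] * V for _ in range(K)]                  # topic-word counts
    nk = [0] * K                                       # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] += 1; nkw[t][widx[w]] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wi = z[d][i], widx[w]
                ndk[d][t] -= 1; nkw[t][wi] -= 1; nk[t] -= 1   # remove token
                weights = [(ndk[d][k] + alpha) * (nkw[k][wi] + beta)
                           / (nk[k] + V * beta) for k in range(K)]
                t = rng.choices(range(K), weights)[0]         # resample topic
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][wi] += 1; nk[t] += 1
    # Posterior document-topic distributions (smoothed, normalized counts).
    theta = [[(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    return vocab, theta

vocab, theta = lda_gibbs(docs)
print(theta)  # one topic distribution per document
```

Swapping `lemmatize` for a plain suffix-stripping stemmer (or for the identity function) and comparing vocabulary sizes and model quality indicators is exactly the kind of comparison the paper's experiments perform at full scale.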

References:

[1] Blei, D.M. (2003) lda-c. URL: http://www.cs.princeton.edu/~blei/lda-c/
[2] Blei, D.M., Griffiths, T.L., Jordan, M.I. (2010) The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2): 1-30
[3] Blei, D.M., Wang, C. (2013) Hierarchical Dirichlet process (with split-merge operations). URL: https://github.com/blei-lab/hdp
[4] Blei, D.M., Ng, A.Y., Jordan, M.I. (2003) Latent Dirichlet allocation. Journal of Machine Learning Research, 3 (Jan): 993-1022
[5] Choffrut, C., Culik II, K. (1983) Properties of finite and pushdown transducers. SIAM Journal on Computing, 12(2): 300-315
[6] Dietz, L. (2011) Directed factor graph notation for generative models. URL: https://github.com/jluttine/tikz-bayesnet
[7] Feinerer, I. (2012) An introduction to the tm package: Text mining in R. R News, 8(2): 19-22
[8] Ferguson, T.S. (1973) A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2): 209-230
[9] Frigyik, B.A., Kapila, A., Gupta, M.R. (2010) Introduction to the Dirichlet distribution and related processes. Tech. rep., University of Washington, Dept. of Electrical Engineering
[10] Getoor, L., Taskar, B. (eds.) (2007) Introduction to Statistical Relational Learning. Cambridge, MA: MIT Press
[11] Griffiths, T.L., Steyvers, M. (2004) Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1): 5228-5235
[12] Grün, B., Hornik, K. (2012) topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13): 1-30. URL: http://www.jstatsoft.org/v40/i13
[13] Heckerman, D., Meek, C., Koller, D. (2007) Probabilistic entity-relationship models, PRMs, and plate models. In: Getoor, L., Taskar, B. (eds.) Introduction to Statistical Relational Learning, Cambridge, MA: MIT Press, pp. 201-238
[14] Kalman, D. (1996) A singularly valuable decomposition: The SVD of a matrix. College Mathematics Journal, 27(1): 2
[15] Kschischang, F.R., Frey, B.J., Loeliger, H.-A. (2001) Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2): 498-519
[16] Landauer, T.K., Dumais, S.T. (1997) A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2): 211
[17] Linden, K., Pirinen, T. Weighting finite-state morphological analyzers using HFST tools. URL: http://www.researchgate.net/publication/228912097
[18] Linden, K., et al. (2011) HFST: A system for creating NLP tools. Tech. rep., University of Helsinki, Dept. of Modern Languages
[19] Manning, C.D., Raghavan, P., Schütze, H. (2009) An Introduction to Information Retrieval. URL: http://www.informationretrieval.org
[20] Minka, T.P., Winn, J. (2005) Gates: A graphical notation for mixture models. Tech. rep. MSR-TR-2005-173, Microsoft Research
[21] Müller, C. (2015) Identifizieren gemeinsamer inhaltlicher Kriterien in einem heterogenen kirchlichen Bildungsangebot mithilfe von Topic Modelling-Ansätzen [Identifying common content criteria in a heterogeneous church educational programme using topic modelling approaches]. Master's thesis, LMU University of Munich, Chair of Knowledge Management
[22] Schmid, H. A programming language for finite state transducers. URL: http://www.cis.uni-muenchen.de/~schmid/tools/SFST
[23] Schmid, H. Stuttgart morphology for German (SMOR). URL: http://www.cis.uni-muenchen.de/~schmid/tools/SMOR
[24] Sethuraman, J. (1994) A constructive definition of Dirichlet priors. Statistica Sinica, 4: 639-650
[25] Spies, M., Jungemann-Dorner, M. (2013) Big textual data analytics and knowledge management. In: Akerkar, R. (ed.) Big Data Computing, Informa UK Limited, pp. 501-537
[26] Steyvers, M., Griffiths, T. (2005) Matlab topic modeling toolbox. URL: http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
[27] Sutton, C., McCallum, A. (2007) An introduction to conditional random fields for relational learning. In: Getoor, L., Taskar, B. (eds.) Introduction to Statistical Relational Learning, Cambridge, MA: MIT Press
[28] Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M. (2006) Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476): 1566-1581
[29] Vorontsov, K., Potapenko, A. (2014) Tutorial on probabilistic topic modeling: Additive regularization for stochastic matrix factorization. Cham: Springer, pp. 29-46
[30] Yao, L., Mimno, D., McCallum, A. (2009) Efficient methods for topic model inference on streaming document collections. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), New York: ACM, p. 937