Germany's next topic model
The aim of this talk is to present a novel way of detecting topics that is especially suited to user-generated content, where topics are not as clearly separated as in the typical examples of Wikipedia or newsgroup articles. The basic idea is to compute a contextual similarity score between words; this score defines a network in which we can identify clusters through community detection.
Tags: Artificial Intelligence, Deep Learning & Artificial Intelligence, Data Science, Networks, NLP, Machine Learning, Visualisation
Scheduled on Wednesday 12:20 in room cubus
Thomas Mayer is an NLP Data Scientist at HolidayCheck, located in Munich.
Identifying topics in user-generated content like hotel reviews turns out to be difficult with the standard approach of LDA (Latent Dirichlet Allocation; Blei et al., 2003). Hotel reviews usually differ much less from one another in the topics they cover than texts from other genres such as Wikipedia or newsgroup articles, where each document commonly contains only a very small set of topics.
We therefore developed our own approach to topic modeling that is specifically tailored to non-edited texts like hotel reviews. The approach can be divided into three major steps. First, using the concept of second-order cooccurrences, we define a contextual similarity score that enables us to identify words that are similar with respect to certain topics. Second, this score allows us to build a topic network in which the nodes are words and the edges are weighted by the contextual similarity between them. With the help of algorithms from graph theory, such as the Infomap algorithm (Rosvall and Bergstrom, 2008), we are able to detect clusters of highly connected words that can be identified as topics in our review texts. Third, we use these clusters and their words to compute a topic similarity score for each word in the network. In other words, we transform a hard clustering, in which each word belongs to exactly one topic, into a probability score that expresses how likely it is that a given word belongs to a given topic/cluster.
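The first two steps can be illustrated with a minimal sketch. It is not the implementation presented in the talk: the toy reviews, the cosine measure over cooccurrence vectors, the threshold of 0.5, and all function names are assumptions made for illustration, and connected components are used as a crude stand-in for Infomap so that the example needs only the standard library.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence_vectors(docs):
    """First-order cooccurrences: word -> counts of words sharing a document."""
    vectors = defaultdict(Counter)
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarity_network(vectors, threshold):
    """Second-order step: link words whose cooccurrence vectors are similar."""
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            edges[(a, b)] = score
    return edges


def connected_components(nodes, edges):
    """Stand-in for community detection (NOT Infomap): plain connected components."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical toy reviews, already tokenised.
docs = [
    ["room", "bed", "clean"],
    ["room", "bed", "quiet"],
    ["breakfast", "buffet", "coffee"],
    ["breakfast", "buffet", "fresh"],
]
vectors = cooccurrence_vectors(docs)
edges = similarity_network(vectors, threshold=0.5)
clusters = connected_components(vectors.keys(), edges)
for cluster in clusters:
    print(sorted(cluster))
```

In this toy data, "clean" and "quiet" never occur in the same review, yet they are linked because both cooccur with "room" and "bed". That is the point of a second-order score. On real data a proper community detection algorithm would replace the connected-components step.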
The presentation is structured as follows:
- short overview of existing topic modeling approaches
- shortcomings of these approaches with respect to our domain (hotel review texts)
- explaining the contextual similarity score and its relationship to word embeddings
- topic modeling step through community detection
- turning the hard clustering into a fuzzy topic model
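The last item, turning a hard clustering into a fuzzy topic model, could look roughly like the sketch below. The abstract does not specify how the score is computed, so the rule used here (mean similarity of a word to the other members of each cluster, normalised to sum to 1) is an assumption, as are the hand-picked similarity values and all names.

```python
def soft_memberships(words, clusters, similarity):
    """Map each word to one membership score per cluster; scores sum to 1.

    Assumed rule: mean similarity to the other members of each cluster,
    normalised across clusters. `similarity` is any symmetric function.
    """
    memberships = {}
    for word in words:
        raw = []
        for cluster in clusters:
            others = [w for w in cluster if w != word]
            mean = sum(similarity(word, w) for w in others) / len(others) if others else 0.0
            raw.append(mean)
        total = sum(raw)
        if total:
            memberships[word] = [r / total for r in raw]
        else:
            # No evidence at all: fall back to a uniform distribution.
            memberships[word] = [1.0 / len(clusters)] * len(clusters)
    return memberships


# Hypothetical contextual similarity scores; missing pairs count as 0.
scores = {
    frozenset(("room", "bed")): 0.8,
    frozenset(("breakfast", "buffet")): 0.9,
    frozenset(("terrace", "breakfast")): 0.5,
    frozenset(("terrace", "buffet")): 0.25,
    frozenset(("terrace", "room")): 0.25,
}


def similarity(a, b):
    return scores.get(frozenset((a, b)), 0.0)


# Hard clustering as it might come out of the community detection step.
clusters = [{"room", "bed"}, {"breakfast", "buffet", "terrace"}]
words = sorted(set().union(*clusters))
memberships = soft_memberships(words, clusters, similarity)
print(memberships["terrace"])
```

Here "terrace" is assigned to the breakfast cluster by the hard clustering, but the fuzzy scores also give it some weight in the room cluster, which reflects its similarity to "room".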
References:
- David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3 (2003), pp. 993–1022, ISSN 1532-4435
- M. Rosvall and C. T. Bergstrom: Maps of information flow reveal community structure in complex networks. PNAS 105, 1118 (2008), http://dx.doi.org/10.1073/pnas.0706851105, http://arxiv.org/abs/0707.0609