Topic modeling of the flow of text documents
Abstract. This paper describes the creation of a Russian-language text corpus designed for testing probabilistic topic modeling algorithms. Wikinews articles, licensed under Creative Commons Attribution 2.5 Generic (CC BY 2.5), were used as the source of texts for the corpus. The text preprocessing and markup stages are described in the conclusion. We propose an original markup of the text corpus for testing topic modeling algorithms. Keywords: text corpora, topic model, natural language processing, Russian language.
Abstract. In this paper, we describe an approach to multi-label classification of text documents based on probabilistic topic modeling. On the basis of SCTM-ru, a topic model was built using supervised learning. A multi-label classification algorithm is presented. We propose tools for multi-label classification that implement this approach. Keywords: multi-label classification, supervised learning, topic model, natural language processing.
Introduction: Due to the continuous growth of the internet and the increasing volume of news, email messages, blog posts, etc., natural language processing systems are in high demand. A popular and promising direction in machine learning and natural language processing is the development of topic modeling algorithms. Most topic modeling methods deal with static information and a limited vocabulary. In practice, however, we need tools that work with a refillable vocabulary. New words appear every year and some words become obsolete, so refillable vocabularies are especially important for online topic models. Purpose: We develop an approach that determines the topical vector of a new word using the Hadamard product of the topical vectors of the documents in which the word was found. This approach is an alternative to the use of a Dirichlet distribution or a Dirichlet process. Results: The research has shown that summing the topical vectors of the documents containing a new word gives a misleading picture of the topic of that word. The Hadamard product is better suited to deriving the topic of a new word from the topics of the documents. Multiplying the topical vectors of the documents containing a new word entrywise cancels the topics that do not overlap and separates out the common topics with similar meanings. The result is a topical vector for the new word with the highest probability values for the few most important topics. The values of weakly expressed topics either approach zero or are reset to zero. Practical relevance: The proposed algorithm allows the online vocabulary of a topic model to be expanded without limit and, hence, to cover both new and old words. Keywords: topic model, natural language processing, machine learning.
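The contrast between summing and entrywise multiplication described in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy document vectors, and the final renormalization to a probability vector are assumptions made here for illustration only.

```python
from math import prod


def hadamard_topic_vector(doc_vectors):
    """Entrywise (Hadamard) product of document topic vectors.

    The renormalization to sum 1 is an assumption of this sketch,
    not a step stated in the abstract.
    """
    if not doc_vectors:
        raise ValueError("at least one document vector is required")
    n_topics = len(doc_vectors[0])
    if any(len(v) != n_topics for v in doc_vectors):
        raise ValueError("all vectors must have the same number of topics")
    product = [prod(v[k] for v in doc_vectors) for k in range(n_topics)]
    total = sum(product)
    if total == 0:
        # The documents share no topic; return the all-zero vector as is.
        return product
    return [p / total for p in product]


def summed_topic_vector(doc_vectors):
    """Baseline for comparison: entrywise sum, renormalized to sum 1."""
    n_topics = len(doc_vectors[0])
    summed = [sum(v[k] for v in doc_vectors) for k in range(n_topics)]
    total = sum(summed)
    return [s / total for s in summed]


# Hypothetical topic vectors of three documents containing the new word.
docs = [
    [0.60, 0.30, 0.10, 0.00],
    [0.50, 0.00, 0.10, 0.40],
    [0.70, 0.10, 0.10, 0.10],
]

hadamard = hadamard_topic_vector(docs)
summed = summed_topic_vector(docs)

print("Hadamard:", [round(x, 4) for x in hadamard])
print("Sum:     ", [round(x, 4) for x in summed])
```

In this toy input, topics 1 and 3 are each absent from one document, so the product sets them to zero, while the sum still assigns them noticeable weight. Only the topic shared by all three documents dominates the product.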
Abstract. The paper introduces an approach to topic model visualization that offers a wide choice of visualization methods, a user-friendly representation of the model, and simple integration into applications. The existing approaches to topic model visualization have been analyzed, and a system has been developed that allows the user to choose the data source for topic models, change the modeling parameters, and visualize the result of topic modeling with IPython. An example of topic model visualization has been built using the SCTM-en corpus of original news text.