Reading notes #4: Topic modelling
Introduction to Topic Modelling
The amount of data generated by mankind has become unmanageable. Thousands of news and scientific articles are being published every day. It is now impossible for people to read all of this information. Yet, the ability to follow what is going on in the world can be a great asset for companies and individuals.
Topic models are one method used to condense and simplify the information contained in a set of documents by extracting latent topics such as space, coronavirus, or war. Extracting these topics allows users to quickly understand huge corpora and then focus their attention on what matters.
However, topics extracted with topic models are not single words. An extracted topic is actually a set of words that tend to co-occur. While the topic model extracts these sets, it is still up to the human user to interpret them. For example, we might extract the topic (space, rocket, launch, galaxy, star). This set of words can then be interpreted as the topic of space.
There are many kinds of topic models. For example, flat topic models extract high-level topics. This provides an overview of a corpus. To get a deeper understanding, we can use hierarchical topic models. They allow us to extract topics and sub-topics. Hence, the topic of coronavirus may have sub-topics such as tests, masks, remote working, supply chain issues, etc.
Topic models are important methods in the field of NLP. A lot of research is being done to develop better models, study the quality of extracted topics, and design better training algorithms. In this article, I will present a few papers related to topic models that have caught my eye recently.
Topics extracted by topic models are sets of words that require human interpretation. In this paper, the authors demonstrate that this interpretation is not always self-evident. Depending on their knowledge, some annotators might not be able to interpret topics correctly. This can occur when topics are made up of highly technical terms. The situation is worse in hierarchical topic models, where deeper sub-topics become extremely esoteric. Hence, this paper argues that topic modelling needs domain experts to be used effectively.
This paper demonstrates that the feature selection process is essential for topic models. Surprisingly, many implementations of topic models put little emphasis on data cleaning and feature selection. Yet the paper shows that keeping only the nouns from the texts makes the resulting topics much more interpretable.
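To make the noun-only idea concrete, here is a minimal sketch of the filtering step. It assumes the tokens have already been tagged by a part-of-speech tagger of your choice and that the tags follow the Universal POS convention ("NOUN", "VERB", ...). The function name `keep_nouns` and the input format are my own illustration, not something taken from the paper.

```python
def keep_nouns(tagged_tokens, noun_tags=frozenset({"NOUN"})):
    """Keep only the tokens whose part-of-speech tag is in noun_tags.

    tagged_tokens: iterable of (token, tag) pairs (hypothetical input format,
    assumed to come from an external POS tagger).
    """
    return [token.lower() for token, tag in tagged_tokens if tag in noun_tags]


# Toy, hand-tagged sentence used only for illustration.
doc = [
    ("The", "DET"),
    ("rocket", "NOUN"),
    ("quickly", "ADV"),
    ("reached", "VERB"),
    ("orbit", "NOUN"),
]
nouns = keep_nouns(doc)
print(nouns)
```

The filtered token lists would then be passed to the topic model in place of the full texts. Whether proper nouns should be kept as well is a choice this sketch leaves to the `noun_tags` parameter.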
This paper proposes a novel topic model that models temporality. This allows users to study the evolution of topics through time. This can be used to detect new topics, disappearing topics, or changes in a topic through time. In particular, this model provides a tree structure where the data is sliced into smaller and smaller periods. This allows users to study periods of various sizes, from years to days, with a single model.
This paper also provides a study of topics through time. This method extracts topics at different time periods and then links similar topics through time. The model is then capable of studying the evolution of topics. Specifically, we can see topics splitting and fusing with time. For example, the topic of space in one period might divide itself into two topics in another: space exploration and astronomy.
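The linking step can be sketched as follows. The notes above do not say which similarity measure the paper uses, so this example assumes cosine similarity between topic-word distributions and an arbitrary threshold of 0.3. The names `link_topics`, `period_1` and `period_2`, as well as the toy probabilities, are made up for illustration.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topics given as {word: weight} dicts."""
    dot = sum(weight * q.get(word, 0.0) for word, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def link_topics(earlier, later, threshold=0.3):
    """Link each topic of one period to every similar topic of the next period."""
    links = []
    for name_a, topic_a in earlier.items():
        for name_b, topic_b in later.items():
            similarity = cosine(topic_a, topic_b)
            if similarity >= threshold:  # threshold is an assumed value
                links.append((name_a, name_b, similarity))
    return links


# Toy topics: one "space" topic that splits into two in the next period.
period_1 = {"space": {"space": 0.4, "rocket": 0.3, "star": 0.3}}
period_2 = {
    "exploration": {"space": 0.5, "rocket": 0.4, "launch": 0.1},
    "astronomy": {"star": 0.5, "galaxy": 0.3, "space": 0.2},
    "health": {"virus": 0.6, "mask": 0.4},
}
links = link_topics(period_1, period_2)
for source, target, similarity in links:
    print(f"{source} -> {target} ({similarity:.2f})")
```

In this toy setup, a topic with two outgoing links corresponds to a split, and a topic with two incoming links would correspond to a fusion.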
Topic modelling may sometimes fail to produce interpretable topics. Indeed, it is an automated process that is agnostic to the semantics of words. Hence, there is no guarantee that extracted topics make sense. This paper provides an interactive interface for studying topic models and lists reasons why extracted topics may appear suspicious:
1. Two or more topics are merged into one topic.
2. Two topics are extracted that, to humans, look like duplicates.
3. Extracted keywords of topics do not seem to make sense.
4. Topics contain too many generic terms.
5. Topics are based on seemingly unrelated terms.
6. Topics do not match human judgement.
7. Topics appear irrelevant.
8. The relationship between topics and documents is not apparent.
Each of these issues can be solved to some degree with better data cleaning and filtering, better parameterization of the model, or more knowledgeable annotators.
Topics extracted with topic models are sets of co-occurring words. These words are usually ranked by their probability in that topic. However, this paper proposes a better measure of word relevance: FREX. FREX combines the probability of a word in a given topic with its exclusivity to that topic. Exclusivity captures how specific a word is to one topic compared with all the other topics. The authors show that this measure can help interpret topics.
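As a rough sketch of how such a score can be computed, the example below uses one common formulation of FREX as I understand it: a weighted harmonic mean of a word's rank (empirical CDF) in frequency and in exclusivity within a topic. The weight of 0.5, the helper names, and the toy probabilities are assumptions of mine; the paper should be consulted for the exact definition.

```python
def ecdf(values):
    """Empirical CDF of each value: share of entries less than or equal to it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def frex(beta, weight=0.5):
    """FREX scores for a topic-word matrix beta[k][v] = P(word v | topic k).

    Assumes every word has a non-zero probability in at least one topic.
    weight balances exclusivity (weight) against frequency (1 - weight).
    """
    n_words = len(beta[0])
    column_sums = [sum(row[v] for row in beta) for v in range(n_words)]
    scores = []
    for row in beta:
        exclusivity = [row[v] / column_sums[v] for v in range(n_words)]
        excl_rank = ecdf(exclusivity)
        freq_rank = ecdf(row)
        scores.append([
            1.0 / (weight / e + (1.0 - weight) / f)
            for e, f in zip(excl_rank, freq_rank)
        ])
    return scores


# Toy example: "the" is frequent in both topics, "rocket" is specific to topic 0.
vocab = ["the", "rocket", "launch", "virus"]
beta = [
    [0.40, 0.30, 0.25, 0.05],  # topic 0
    [0.40, 0.05, 0.05, 0.50],  # topic 1
]
scores = frex(beta)
top_by_probability = vocab[max(range(len(vocab)), key=lambda v: beta[0][v])]
top_by_frex = vocab[max(range(len(vocab)), key=lambda v: scores[0][v])]
print(top_by_probability, top_by_frex)
```

In this toy example, ranking by probability puts the generic word "the" first for topic 0, whereas ranking by FREX favours "rocket", which is both frequent in and specific to that topic.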
This paper proposes using topic models to discover topic transitions in text. The authors divide long texts into multiple snippets that are fed into a topic model. Then, for each pair of consecutive snippets, they look at how the topic distribution differs in order to detect transitions. To evaluate their model, they concatenate different texts to produce known transitions that can be tested. They also provide a good summary of the literature on how text length affects topic model performance.
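The comparison of consecutive snippets can be sketched as below. The notes do not specify which distance the authors use, so this example assumes the Jensen-Shannon divergence and an arbitrary threshold of 0.3. The function names and the toy topic distributions are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, between 0 and 1) of two distributions."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def find_transitions(snippet_topics, threshold=0.3):
    """Return the indices of snippets that start a new topic.

    snippet_topics: one topic distribution per snippet, in reading order.
    threshold: assumed cut-off above which a change counts as a transition.
    """
    return [
        i + 1
        for i in range(len(snippet_topics) - 1)
        if js_divergence(snippet_topics[i], snippet_topics[i + 1]) > threshold
    ]


# Toy input: two snippets dominated by topic 0, then two dominated by topic 1.
snippets = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
transitions = find_transitions(snippets)
print(transitions)
```

With concatenated texts, as in the paper's evaluation setup, the detected indices can be compared against the known boundaries between the original texts.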