NLP Reading notes #1

18 November 2021 By Judicael Poumay (Ph.D.)


As a PhD student I spend a large part of my time reading. I read about things that interest me. I read about things that are related to my current research and I read things related to my field to stay up to date. However, with all this reading I tend to accumulate a lot of notes that I don’t know what to do with. Moreover, there is a clear lack of blogs discussing the current advances in the scientific literature in NLP. Hence, this article will start a series in which I will provide a summary of various articles that caught my interest recently.

Rare Words Degenerate All Words (S.Yu, et al. 2021)

Word embeddings are a way to encode the meaning of words into vectors of numbers. They provide an efficient way to manipulate words using mathematical tools and as such are used in many NLP applications. However, their underlying mechanisms are complex and hard to grasp. Thus, a lot of scientific work is done to study the mathematical space they create.

Ideally, word embeddings would spread words uniformly across the mathematical space. However, many studies have demonstrated that in practice word embeddings tend to degenerate into a narrow-cone distribution. This uneven distribution means that words are encoded as more similar to each other than they really are. This is problematic because the embedding then encodes partially wrong information about words.
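One rough way to see the narrow-cone effect is to average the cosine similarity over all pairs of vectors: a value close to 1 means the vectors all point in nearly the same direction. The sketch below uses made-up 2-D vectors purely for illustration; it is not taken from any of the studies mentioned, and real embeddings have hundreds of dimensions.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings (invented for this example).
spread = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]  # evenly spread
cone = [(1.0, 0.1), (1.0, 0.2), (1.0, -0.1), (1.0, 0.0)]     # narrow cone

spread_score = mean_pairwise_cosine(spread)
cone_score = mean_pairwise_cosine(cone)
print(spread_score, cone_score)
```

The evenly spread vectors give a negative average, while the cone vectors give an average close to 1, which is the "everything looks similar" situation described above.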

This study shows that one of the biggest causes of degeneration is rare words. Indeed, rare words by their nature appear in fewer contexts. Consequently, rare words are less stable statistically as their usage varies more from corpus to corpus. Hence, this study provides a potential solution they call Adaptive Gradient Partial Scaling. As the gradient can be decomposed with respect to each word, the part of the gradient coming from rarer words can be scaled down to reduce their impact and alleviate the problem of degeneration. In other words, this method reduces the impact of rare words on common words when learning the embedding.
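To make the idea of scaling part of the gradient more concrete, here is a minimal sketch. It is not the paper's algorithm: the fixed frequency threshold, the constant scaling factor, and the function name `scale_rare_gradients` are all assumptions of mine, chosen only to show how per-word gradient contributions could be shrunk before they are summed.

```python
def scale_rare_gradients(grads_by_word, counts, rare_threshold, scale):
    """Sum per-word gradient contributions, shrinking those from rare words.

    grads_by_word: dict mapping a word to its gradient contribution (list of floats)
    counts: dict mapping a word to its corpus frequency
    rare_threshold, scale: hypothetical hyperparameters for this sketch
    """
    dim = len(next(iter(grads_by_word.values())))
    total = [0.0] * dim
    for word, grad in grads_by_word.items():
        # Contributions from rare words are multiplied by `scale` (< 1).
        factor = scale if counts[word] < rare_threshold else 1.0
        for i, g in enumerate(grad):
            total[i] += factor * g
    return total

# Toy numbers, invented for illustration.
counts = {"the": 1000, "zymurgy": 2}
grads = {"the": [0.5, -0.5], "zymurgy": [2.0, 2.0]}

update = scale_rare_gradients(grads, counts, rare_threshold=10, scale=0.1)
unscaled = scale_rare_gradients(grads, counts, rare_threshold=10, scale=1.0)
print(update, unscaled)
```

Without scaling, the rare word dominates the update; with scaling, the common word's contribution is no longer drowned out.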

Lacking the embedding of a word? Look it up into a traditional dictionary (E.Ruzzetti, et al. 2021)

Another problem with word embeddings is that they often cannot deal with words they do not know. Thus, these so-called out-of-vocabulary words are often ignored when processing textual data. These are often rare or new words, but it would be nice if we could still create an embedding for them.

This paper proposes a simple solution. When encountering an out-of-vocabulary word, they look up its definition in a dictionary. The words used in the definition are more likely to have known embeddings, which can be combined to produce an embedding for the out-of-vocabulary word. More specifically, they select the two most relevant words from a definition. Their experiments show that their technique provides better performance than classical word embeddings.
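The general idea can be sketched in a few lines. This is a simplification under my own assumptions: it averages the embeddings of all known words in the definition, whereas the paper selects the two most relevant words and uses its own way of mixing them. The function name `embed_from_definition` and the toy vectors are hypothetical.

```python
def embed_from_definition(definition, embeddings):
    """Average the embeddings of the known words in a dictionary definition.

    Returns None when no word of the definition has a known embedding.
    """
    known = [embeddings[w] for w in definition.lower().split() if w in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy vocabulary and definition, invented for illustration.
embeddings = {"small": [0.0, 1.0], "dog": [1.0, 0.0], "cat": [0.8, 0.2]}
definition = "a small dog"  # pretend this defines an out-of-vocabulary word

oov_vector = embed_from_definition(definition, embeddings)
print(oov_vector)
```

Here "a" has no embedding and is skipped, so the result is the average of the vectors for "small" and "dog".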

Phonetic Word Embeddings (R.Sharma, et al. 2021)

Words are not just written; they are also spoken. Hence, this paper proposes a novel method of word embedding using phonetic information. Using this method, similar-sounding words such as ate, eight, or weight will have similar vector representations. This kind of embedding can be useful for downstream phonology tasks such as poetry generation.

On the Universality of Deep Contextual Language Models (S.Bhatt, et al. 2021)

Broadly speaking, a language model is a system capable of predicting a word given previous words. It has many applications from text generation to chatbots. This paper attempts to define the characteristics necessary for a language model to be universal. They define 7 dimensions:

1) Language : there are over 7000 languages in the world and a language model should be able to deal with all of them, or at least those used by most people.

2) Multilingualism : some people sometimes talk using two or more languages at the same time (e.g. English/French). A good language model should be able to deal with such a situation.

3) Tasks : a language model learns to create word embeddings that can be used for a variety of tasks. A good language model should learn a representation that provides good performance on all kinds of NLP tasks. 

4) Domain : Different professions have different vocabulary; the words used by a lawyer differ from the words used by a bricklayer. Hence, a good language model should be able to handle all kinds of lexical fields without confusing them. 

5) Medium of Expression : human language varies greatly depending on whether you are talking to a friend or your boss, or writing an SMS or an email. A good language model should be able to deal with formal and colloquial language equally well.

6) Geography and Demography : human language can also vary greatly depending on the region in which it is spoken : London English vs Scottish English or Paris French vs Marseille French can differ greatly. Hence, a good language model should be able to deal with various regional differences. 

7) Time Period : finally, language also changes over time. Words and expressions evolve but a language model should be able to differentiate and handle old and modern use of language.

The authors note that they might have missed other dimensions, but this already provides a good start for a framework of what a good language model should strive for. They also point out that, in the current literature, dimensions like language, task, and domain are widely studied. However, dimensions like multilingualism, geography and demography, and time period receive less attention. Hence, there are many research opportunities in these directions.

Towards Zero-Label Language Learning (Z Wang, et al. 2021)

The biggest problem in machine learning at the moment is that you need a lot of annotated data to train models. While raw data is widely available, annotated data is expensive. Hence, this paper proposes a way to generate synthetic annotated data of similar quality to human-annotated data. To generate such annotated data, they ask a language model to generate an input for a given output. Thus, they can iteratively create a synthetic, annotated dataset. While simple, their method shows impressive performance on text classification and NLU. Although their method remains confined to a few NLP applications, it is a good start toward automated data labelling.
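The loop of generating an input for a given output can be sketched as follows. The `toy_generator` function is a placeholder that picks from hard-coded templates; in the paper this role is played by a large language model, and the labels, templates, and function names here are invented for illustration only.

```python
import random

def toy_generator(label, rng):
    """Placeholder for a language model: returns a text for the given label."""
    templates = {
        "positive": ["I loved this film.", "What a great experience."],
        "negative": ["I hated this film.", "What a terrible experience."],
    }
    return rng.choice(templates[label])

def build_synthetic_dataset(labels, n_per_label, generator, seed=0):
    """Create (text, label) pairs by generating texts conditioned on each label."""
    rng = random.Random(seed)
    dataset = []
    for label in labels:
        for _ in range(n_per_label):
            dataset.append((generator(label, rng), label))
    return dataset

dataset = build_synthetic_dataset(["positive", "negative"], 3, toy_generator)
for text, label in dataset:
    print(label, "|", text)
```

The resulting pairs can then be used like any human-annotated dataset to train a classifier, which is what makes the approach attractive when labels are expensive.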

Paradigm Shift in Natural Language Processing (T.Sun, et al. 2021)

This paper discusses the recent shifts in paradigm in NLP. A paradigm can be broadly understood as the current way of doing things. In NLP, most tasks used to fall under a few paradigms : 

  • Classification
  • Matching
  • Sequence labelling
  • Sequence to sequence
  • Machine Reading Comprehension 
  • Sequence-to-Action-Sequence 
  • Language modelling 

However, recently we’ve had more and more paradigm shifts. Specifically, classification, sequence labelling, and sequence-to-action-sequence are used less and less, while language modelling, machine reading comprehension, and sequence to sequence have gained in popularity. Language modelling in particular has become ubiquitous, especially because language models require less data and generalize well to many tasks. Matching remains relatively stable, although it is losing some popularity.
