Word Fairness

Evaluating LMs on word frequency distribution (2020)

Fairness in word type prediction

This research aimed to investigate the relationship between word frequency and prediction accuracy in a word prediction rate. Or how well do LM predict less frequent words. This question is an offshoot of a longer work that formed my thesis that is aimed at overcoming the blindspot of LM where the less they are trained (or fine-tuned) with certain events the harder they would do in predicting them – which is the long-tail problem.

What’s going on in this work

This work is focused on comparing performances on what were then SOTA models on a word prediction task, which means there is a clear output expected to be predicted based on the reference. The technical challenge was to write a Depth-First search component in order to compare only complete words to the reference word (and this is what the repo is offering), as the subword units and methods vary across different LM.


In the first part we found that not only are models less accurate in terms of predicting correctly a word on lower frequency bands, they are also less diverse. In the final part of the paper we can also see that models struggle assigning the right meaning to low frequency words.

You can read more about it (Dudy & Bedrick, 2020) or watch it here, but I have to warn you the sound is not the best ;-)

A paraphrasng task showing that LM relatively fail to correctly score paraphrases when they involve rare words



  1. type_coverage.png
    Are Some Words Worth More than Others?
    Shiran Dudy, and Steven Bedrick
    In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, 2020