As a data professional, building models is routine; what differs is what each model is for. Models should solve specific challenges, and once they are built we need to measure their quality and performance using evaluation metrics, which are essential for confirming how well the models we build actually work.
Evaluation metrics are used to measure the quality of a statistical or machine learning model.
This article was originally published on the Neurotech Africa blog.
Why do we need evaluation?
The aim of building AI solutions is to apply them to real-world challenges. Mind you, our real world is complicated, so how do we decide which model to use and when? That is where evaluation metrics come into play.
If you cannot justify why you are choosing one model over others, or why a certain model is good or not, it indicates that you do not fully understand the problem you are solving or the model you built.
"When you can measure what you are speaking of and express it in numbers, you know that on which you are discussing. But when you cannot measure it and express it in numbers, your knowledge is of a very meager and unsatisfactory kind." ~ Lord Kelvin
Today, let's get a sense of the metrics used in Natural Language Processing challenges.
Textual Evaluation Metrics
In the Natural Language Processing (NLP) field, it is difficult to measure the performance of models across different tasks. Challenges with well-defined labels are easier to evaluate, but in many NLP tasks the ground truth or expected result can vary.
We have lots of downstream tasks such as text or sentiment analysis, language generation, question answering, text summarization, text recognition, and translation.
It is possible for biases to creep into models through the dataset or the evaluation criteria. Therefore it is necessary to establish standard performance benchmarks for evaluating models on NLP tasks. These performance metrics give us an indication of which model is better for which task.
Let's jump right in to discuss some of the textual evaluation metrics 😊
Accuracy: a common metric in sentiment analysis and classification. It is not the best metric in every situation, but it denotes the fraction of predictions the model gets right out of all the predictions it makes. It is best used when the output variable is categorical or discrete, for example, how often a sentiment classification algorithm is correct.
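For illustration, accuracy can be computed directly as the fraction of matching predictions; the sentiment labels below are made up.

```python
# A minimal sketch: accuracy as the fraction of predictions that match the gold labels.
# The sentiment labels are hypothetical (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 5 correct out of 6 predictions -> ~0.83
```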
Confusion Matrix: also used in classification challenges. It provides a clear report on the model's predictions across the different categories, and from this visualization the following questions can be answered:
- What percentage of the predicted positive class is actually positive? (Precision)
- What percentage of the actual positive class gets captured by the model? (Recall)
- What percentage of all predictions are correct? (Accuracy)
Precision and recall are complementary metrics with an inverse relationship. If both are of interest to us, we'd use the F1 score to combine precision and recall into a single metric.
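Here is a minimal sketch of these quantities using scikit-learn (assuming it is installed); the labels are made up purely for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical gold labels and model predictions for a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted class
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```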
Perplexity: a probabilistic measure used to evaluate exactly how confused our model is. It is typically used to evaluate language models, but it can also be used in dialog generation tasks.
Perplexity reflects how close machine-generated text is to the way humans write it. In other words, given the previous w tokens, how good is the model's score for the (w+1)-th token? The lower the perplexity, the better the model.
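As a rough illustration, perplexity is the exponential of the average negative log-likelihood the model assigns to each token; the token probabilities below are hypothetical, not taken from a real model.

```python
import math

# Hypothetical probabilities P(w_t | w_1..w_{t-1}) assigned by a language model
token_probs = [0.25, 0.10, 0.50, 0.05]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(perplexity)  # lower is better: the model is less "confused" by the text
```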
Take your time to explore this article on Perplexity in Language Models for a deeper look at the perplexity evaluation metric.
Bits-per-character (BPC) and bits-per-word: other metrics often used for language model evaluation. BPC measures exactly the quantity it is named after: the average number of bits needed to encode one character.
“if the language is translated into binary digits (0 or 1) in the most efficient way, the entropy is the average number of binary digits required per letter of the original language." ~ Shannon
Entropy is the average number of BPC. The reason that some language models report both cross entropy loss and BPC is purely technical.
In practice, if everyone uses a different base, it is hard to compare results across models. For the sake of consistency, when we report entropy or cross-entropy, we report the values in bits.
Mind you, BPC is specific to character-level language models. When we have word-level language models, the quantity is called bits-per-word (BPW): the average number of bits required to encode a word.
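As a small sketch of the bookkeeping: if a character-level model reports its average per-character cross-entropy in nats (the natural-log base most frameworks use), dividing by ln(2) converts it to bits-per-character; the loss value here is hypothetical.

```python
import math

cross_entropy_nats = 1.2                 # hypothetical per-character loss in nats
bpc = cross_entropy_nats / math.log(2)   # 1 nat = 1/ln(2) bits, so this is bits-per-character
print(bpc)

# For a word-level model, the same conversion on a per-word loss gives bits-per-word (BPW).
```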
General Language Understanding Evaluation (GLUE): this is a multi-task benchmark based on different types of tasks rather than evaluating a single task. As language models are increasingly being used for the purposes of transfer learning to other NLP tasks, the intrinsic evaluation of a language model is less important than its performance on downstream tasks.
Super General Language Understanding Evaluation (SuperGLUE): methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. SuperGLUE is an improved version of the GLUE benchmark, with a new set of more difficult language understanding tasks and improved resources, introduced after performance on GLUE came close to the level of non-expert humans.
It comprises new ways to test creative approaches on a range of difficult NLP tasks, including sample-efficient, transfer, multitask, and self-supervised learning.
BiLingual Evaluation Understudy (BLEU): commonly used in machine translation and caption generation. Since manual evaluation by professional translators is very expensive, this metric compares a candidate translation (produced by a machine) to one or more reference translations (produced by a human being). The output lies in the range 0 to 1, where a score closer to 1 indicates a good-quality translation.
The calculation of BLEU involves the concept of n-gram precision and sentence brevity penalty.
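For a quick illustration, the snippet below scores one candidate sentence against a single reference with NLTK's sentence_bleu; the sentences are made up, and smoothing is applied so that missing higher-order n-grams do not zero out the score.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # one (or more) human reference translations
candidate = ["the", "cat", "sat", "on", "the", "mat"]    # machine translation output

score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(score)  # closer to 1 means more n-gram overlap with the reference
```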
This metric has some drawbacks: it doesn't consider meaning, it doesn't directly consider sentence structure, and it doesn't handle morphologically rich languages well.
Rachael Tatman wrote an amazing article about BLEU just take your time to read it here.
Self-BLEU: this is a smart use of the traditional BLEU metric for capturing and quantifying diversity in the generated text.
The lower the Self-BLEU score, the higher the diversity in the generated text. Long text generation tasks such as story generation and news generation are a good fit for this metric, since it helps evaluate redundancy and monotonicity in the model. It can be complemented with other text generation evaluation metrics that account for the goodness and relevance of the generated text.
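A rough sketch of Self-BLEU, assuming NLTK: each generated sentence is treated as the hypothesis and all the other generated sentences as its references, and the per-sentence BLEU scores are averaged; the generations are made up.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

generated = [
    "the market opened higher today".split(),
    "the market opened slightly higher today".split(),
    "stocks fell sharply after the announcement".split(),
]

smooth = SmoothingFunction().method1
scores = []
for i, hypothesis in enumerate(generated):
    references = generated[:i] + generated[i + 1:]  # every other generated sentence
    scores.append(sentence_bleu(references, hypothesis, smoothing_function=smooth))

print(sum(scores) / len(scores))  # lower Self-BLEU means more diverse generations
```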
Metric for Evaluation of Translation with Explicit ORdering (METEOR): an F-score-based metric, combining unigram precision and recall, used to measure the quality of generated text. It is a sort of more robust BLEU: it allows synonyms and stemmed words to be matched with the reference words, and it is mainly used in machine translation.
METEOR addresses two drawbacks of BLEU: not taking recall into account and only allowing exact n-gram matching. METEOR first performs exact word matching, followed by stemmed-word matching, and finally synonym and paraphrase matching, and then computes the F-score using this relaxed matching strategy.
Although METEOR only considers unigram matches as opposed to n-gram matches, it seeks to reward longer contiguous matches using a penalty term known as the fragmentation penalty.
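A minimal sketch with NLTK's meteor_score (it relies on the WordNet corpus, which needs to be downloaded, and recent NLTK versions expect pre-tokenized input); the sentences are illustrative.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # required for synonym matching

reference = ["the", "cat", "sat", "on", "the", "mat"]
hypothesis = ["the", "cat", "was", "sitting", "on", "the", "mat"]

# Exact matches first, then stems, then synonyms/paraphrases, combined into an F-score
print(meteor_score([reference], hypothesis))
```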
BERTScore: this is an automatic evaluation metric used for testing the goodness of text generation systems. Unlike existing popular methods that compute token-level syntactical similarity, BERTScore focuses on computing semantic similarity between tokens of reference and hypothesis.
Using contextualized embeddings from BERT (Bidirectional Encoder Representations from Transformers), it computes the cosine similarity of each hypothesis token 𝑗 with each token 𝑖 in the reference sentence. It uses a greedy matching approach instead of a time-consuming best-case matching approach and then computes the F1 measure.
BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics.
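A minimal sketch, assuming the bert-score package is installed (it downloads a pretrained model on first use); the sentences are made up.

```python
from bert_score import score

candidates = ["the weather is cold today"]
references = ["it is freezing today"]

# Returns precision, recall, and F1 tensors computed from contextual embeddings
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```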
Character Error Rate (CER): this is a common metric of the performance of an automatic speech recognition system. This value indicates the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system with a CER of 0 being a perfect score.
CER can be used to measure performance on tasks such as speech recognition, optical character recognition (OCR), and handwriting recognition.
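A minimal sketch, assuming the jiwer package and its cer() helper, which computes the character-level edit distance normalized by the length of the reference; the strings are illustrative.

```python
from jiwer import cer

reference = "natural language processing"
hypothesis = "natural langage processing"  # one character dropped by a hypothetical OCR/ASR system

print(cer(reference, hypothesis))  # fraction of characters inserted, deleted, or substituted
```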
Word Error Rate (WER): this is a common performance metric mainly used for speech recognition, optical character recognition (OCR), and handwriting recognition.
When recognizing speech and transcribing it into text, some words may be left out or misinterpreted. WER compares the predicted output and the reference transcript word by word to figure out the number of differences between them.
There are three types of errors considered when computing WER:
- Insertions: when the predicted output contains additional words that are not present in the transcript (for example, SAT becomes essay tea).
- Substitutions: when the predicted output contains some misinterpreted words that replace words in the transcript (for example, noose is transcribed as moose).
- Deletions: when the predicted output doesn't contain words that are present in the transcript (for example, turn it around becomes turn around).
For understanding let's consider the following reference transcript and predicted output:
- Reference transcript: “Understanding textual evaluation metrics is awesome for a data professional”.
- Predicted output: “Understanding textual metrics is great for a data professional”.
In this case, the predicted output has one deletion (the word “evaluation” disappears) and one substitution (“awesome” becomes “great”).
So, what is the Word Error Rate of this transcription? Basically, WER is the number of errors divided by the number of words in the reference transcript.
WER = (num inserted + num deleted + num substituted) / num words in the reference
Thus, in our example:
WER = (0 + 1 + 1) / 10 = 0.2
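The same result can be reproduced with a small word-level edit-distance function; this is a bare-bones sketch rather than a production implementation.

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[len(ref)][len(hyp)] / len(ref)

reference = "Understanding textual evaluation metrics is awesome for a data professional"
prediction = "Understanding textual metrics is great for a data professional"
print(wer(reference, prediction))  # 0.2
```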
Lower WER often indicates that the Automated Speech Recognition (ASR) software is more accurate in recognizing speech. A higher WER, then, often indicates lower ASR accuracy.
The drawback is that it assumes the impact of different errors is the same, whereas in practice an insertion error may have a bigger impact than a deletion. Another limitation is that this metric cannot distinguish a substitution error from a combined deletion and insertion error.
Recall-Oriented Understudy for Gisting Evaluation (ROUGE): this is Recall based, unlike BLEU which is Precision based. ROUGE metric includes a set of variants: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. ROUGE-N is similar to BLEU-N in counting the 𝑛-gram matches between the hypothesis and reference.
This is a set of metrics used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference, or a set of references, produced by a human.
Mind you, ROUGE is particularly useful in summarization tasks, where it's important to evaluate how many of the reference words a model can recall (recall = true positives as a fraction of true positives plus false negatives).
Feel free to check out the python package here.
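As a rough sketch, assuming the rouge-score package mentioned above; the summaries are made up for illustration.

```python
from rouge_score import rouge_scorer

reference_summary = "the cat sat on the mat all afternoon"
generated_summary = "the cat was on the mat"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
print(scores["rouge1"].recall)  # how much of the reference summary the generation recovers
```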
Final Thoughts:
Understanding which performance measure to use, and which one best fits the problem at hand, helps validate that the solution meets the needs of the particular challenge.
The challenge with NLP solutions lies in measuring their performance across various tasks. In many other machine learning tasks, performance is easier to measure because the cost function or evaluation criteria are well defined and give a clear picture of what is to be evaluated.
One more reason for this is that labels are well defined in other tasks, whereas in NLP tasks the ground truth can vary a lot. Coming up with the best model depends on various factors, but the evaluation metric is an essential one to consider, depending on the nature of the task you are solving.
References:
- Evaluation Metrics for Language Modeling
- Evaluating Text Output in NLP: BLEU at your own risk
- Evaluation of Text Generation: A Survey
- Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text
- Evaluating Text Generation with BERT
- Automated metrics for evaluating the quality of text generation