Definition
BLEU (BiLingual Evaluation Understudy) is a metric used in natural language processing (NLP) and machine translation to evaluate the quality of generated text against one or more high-quality reference translations. It measures how similar a machine-generated text is to those human-generated references.
BLEU works by comparing n-grams (sequences of n consecutive words) between the generated text and the reference texts. It calculates precision: how many n-grams in the generated text also appear in the reference text(s), with each n-gram counted at most as often as it appears in any single reference (so-called clipped precision). The precision score is then modified by a brevity penalty to avoid favoring overly short translations.
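To make the n-gram matching concrete, here is a minimal Python sketch of the clipped n-gram precision described above. The function names and the whitespace tokenization are illustrative choices for this example, not part of the BLEU definition:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the maximum count observed across the references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the cat sat on the mat".split()
reference = "there is a cat on the mat".split()
print(modified_precision(candidate, [reference], 1))  # unigram precision ≈ 0.667
print(modified_precision(candidate, [reference], 2))  # bigram precision = 0.4
```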
The BLEU score is known to correlate well with human judgment on translation quality.
The formula for calculating the BLEU score involves precision and a brevity penalty. Here's a simplified version of the formula:

BLEU = BP × (p_1 × p_2 × … × p_n)^(1/n)
Where:

- BP is the brevity penalty, which accounts for the length of the generated text compared to the reference text(s).
- n is the maximum n-gram order considered (usually 4).
- p_i is the precision of the i-grams shared between the generated text and the reference text(s).
The precision p_i for each i-gram order is calculated by dividing the number of matching i-grams in the generated text by the total number of i-grams in the generated text. These precision values are multiplied together across all i-gram orders, and the product is raised to the power of 1/n (the reciprocal of the maximum n-gram order), which is their geometric mean.
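As a quick worked example, with hypothetical per-order precision values chosen only for illustration, the geometric mean can be computed like this:

```python
import math

# Hypothetical per-order precisions (p_1 .. p_4), for illustration only.
precisions = [0.7, 0.5, 0.3, 0.2]
n = len(precisions)

# Geometric mean: multiply the precisions and take the n-th root,
# equivalently exp of the average log-precision.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / n)
print(round(geo_mean, 4))  # ≈ 0.3807
```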
The brevity penalty (BP) penalizes shorter translations by comparing the length of the generated text with the closest reference text in terms of length. It's calculated as:

BP = 1 if c > r, otherwise BP = e^(1 - r/c)
Where:

- c is the length of the generated text.
- r is the length of the closest reference text.
This penalty prevents overly short translations from receiving disproportionately high scores.
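Here is a minimal sketch of the brevity penalty and the final combination, reusing the illustrative geometric mean from above with hypothetical candidate and reference lengths:

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if the candidate is longer than the reference,
    otherwise exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# Illustrative values: the geometric mean computed earlier, a 9-token
# candidate, and an 11-token closest reference.
geo_mean = 0.3807
bp = brevity_penalty(9, 11)
bleu = bp * geo_mean
print(round(bp, 4), round(bleu, 4))  # ≈ 0.8007, ≈ 0.3048
```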
Keep in mind that this is a simplified explanation of the BLEU score formula. The actual computation might involve additional smoothing techniques or modifications for specific variations of BLEU used in different contexts.
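In practice, established libraries handle these details, including smoothing. As a minimal sketch, assuming NLTK is installed, its sentence_bleu function with one of its built-in smoothing methods can be used like this:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "there is a cat on the mat".split()
candidate = "the cat sat on the mat".split()

# Default weights (0.25, 0.25, 0.25, 0.25) correspond to n = 4.
# Smoothing avoids a zero score when some higher-order n-grams never match.
smoother = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smoother)
print(round(score, 4))
```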
Interpretation
The BLEU score ranges from 0 to 1, where 1 indicates a perfect match between the generated text and the reference text(s). Higher BLEU scores generally suggest better translation quality, but it's essential to consider its limitations, such as not accounting for semantic meaning or fluency.
The BLEU score is not an absolute measure: comparing BLEU scores across passages, across languages, or even within the same language when different numbers of reference translations are used (the more references, the more likely candidate n-grams are to match) is not meaningful.
The following interpretation can, however, be used to get a rough idea of the quality of the translations:
| BLEU | Interpretation |
|---|---|
| < 0.1 | Almost useless |
| 0.1-0.19 | Hard to get the gist |
| 0.2-0.29 | The gist is clear, but has significant grammatical errors |
| 0.3-0.39 | Understandable to good translations |
| 0.4-0.49 | High-quality translations |
| 0.5-0.59 | Very high quality, adequate, and fluent translations |
| ≥ 0.6 | Quality often better than humans |
It's worth noting that BLEU is just one of several metrics used to evaluate machine translation and text generation, and it's often used alongside other evaluation methods for a more comprehensive assessment of model performance.
Code Labs Academy’s Data Science & AI Bootcamp equips you with the skills to build, deploy, and refine machine learning models, preparing you for a world where AI is revolutionizing industries.