
BLEU Score
Natural Language Processing
Machine Translation Evaluation

Understanding BLEU Score in NLP: Evaluating Translation Quality

Wed Mar 27 2024



BLEU (BiLingual Evaluation Understudy) is a metric used in natural language processing (NLP) and machine translation to evaluate the quality of generated text against one or more high-quality, human-generated reference translations. BLEU works by comparing n-grams (sequences of n consecutive words) between the generated text and the reference texts. It calculates precision: how many n-grams in the generated text match those in the reference text(s). The precision score is then modified by a brevity penalty to avoid favoring overly short translations. The BLEU score is known to correlate well with human judgment of translation quality.

The formula for calculating the BLEU score combines n-gram precision with a brevity penalty. Here's a simplified version of the formula:

BLEU = BP × (p₁ × p₂ × … × pₙ)^(1/n)

where:

  • BP is the brevity penalty, which accounts for the length of the generated text compared to the reference text(s).
  • n is the maximum n-gram order considered (usually 4).
  • p_i is the precision of the i-grams in the generated text against the reference text(s).

The precision p_i for each i-gram order is calculated by dividing the number of matching i-grams in the generated text by the total number of i-grams in the generated text. These precision values are multiplied together across all i-gram orders, and the product is raised to the power of 1/n, the reciprocal of the maximum n-gram order.
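The modified (clipped) precision described above can be sketched in a few lines of Python; the function names here are illustrative, not from any particular library. Clipping means each candidate n-gram is credited at most as many times as it appears in the best-matching reference, which stops degenerate outputs like "the the the the" from scoring highly:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference where it is most frequent."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the maximum count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

For example, with the candidate "the the the the" against the reference "the cat is on the mat", the unigram "the" appears only twice in the reference, so the clipped unigram precision is 2/4 = 0.5 rather than a perfect 4/4.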

The brevity penalty (BP) penalizes shorter translations by comparing the length of the generated text with the closest reference text in terms of length. It's calculated as:

BP = 1             if c ≥ r
BP = e^(1 − r/c)   if c < r

where:

  • c is the length of the generated text
  • r is the length of the closest reference text

This penalty prevents overly short translations from receiving disproportionately high scores. Keep in mind that this is a simplified explanation of the BLEU score formula. The actual computation might involve additional smoothing techniques or modifications for specific variations of BLEU used in different contexts.
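As a small sketch of the penalty just described (the function name and the zero-length guard are our own additions, not part of the official definition):

```python
import math

def brevity_penalty(candidate_len, reference_lens):
    """BP = 1 if the candidate is at least as long as the closest
    reference; otherwise exp(1 - r/c), which shrinks toward 0 as the
    candidate gets shorter."""
    c = candidate_len
    if c == 0:
        return 0.0  # guard: an empty candidate gets no credit
    # Effective reference length r: the one closest to c
    # (ties broken toward the shorter reference).
    r = min(reference_lens, key=lambda ref_len: (abs(ref_len - c), ref_len))
    if c >= r:
        return 1.0
    return math.exp(1 - r / c)
```

For instance, a 5-word candidate scored against a 10-word reference gets BP = e^(1 − 10/5) = e^(−1) ≈ 0.368, while any candidate at least as long as its closest reference gets BP = 1.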


The BLEU score ranges from 0 to 1, where 1 indicates a perfect match between the generated text and the reference text(s). Higher BLEU scores generally suggest better translation quality, but it's essential to consider its limitations, such as not accounting for semantic meaning or fluency. The BLEU score is not an absolute measure: comparing BLEU scores across passages, across languages, or even within the same language when different numbers of reference translations are used is not meaningful (the more reference translations there are, the more likely candidate n-grams are to match).
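Putting all the pieces together, here is a self-contained, minimal sentence-level BLEU sketch in pure Python (uniform weights, no smoothing, so a zero precision at any n-gram order makes the whole score zero — which is exactly why real toolkits add the smoothing techniques mentioned above). The function name echoes common toolkit naming but this is an illustration, not a drop-in replacement:

```python
import math
from collections import Counter

def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights 1/max_n.
    candidate: list of tokens; references: list of token lists."""
    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    # Clipped precision for each n-gram order 1..max_n.
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        if not cand:
            return 0.0  # candidate too short to have any n-grams
        best = Counter()
        for ref in references:
            for gram, cnt in ngram_counts(ref, n).items():
                best[gram] = max(best[gram], cnt)
        clipped = sum(min(cnt, best[g]) for g, cnt in cand.items())
        precisions.append(clipped / sum(cand.values()))

    if min(precisions) == 0:
        return 0.0  # no smoothing: any zero precision zeroes the score

    # Brevity penalty against the closest reference length.
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda rl: (abs(rl - c), rl))
    bp = 1.0 if c >= r else math.exp(1 - r / c)

    # Geometric mean of the precisions, times the brevity penalty.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to its reference scores 1.0; a partially overlapping candidate lands strictly between 0 and 1; a candidate sharing no words with any reference scores 0.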

The following interpretation, however, can be used to get a rough idea of the quality of the translations:

  BLEU score   Interpretation
  < 0.1        Almost useless
  0.1–0.19     Hard to get the gist
  0.2–0.29     The gist is clear, but has significant grammatical errors
  0.3–0.39     Understandable to good translations
  0.4–0.49     High-quality translations
  0.5–0.59     Very high-quality, adequate, and fluent translations
  ≥ 0.6        Quality often better than humans

It's worth noting that BLEU is just one of several metrics used to evaluate machine translation and text generation, and it's often used alongside other evaluation methods for a more comprehensive assessment of model performance.

Code Labs Academy © 2024 All rights reserved.