{{About|the evaluation metric for machine translation|other meanings|Bleu (disambiguation)}}
__NOTOC__
'''BLEU''' ('''Bilingual Evaluation Understudy''') is an algorithm for evaluating the quality of text which has been [[machine translation|machine-translated]] from one [[natural language]] to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" - this is the central idea behind BLEU.{{ref|Papineni2002a}}{{ref|Papineni2002b}} BLEU was one of the first [[Metric (mathematics)|metrics]] to achieve a high [[correlation]] with human judgements of quality,{{ref|Papineni2002b}}{{ref|Coughlin2003a}} and remains one of the most popular automated and inexpensive metrics.

Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole [[text corpus|corpus]] to reach an estimate of the translation's overall quality. Intelligibility and grammatical correctness are not taken into account.

BLEU is designed to approximate human judgement at a corpus level, and performs badly if used to evaluate the quality of individual sentences.

BLEU's output is always a number between 0 and 1. This value indicates how similar the candidate and reference texts are, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, however, since a score of 1 would require the candidate to be identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1. Because there are more opportunities to match, adding additional reference translations will increase the BLEU score.{{ref|Papineni2002a}}
==Algorithm==

BLEU uses a modified form of [[Precision (information retrieval)|precision]] to compare a candidate translation against multiple reference translations. The metric modifies simple precision since machine translation systems have been known to generate more words than are in a reference text. This is illustrated in the following example from Papineni et al. (2002).

{|class=wikitable
|+ Example of poor machine translation output with high precision
|-
| Candidate || the || the || the || the || the || the || the
|-
| Reference 1 || the || cat || is || on || the || mat
|-
| Reference 2 || there || is || a || cat || on || the || mat
|}
Of the seven words in the candidate translation, all of them appear in the reference translations. Thus the candidate text is given a unigram precision of,

:<math>P = \frac{m}{w_{t}} = \frac{7}{7} = 1</math>

where <math>m</math> is the number of words from the candidate that are found in the reference, and <math>w_{t}</math> is the total number of words in the candidate. This is a perfect score, despite the fact that the candidate translation above retains little of the content of either of the references.
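
For illustration, the unclipped unigram precision above can be reproduced with a short Python sketch (expository only; the function name <code>unigram_precision</code> is not part of any standard library):

<syntaxhighlight lang="python">
def unigram_precision(candidate, references):
    """Unclipped unigram precision: the fraction of candidate words that
    appear anywhere in at least one of the reference translations."""
    candidate_words = candidate.split()
    reference_words = {word for ref in references for word in ref.split()}
    matched = sum(1 for word in candidate_words if word in reference_words)
    return matched / len(candidate_words)

candidate = "the the the the the the the"
references = ["the cat is on the mat", "there is a cat on the mat"]
print(unigram_precision(candidate, references))  # 1.0, despite the degenerate candidate
</syntaxhighlight>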
The modification that BLEU makes is fairly straightforward. For each word in the candidate translation, the algorithm takes its maximum total count, <math>m_{max}</math>, in any of the reference translations. In the example above, the word "the" appears twice in reference 1, and once in reference 2. Thus <math>m_{max} = 2</math>.

For the candidate translation, the count <math>m_{w}</math> of each word is clipped to a maximum of <math>m_{max}</math> for that word. In this case, "the" has <math>m_{w} = 7</math> and <math>m_{max} = 2</math>, so <math>m_{w}</math> is clipped to 2. These clipped counts are then summed over all distinct words in the candidate, and the sum is divided by the total number of words in the candidate translation. In the above example, the modified unigram precision score would be:

:<math>P = \frac{2}{7}</math>
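
The clipping step can be sketched in the same way (again expository; <code>Counter</code> is Python's standard multiset type):

<syntaxhighlight lang="python">
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clip each candidate word's count to its maximum count in any single
    reference, then divide by the total number of candidate words."""
    candidate_counts = Counter(candidate.split())
    max_reference_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_reference_counts[word] = max(max_reference_counts[word], count)
    clipped = sum(min(count, max_reference_counts[word])
                  for word, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())

candidate = "the the the the the the the"
references = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_unigram_precision(candidate, references))  # 2/7, approximately 0.286
</syntaxhighlight>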
In practice, however, using individual words as the unit of comparison is not optimal. Instead, BLEU computes the same modified precision metric using [[n-gram]]s. The length which has the "highest correlation with monolingual human judgements"{{ref|Papineni2002c}} was found to be four. The unigram scores are found to account for the adequacy of the translation, that is, how much information is retained. The longer <math>n</math>-gram scores account for the fluency of the translation, or to what extent it reads like "good English".
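
The clipped precision generalizes directly from words to <math>n</math>-grams. The following sketch is illustrative only (real implementations add tokenization and smoothing details that are not shown here):

<syntaxhighlight lang="python">
from collections import Counter

def ngrams(words, n):
    """All contiguous n-grams of a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_ngram_precision(candidate, references, n):
    """Clipped n-gram precision, the quantity BLEU evaluates for n = 1 to 4."""
    candidate_counts = Counter(ngrams(candidate.split(), n))
    if not candidate_counts:
        return 0.0
    max_reference_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    clipped = sum(min(count, max_reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())

candidate = "the cat is on the mat"
references = ["the cat is on the mat", "there is a cat on the mat"]
for n in range(1, 5):
    # Prints 1.0 for every n, since the candidate matches reference 1 exactly.
    print(n, modified_ngram_precision(candidate, references, n))
</syntaxhighlight>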
Another problem with BLEU scores is that they tend to favor short translations, which can produce very high precision scores, even using modified precision. An example of a candidate translation for the same references as above might be:

:the cat

In this example, the modified unigram precision would be,

:<math>P = \frac{1}{2} + \frac{1}{2} = 1</math>

as the word 'the' and the word 'cat' appear once each in the candidate, and the total number of words is two. The modified bigram precision would be <math>1 / 1</math>, as the bigram "the cat" appears once in the candidate. It has been pointed out that precision is usually twinned with [[Recall (information retrieval)|recall]] to overcome this problem,{{ref|Papineni2002d}} as the unigram recall of this example would be <math>2 / 6</math> or <math>2 / 7</math>. The problem is that, as there are multiple reference translations, a bad translation could easily have an inflated recall, such as a translation which consisted of all the words in each of the references.{{ref|Papineni2002e}}
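
The unigram recall figure quoted above can be checked in a few lines (an illustration of the 2/6 value against reference 1 only):

<syntaxhighlight lang="python">
from collections import Counter

candidate_counts = Counter("the cat".split())
reference_counts = Counter("the cat is on the mat".split())

# Unigram recall against this single reference: matched reference words
# divided by the reference length.
matched = sum(min(count, candidate_counts[word])
              for word, count in reference_counts.items())
print(matched, "/", sum(reference_counts.values()))  # prints: 2 / 6
</syntaxhighlight>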
In order to produce a score for the whole corpus, the modified precision scores for the segments are combined using the [[geometric mean]], multiplied by a brevity penalty to prevent very short candidates from receiving too high a score. Let <math>r</math> be the total length of the reference corpus, and <math>c</math> the total length of the translation corpus. If <math>c \leq r</math>, the brevity penalty applies and is defined to be <math>e^{(1-r/c)}</math>. (In the case of multiple reference sentences, <math>r</math> is taken to be the sum of the lengths of the sentences whose lengths are closest to the lengths of the candidate sentences. However, in the version of the metric used by [[NIST (metric)|NIST]] evaluations prior to 2009, the shortest reference sentence had been used instead.)
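
The combination step can be outlined as follows (an expository sketch of the definition above; the function names are chosen for illustration, and the pre-2009 NIST shortest-reference variant is not shown):

<syntaxhighlight lang="python">
import math

def brevity_penalty(candidate_length, reference_length):
    """1 if the candidate corpus is longer than the reference corpus,
    otherwise e^(1 - r/c)."""
    if candidate_length > reference_length:
        return 1.0
    if candidate_length == 0:
        return 0.0
    return math.exp(1.0 - reference_length / candidate_length)

def bleu_from_precisions(precisions, candidate_length, reference_length):
    """Geometric mean of the modified n-gram precisions, scaled by the
    brevity penalty."""
    if min(precisions) == 0:
        return 0.0
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty(candidate_length, reference_length) * geometric_mean

# The two-word candidate "the cat" has perfect unigram and bigram precision,
# but the brevity penalty e^(1 - 6/2), approximately 0.135, pulls it down.
print(bleu_from_precisions([1.0, 1.0], candidate_length=2, reference_length=6))
</syntaxhighlight>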
iBLEU is an interactive version of BLEU that allows a user to visually examine the BLEU scores obtained by candidate translations. It also allows two different systems to be compared in a visual and interactive manner, which is useful for system development.{{ref|Madnani2011}}
==Performance==

BLEU has frequently been reported as correlating well with human judgement,{{ref|Papineni2002f}}{{ref|Coughlin2003b}}{{ref|Doddington2002a}} and remains a benchmark for the assessment of any new evaluation metric. There are, however, a number of criticisms that have been voiced. It has been noted that, although in principle capable of evaluating translations of any language, BLEU cannot in its present form deal with languages lacking word boundaries.{{ref|Denoual2005a}}

It has been argued that although BLEU has significant advantages, there is no guarantee that an increase in BLEU score is an indicator of improved translation quality.{{ref|Callison2006a}} Callison-Burch, Osborne and Koehn highlight two instances where BLEU seriously underperformed: the 2005 [[NIST]] evaluations,{{ref|Lee2005a}} in which a number of different machine translation systems were tested, and their own study of the [[SYSTRAN]] engine versus two engines using [[statistical machine translation]] (SMT) techniques.{{ref|Callison2006b}}

In the 2005 NIST MT evaluation, the scores generated by BLEU are reported to have failed to correspond to the scores produced in the human evaluations: the system which was ranked highest by the human judges was only ranked sixth by BLEU. In their own study, Callison-Burch, Osborne and Koehn compared SMT systems with SYSTRAN, a knowledge-based system, and found that the scores from BLEU for SYSTRAN were substantially worse than the scores given to SYSTRAN by the human judges. They note that the SMT systems were trained using BLEU minimum error rate training,{{ref|Och2004a}} and point out that this could be one of the reasons behind the difference. They conclude by recommending that BLEU be used in a more restricted manner: for comparing the results from two similar systems, and for tracking "broad, incremental changes to a single system".{{ref|Callison2006c}}

There is also an inherent, systemic problem with BLEU: in real life, sentences can be translated in many different ways, sometimes with no overlap at all. The approach of scoring a machine translation by how much it differs from a human translation is therefore flawed.
==See also==
* [[F1 Score|F-Measure]]
* [[NIST (metric)]]
* [[METEOR]]
* [[ROUGE (metric)]]
* [[Word error rate|Word Error Rate (WER)]]
* [[Noun-Phrase Chunking]]
* [[TER]] http://web.jhu.edu/sebin/w/e/terplusdorr.pdf
==Notes==
{{Refbegin}}
# {{note|Papineni2002a}} Papineni, K., et al. (2002)
# {{note|Papineni2002b}} Papineni, K., et al. (2002)
# {{note|Coughlin2003a}} Coughlin, D. (2003)
# {{note|Papineni2002c}} Papineni, K., et al. (2002)
# {{note|Papineni2002d}} Papineni, K., et al. (2002)
# {{note|Papineni2002e}} Papineni, K., et al. (2002)
# {{note|Papineni2002f}} Papineni, K., et al. (2002)
# {{note|Coughlin2003b}} Coughlin, D. (2003)
# {{note|Doddington2002a}} Doddington, G. (2002)
# {{note|Denoual2005a}} Denoual, E. and Lepage, Y. (2005)
# {{note|Callison2006a}} Callison-Burch, C., Osborne, M. and Koehn, P. (2006)
# {{note|Lee2005a}} Lee, A. and Przybocki, M. (2005)
# {{note|Callison2006b}} Callison-Burch, C., Osborne, M. and Koehn, P. (2006)
# {{note|Och2004a}} Lin, C. and Och, F. (2004)
# {{note|Callison2006c}} Callison-Burch, C., Osborne, M. and Koehn, P. (2006)
# {{note|Madnani2011}} Madnani, N. (2011)
{{Refend}}
==References==
{{refbegin}}
* {{cite conference | last1 = Papineni | first1 = K. | last2 = Roukos | first2 = S. | last3 = Ward | first3 = T. | last4 = Zhu | first4 = W. J. | year = 2002 | title = BLEU: a method for automatic evaluation of machine translation | id = {{citeseerx|10.1.1.19.9416}} | conference = ACL-2002: 40th Annual meeting of the Association for Computational Linguistics | pages = 311–318 }}
* Papineni, K., Roukos, S., Ward, T., Henderson, J. and Reeder, F. (2002) "Corpus-based Comprehensive and Diagnostic MT Evaluation: Initial Arabic, Chinese, French, and Spanish Results" in ''Proceedings of Human Language Technology 2002, San Diego'' pp. 132–137
* Callison-Burch, C., Osborne, M. and Koehn, P. (2006) "[http://www.cs.jhu.edu/~ccb/publications/re-evaluating-the-role-of-bleu-in-mt-research.pdf Re-evaluating the Role of BLEU in Machine Translation Research]" in ''11th Conference of the European Chapter of the Association for Computational Linguistics: EACL 2006'' pp. 249–256
* Doddington, G. (2002) "[http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf Automatic evaluation of machine translation quality using n-gram cooccurrence statistics]" in ''Proceedings of the Human Language Technology Conference (HLT), San Diego, CA'' pp. 128–132
* Coughlin, D. (2003) "[http://www.mt-archive.info/MTS-2003-Coughlin.pdf Correlating Automated and Human Assessments of Machine Translation Quality]" in ''MT Summit IX, New Orleans, USA'' pp. 23–27
* Denoual, E. and Lepage, Y. (2005) "[http://www.mt-archive.info/IJCNLP-2005-Denoual.pdf BLEU in characters: towards automatic MT evaluation in languages without word delimiters]" in ''Companion Volume to the Proceedings of the Second International Joint Conference on Natural Language Processing'' pp. 81–86
* Lee, A. and Przybocki, M. (2005) NIST 2005 machine translation evaluation official results
* Lin, C. and Och, F. (2004) "[http://www.mt-archive.info/ACL-2004-Lin.pdf Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics]" in ''Proceedings of the 42nd Annual Meeting of the Association of Computational Linguistics''
* Madnani, N. (2011) "[http://www.computer.org/portal/web/csdl/doi/10.1109/ICSC.2011.36 iBLEU: Interactively Scoring and Debugging Statistical Machine Translation Systems]{{dead link|date=July 2013}}" in ''Proceedings of the Fifth IEEE International Conference on Semantic Computing (Demos), Palo Alto, CA'' pp. 213–214
{{refend}}

{{good article}}

[[Category:Evaluation of machine translation]]