Edit Distance: Not a Miracle Cure

With the current rapid developments in Neural Machine Translation (NMT), discussions on its market impact are gathering pace, particularly around post-editing. While some suggest that paying by the hour, rather than the traditional per-word rate, is the way to go, others feel that translators should only be paid for actual changes to the MT output. This second option is usually put into practice with what’s known as an edit distance metric.
Edit distance metrics have also been suggested as a replacement for, or a complement to, BLEU (bilingual evaluation understudy), an algorithm for evaluating the quality of machine-translated text that is used to assess MT quality and thereby support MT engine development. These discussions are not new; however, edit distances have not yet become mainstream. Why? The apparent simplicity of this solution hides a lot of complications. Let’s take a closer look.

What is edit distance?

There are various ways to measure edit distance, with Levenshtein distance and TER(p) among the best-known. In essence, all edit distance metrics work on the same principle: they measure the minimal number of edits necessary to change one string (in this case, the MT output) into another string (in this case, the final translation). An edit can be an insertion, a deletion or a substitution.

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten → sitten (substitution of "s" for "k")
  2. sitten → sittin (substitution of "i" for "e")
  3. sittin → sitting (insertion of "g" at the end)

For many purposes, it makes a difference whether there are 3 edits in a string of 3 words or in a string of 13 words. Hence the number of edits is often divided by the string length (for instance, the length of the longer string), which yields a number between 0 and 1 (usually displayed as a number between 0 and 100). That makes edit distances more comparable among strings of different lengths.
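
The kitten/sitting example and the normalization described above can be reproduced with a few lines of Python. This is a minimal sketch of character-level Levenshtein distance, not the implementation used by any particular tool; the function names are made up for illustration, and dividing by the length of the longer string is one assumed choice of normalizer among several.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimal number of single-character insertions, deletions and substitutions."""
    # previous[j] is the distance between the processed prefix of `source`
    # and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(source: str, target: str) -> float:
    """Edit distance divided by the longer string's length (an assumed normalizer)."""
    longest = max(len(source), len(target))
    return levenshtein(source, target) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))                       # 3
print(round(100 * normalized_distance("kitten", "sitting")))  # 43
```

Metrics such as TER work on words rather than characters and normalize differently, so their scores are not directly comparable with the numbers above.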

Can edit distance help to measure MT quality?

Let’s take a look at the scenario of improving MT engines. To do so, developers need a way to tell which of a set of engines with different settings or technologies performs best. To simplify the task, we look at a post-editing use case only. Let’s assume that the post-editors have made all the necessary changes to the MT output, and only those. On that basis, it can be assumed that the engine that required the fewest edits for the same set of source sentences is the best suited to the post-editing use case.

However, for practical purposes, this may not help much for MT development, especially as a replacement for the BLEU score. The problem here is that for rapid engine development, tests must be quickly repeatable and automated. This only works if you have a frozen set of gold standard translations to which you can compare new MT output automatically. However, most source sentences can have more than one valid translation. A good post-editor will go for the one that is closest to the MT output to minimize the editing effort. Freezing one set of final translations as the gold standard penalizes MT output from a different system that would require similar or less post-editing (PE) effort but would result in a different final translation.

An artificial example can illustrate the principle:

  • Gold standard translation: The cat bit the dog.
  • MT1: The cat hit the dog. Edit distance to gold standard: 1.
  • PE1: The cat bit the dog. Edit distance from MT1 to PE1: 1.
  • MT2: The dog was bitten by the cat. Edit distance to gold standard: 16.
  • PE2: The dog was bitten by the cat. Edit distance from MT2 to PE2: 0.

MT2 is closer to the original meaning than MT1, and would require less post-editing effort in a live scenario, but is penalized because it chooses a different construction than the gold standard.
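
The effect can be checked mechanically. The sketch below scores both engines from the example twice: once against the frozen gold standard and once against each engine's own post-edited output. It assumes a character-level Levenshtein distance; the helper names (`levenshtein`, `score_engines`) and the data layout are illustrative only.

```python
def levenshtein(source: str, target: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (s_char != t_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


GOLD = "The cat bit the dog."

# Each engine's raw output paired with the post-edited version of that output.
ENGINES = {
    "MT1": ("The cat hit the dog.", "The cat bit the dog."),
    "MT2": ("The dog was bitten by the cat.", "The dog was bitten by the cat."),
}


def score_engines(engines, gold):
    """Per engine: distance to the frozen gold standard and to its own post-edit."""
    return {
        name: {
            "to_gold": levenshtein(output, gold),
            "to_post_edit": levenshtein(output, post_edit),
        }
        for name, (output, post_edit) in engines.items()
    }


scores = score_engines(ENGINES, GOLD)
best_by_gold = min(scores, key=lambda name: scores[name]["to_gold"])
best_by_post_edit = min(scores, key=lambda name: scores[name]["to_post_edit"])
print(best_by_gold)       # MT1
print(best_by_post_edit)  # MT2
```

The two rankings disagree: the frozen gold standard favors MT1, while the distance to each engine's own post-edit favors MT2, the engine that needed no edits at all.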

In the linguistic sense, edit distance provides most information when new MT output is post-edited every time. While this is technically possible, it is so labor-intensive and time-consuming that it is not a practical solution for the purpose of engine development.

Can edit distance be used to measure post-editing effort?

Another way to use edit distance is as an indication of post-editing effort. The reasoning is that fewer edits indicate less effort on the post-editor’s part, and hence the payment should be lower, so that only the work actually done is paid for. This is not entirely true though, which becomes clear when the translation and post-editing tasks are broken down into their parts:


When the MT output is closer to a good translation, fewer edits are necessary to arrive at the final translation than when the final translation has to be typed from scratch. That means that the “typing” step marked in red can be done faster. However, all other steps in the overall task still need doing, and the yellow step is in fact added. If post-editing needs only 7 edits where translating from scratch needs 14 typing actions, this does not mean that post-editing overall is twice as fast. It only means that the typing task entails 50% fewer keystrokes. The overall time saving would be less than 50%, but how much less cannot be determined from the edit distance alone.
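
A back-of-the-envelope calculation shows why. The sketch below is a deliberately simple model, not a measured one: it assumes that typing time is proportional to the number of keystrokes, and both the share of time spent typing and the extra time needed to read and assess the MT output are hypothetical inputs that edit distance does not provide.

```python
def overall_time_saving(typing_share: float,
                        keystroke_reduction: float,
                        added_review_share: float = 0.0) -> float:
    """Fraction of the from-scratch translation time saved by post-editing.

    typing_share:        share of from-scratch time spent typing (hypothetical)
    keystroke_reduction: share of keystrokes saved, e.g. 0.5 for 7 instead of 14
    added_review_share:  extra time spent assessing the MT output, expressed
                         as a share of from-scratch time (hypothetical)
    """
    return typing_share * keystroke_reduction - added_review_share


keystroke_reduction = 1 - 7 / 14  # the 7-versus-14 example from the text

# The same edit distance leads to very different overall savings,
# depending on how much of the job is typing in the first place.
for typing_share in (0.2, 0.4, 0.6):
    saving = overall_time_saving(typing_share, keystroke_reduction,
                                 added_review_share=0.05)
    print(f"typing share {typing_share:.0%}: overall saving {saving:.0%}")
```

With these made-up shares the overall saving ranges from 5% to 25%, all well below the 50% reduction in keystrokes.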

Another example may clarify that. Let’s assume that work is being done for a client in the life sciences industry with a customized MT system. A typical source may look like this:

CELLine™ 1000, a membrane-based, disposable cell cultivation system, guarantees high cell densities and is easy to use for recombinant protein expression and high yield monoclonal antibody (MAb) production.

Given the need for exactness in this domain, the post-editor will need to validate all the technical terms in this segment against the TM and/or the term base. If the MT system provides the correct translation, no changes are necessary and the edit distance will be 0. Does that mean that no work has been done on this segment?

The proposal to pay for post-editing by edit distance is based on a misconception: the idea that post-editing (and translation in general) is nothing more than typing. The key skill is to know what to type, and as a text becomes more specialized, the post-editor’s time is increasingly spent on validating meaning and less on actual keystrokes.

Post-editing effort is determined by a combination of the domain, the post-editor’s experience, the MT output quality and the client’s demands on the final delivery. Edit distance can only measure a part of this effort; it is not a one-to-one indicator of total work done.

Of course there are ways to work around this restriction (e.g., add a fixed cost per word, pay a premium for certain domains, etc.), but such additional workarounds make the system more complex, and there is no industry-wide agreement on them yet.
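
To make the added complexity concrete, here is one way such a workaround could be written down. It is purely a sketch: the formula, the parameter names and all rates are hypothetical and do not reflect any agreed industry practice.

```python
def post_editing_fee(word_count: int,
                     normalized_distance: float,
                     base_rate_per_word: float,
                     edit_rate_per_word: float,
                     domain_premium: float = 1.0) -> float:
    """Hypothetical fee: a fixed per-word part plus a part scaled by edit distance.

    normalized_distance is expected to lie between 0 and 1.
    domain_premium is a multiplier, e.g. 1.5 for a specialized domain.
    """
    per_word = base_rate_per_word + edit_rate_per_word * normalized_distance
    return word_count * per_word * domain_premium


# A 100-word text that needed no edits at all still earns the fixed part,
# which covers reading and validating the MT output.
print(post_editing_fee(100, 0.0, base_rate_per_word=0.02, edit_rate_per_word=0.08))

# The same text in a specialized domain, with a 50% premium.
print(post_editing_fee(100, 0.0, base_rate_per_word=0.02, edit_rate_per_word=0.08,
                       domain_premium=1.5))
```

Even this small formula has three parameters that would have to be negotiated per client, domain and language pair.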

While MT technology has no direct bearing on evaluation standards and metrics, the quality improvements that we are seeing with NMT have put existing evaluation methods back under scrutiny. Edit distance metrics are a great tool to measure the delta between MT output and final translation, but in order to draw any conclusions from these numbers, their context and conditions need to be fully understood. Edit distances are not the be-all and end-all on their own, but work best when embedded in a carefully designed process, which is not that different from BLEU scores, really.