
What is Statistical Machine Translation (SMT)?

Statistical Machine Translation (SMT) learns how to translate by analyzing existing human translations (known as bilingual text corpora). In contrast to the Rules-Based Machine Translation (RBMT) approach, which is usually word-based, most modern SMT systems are phrase-based and assemble translations from overlapping phrases. Phrase-based translation aims to reduce the restrictions of word-based translation by translating whole sequences of words, where the source and target sequences may differ in length. These sequences are called phrases, but they are typically not linguistic phrases; they are sequences found in bilingual text corpora using statistical methods.
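As a rough illustration of phrase-based translation, the Python sketch below translates a sentence by looking up word sequences of varying length in a small phrase table. The table entries, the probabilities, and the names PHRASE_TABLE, MAX_PHRASE_LEN and translate are invented for this example and do not come from Language Studio or Moses. The greedy longest-match strategy is a deliberate simplification; real SMT decoders search over many competing segmentations.

```python
# Toy phrase table: source phrase (tuple of words) -> [(target phrase, probability)].
# All entries and probabilities are invented for illustration only.
PHRASE_TABLE = {
    ("guten", "morgen"): [("good morning", 0.9), ("morning", 0.1)],
    ("guten",): [("good", 0.8)],
    ("morgen",): [("tomorrow", 0.6), ("morning", 0.4)],
    ("wie", "geht", "es", "ihnen"): [("how are you", 0.7)],
}
MAX_PHRASE_LEN = 4  # longest source phrase we try to match


def translate(sentence):
    """Greedy longest-match phrase translation (a simplification of real decoding)."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest possible source phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            candidates = PHRASE_TABLE.get(tuple(words[i:i + n]))
            if candidates:
                # Pick the target phrase with the highest probability.
                best_target = max(candidates, key=lambda c: c[1])[0]
                output.append(best_target)
                i += n
                break
        else:
            # No phrase matched: pass the unknown word through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)


if __name__ == "__main__":
    print(translate("Guten Morgen"))
    print(translate("Wie geht es Ihnen"))
```

Note that the two-word source phrase "guten morgen" is translated as a unit, while "morgen" on its own maps to a different target phrase. Handling context in this way is what separates phrase-based from word-by-word translation.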

Analysis of bilingual text corpora (source and target languages) and monolingual corpora (target language) generates statistical models that transform text from one language to another. The statistical weights in these models are used to decide the most likely translation.
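The way statistical weights select the most likely translation can be sketched as follows. The example assumes a weighted combination of two scores: a translation-model probability, which would be learned from bilingual corpora, and a bigram language-model probability estimated from a tiny monolingual corpus. The corpus, the candidate probabilities, the weights, and every function name are hypothetical; this is a minimal sketch under those assumptions, not a description of how any particular engine is implemented.

```python
import math
from collections import Counter

# Tiny monolingual target-language corpus (invented for illustration).
MONO_CORPUS = [
    "see you tomorrow",
    "good morning",
    "see you tomorrow morning",
    "see you soon",
]


def train_bigram_lm(sentences):
    """Count unigrams and bigrams; return them with the vocabulary size."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])  # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab = {t for s in sentences for t in s.split()} | {"</s>"}
    return unigrams, bigrams, len(vocab)


def lm_logprob(sentence, lm):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    unigrams, bigrams, vocab_size = lm
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens[:-1], tokens[1:])
    )


def score(candidate, tm_prob, lm, w_tm=1.0, w_lm=1.0):
    """Weighted sum of translation-model and language-model log-probabilities."""
    return w_tm * math.log(tm_prob) + w_lm * lm_logprob(candidate, lm)


def best_translation(candidates, lm, w_tm=1.0, w_lm=1.0):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=lambda c: score(c, candidates[c], lm, w_tm, w_lm))


if __name__ == "__main__":
    lm = train_bigram_lm(MONO_CORPUS)
    # Hypothetical candidates with equal translation-model probability.
    candidates = {"see you morning": 0.5, "see you tomorrow": 0.5}
    print(best_translation(candidates, lm))
```

With equal translation-model probabilities, the language model built from the monolingual corpus breaks the tie in favour of the more fluent candidate. Changing w_tm and w_lm shifts how much each model influences the final choice.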

Clean Data Machine Translation, and machine translation in general, requires very large bilingual corpora. One of the major advantages of the Language Studio platform is its extensive collection of bilingual corpora, known as Industry Data, covering more than 600 language pairs; a set of 14 Industry Domains is progressively being added to each language pair. When combined with ADM, the data requirements are greatly reduced and it is even possible to create custom engines when no bilingual corpora are available.

Language Studio uses the Clean Data Machine Translation Model as its data approach and has Deep Neural Machine Translation (Deep NMT) and Statistical Machine Translation (SMT) technologies at its core. It provides a Hybrid Machine Translation Model that integrates the strengths of traditional SMT with Deep NMT to produce notably higher-quality translations than either technology alone.

At Omniscien, we firmly believe that SMT still has a considerable role to play in the machine translation space. SMT produces higher-quality translations than NMT in a number of scenarios. For this reason, we have created a unique Hybrid NMT/SMT model that leverages the strengths of both technologies.

Behind the design of many of the tools is Omniscien’s Chief Scientist, Professor Philipp Koehn, who leads our team of researchers and developers. Philipp is a pioneer in the machine translation space. His books on Statistical Machine Translation and Neural Machine Translation are the leading academic textbooks globally on machine translation. Both books are available now from leading book stores.

Professor Philipp Koehn,
Chief Scientist,
Omniscien Technologies.

Professor Koehn has also been the instigator of, and driving force behind, the popular open-source Moses Statistical Machine Translation software. Philipp, along with a large number of others in the open-source community, has built Moses into a robust system that is the de facto platform for SMT research. The Moses website includes a great article on the background of SMT.