Statistical Machine Translation (SMT) learns how to translate by analyzing existing human translations (known as bilingual text corpora). In contrast to the Rule-Based Machine Translation (RBMT) approach, which typically operates word by word, most modern SMT systems are phrase-based and assemble translations from overlapping phrases. Phrase-based translation aims to reduce the restrictions of word-based translation by translating whole sequences of words, where the source and target sequences may differ in length. These sequences are called phrases, but they are typically not linguistic phrases; rather, they are phrases extracted from bilingual text corpora using statistical methods.
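The idea of translating variable-length phrases rather than single words can be sketched as follows. This is a minimal toy illustration, not any real system's implementation: the phrase table, the language pair, and the greedy longest-match segmentation are all assumptions made for the example.

```python
# Hypothetical Spanish-to-English phrase table, of the kind extracted
# statistically from bilingual corpora. Note that a source phrase and its
# target phrase may differ in length, and a multi-word phrase can capture
# local reordering ("la casa verde" -> "the green house").
PHRASE_TABLE = {
    ("voy", "a"): ["i", "am", "going", "to"],
    ("la", "casa", "verde"): ["the", "green", "house"],
    ("la", "casa"): ["the", "house"],
}

def translate(source_words):
    """Greedily match the longest known source phrase at each position."""
    output, i = [], 0
    while i < len(source_words):
        # Try the longest candidate phrase first, shrinking until a match.
        for length in range(len(source_words) - i, 0, -1):
            phrase = tuple(source_words[i:i + length])
            if phrase in PHRASE_TABLE:
                output.extend(PHRASE_TABLE[phrase])
                i += length
                break
        else:
            output.append(source_words[i])  # pass unknown words through
            i += 1
    return output

print(translate(["voy", "a", "la", "casa", "verde"]))
# ['i', 'am', 'going', 'to', 'the', 'green', 'house']
```

A real SMT decoder also scores competing segmentations and reorderings against each other; this sketch only shows why phrases, not words, are the unit of translation.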
Analysis of bilingual text corpora (source and target languages) and monolingual corpora (target language only) generates statistical models that transform text from one language to another; statistical weights are then used to decide the most likely translation.
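The weighted decision between candidate translations can be sketched with a toy noisy-channel score. All probabilities and weights below are invented for illustration: in practice the translation-model probabilities come from the bilingual corpora, the language-model probabilities from the monolingual target-language corpora, and the weights are tuned.

```python
import math

# Hypothetical scores for two candidate translations of the same source.
# "tm" is the translation-model probability (from bilingual corpora);
# "lm" is the language-model probability (from monolingual target text).
CANDIDATES = {
    "the house green": {"tm": 0.6, "lm": 0.01},  # literal but unnatural
    "the green house": {"tm": 0.5, "lm": 0.20},  # fluent target phrase
}

TM_WEIGHT, LM_WEIGHT = 1.0, 1.0  # tuned on held-out data in a real system

def score(probs):
    # Weighted sum of log-probabilities: the standard log-linear combination.
    return TM_WEIGHT * math.log(probs["tm"]) + LM_WEIGHT * math.log(probs["lm"])

best = max(CANDIDATES, key=lambda e: score(CANDIDATES[e]))
print(best)  # the green house
```

Here the language model, learned purely from monolingual target text, outweighs the slightly higher translation-model score of the literal candidate, which is why monolingual corpora matter even though they contain no translations.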
Clean Data SMT, and SMT in general, requires very large bilingual corpora. One of the major advantages of the Language Studio platform is its extensive collection of bilingual corpora, known as Foundation Data, covering more than 550 language pairs; a set of 13 Foundation Domains is progressively being added to each language pair. When combined with ADM, the data requirements are greatly reduced, and it is even possible to create custom engines when no bilingual corpora are available.
Language Studio has the Clean Data SMT Model as its data approach, with Deep Neural Machine Translation (Deep NMT) and Statistical Machine Translation (SMT) technologies at its core. It provides a Hybrid Machine Translation Model that integrates the strengths of traditional SMT with Deep NMT to produce notably higher quality translations than either technology alone.