If you are interested in attending our upcoming webinars, please click here to sign up.

Frequently Asked Questions

Machine Translation Basics (MT 101)

  • Is Cloud based MT secure?

    Security is a broad term and the level of security will differ based on the supplier and deployment model. In addition, cloud providers such as Google have legal terms and conditions wherein they claim rights to the data, specifically when free services are used. All these considerations are relevant in a security context.

    Omniscien Technologies recognizes that your translation memories and other linguistic assets are very valuable. These linguistic assets provide a unique competitive advantage by offering greater accuracy, efficiency and productivity. Our corporate culture and infrastructure are designed with security and privacy at their core.

    Omniscien Technologies takes data security very seriously. It hosts its Cloud Service in a secure data center in Europe and confirms its data security policies in its terms and conditions (T&C).

    Language Studio™ infrastructure is secured in a dedicated private network infrastructure that is not shared with other companies and features world-class connectivity and security. Data is stored in a secure file system with 2048-bit encryption. Each client’s data is stored independently to ensure that there is no possibility of data leakage.

    A very detailed contract provides clients protection so that data provided will only be used for the purpose of building and improving the client’s custom engines. Omniscien Technologies does not share your data with any other clients and does not use your data to build other translation engines. Omniscien Technologies service contracts offer even greater protection than is typically found in contracts that Language Service Providers (LSPs) have with second-tier LSPs and individual translators.

    Custom Engines are private to each client. Once a custom engine has been created for a client, the engine can only be accessed by the client or those authorized by the client. Clients control who can access custom engines by using the Language Studio™ web interface to manage permissions and security.

  • What Return on Investment (ROI) and other benefits should I expect?

    Machine translation has come of age and large numbers of Language Service Providers (LSPs) and enterprises are now integrating machine translation into their business. However, few are achieving their full potential and Return On Investment (ROI) is often elusive.

    ROI is a performance measure used to evaluate the efficiency of an investment or to compare the efficiency of a number of different investments. Total Cost of Ownership (TCO) measurements provide a solid foundation for calculating ROI. A clear ROI and TCO calculation should focus on what can be done to reduce the time to achieve ROI through rapid quality improvement when customizing machine translation engines.
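
    As a rough, purely illustrative sketch (the figures and cost categories below are assumptions, not Omniscien Technologies pricing), a TCO and ROI calculation can be expressed as follows:

      # Illustrative only: placeholder figures, not actual Language Studio pricing.
      def total_cost_of_ownership(engine_cost, mt_cost_per_word, post_edit_cost_per_word, words):
          """TCO = one-off engine cost + per-word MT cost + per-word post-editing cost."""
          return engine_cost + words * (mt_cost_per_word + post_edit_cost_per_word)

      def roi_percent(baseline_cost, tco):
          """ROI expressed as savings relative to the total cost of ownership."""
          return 100.0 * (baseline_cost - tco) / tco

      words = 1_000_000                                   # annual volume to translate
      baseline = words * 0.18                             # assumed human-only cost of $0.18 per word
      tco = total_cost_of_ownership(engine_cost=10_000,   # assumed one-off customization cost
                                    mt_cost_per_word=0.01,
                                    post_edit_cost_per_word=0.06,
                                    words=words)
      print(f"TCO ${tco:,.0f}, baseline ${baseline:,.0f}, ROI {roi_percent(baseline, tco):.0f}%")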

    Case studies have shown that users can expect an increase in profit margins coupled with a decrease in required human resource costs. Overall costs decrease over time as machine translation engines mature, requiring less post-editing effort, so profit margins grow as the quality of the machine translation engine progressively improves. Some users have recovered all their machine translation investment costs after their very first project.

    When selecting a machine translation vendor, many organizations look at the cost of machine translation without fully understanding the total costs to deliver the final translation output and potential savings to be gained. Lower cost machine translation does not necessarily mean higher ROI. The largest costs in translation are those that involve humans. As such, the best ROI will be provided when there is less human effort required. High quality fully customized machine translation that has been adapted to the client’s terminology, vocabulary and desired writing style using the Clean Data SMT model, greatly reduces human effort and costs while increasing productivity by as much as 300%. Generic machine translation such as that from Google Translate and Do It Yourself machine translation customization delivers lower quality translations resulting in higher human effort and cost, thus lower ROI.

  • What are the common misconceptions about MT?

    There are many misconceptions about machine translation. Some come from the experience that people have had with historical legacy machine translation products and tools. Others come from translators who do not yet fully understand the technology and are afraid that machine translation threatens to replace them. Some of the common misconceptions are listed below. All are FALSE.

    False

    • Machine translation will be useful in the future, but we’re not there yet.
    • Machine translation quality is so bad, that any translations done by machine are useless.
    • It is faster to translate by hand than to edit machine translation output.
    • Machine Translation will replace human translators.
    • Anybody with a PC and a browser can download Moses and build an MT engine.
    • General Purpose “Free” Online machine translation is the same as carefully built custom machine translation engines.
    • Instant machine translation solutions are the same as carefully customized engines.
    • Rules-based machine translation is better than Statistical machine translation or vice versa.
  • How does machine translation fit into my translation workflow?

    Language Studio™ is integrated with the leading Translation Memory Systems, Translation Management Systems, and Content Management Systems – leading to seamless workflow integration. APIs are freely available for integration with other third-party platforms.

    Language Studio™ is designed to integrate at the pre-translate stage, immediately after or as part of the same stage where translation memories are matched to the source text being translated. Integrating at this early stage means that the existing translator workflow does not change and is almost identical to post-editing a translation memory fuzzy match.

    Example of Translation Workflow Integration


    Options:

    Option 1: Translate via your preferred TMS

    1. Perform a translation memory match within your preferred TMS platform.
    2. A typical fuzzy-match threshold of 85% or greater is used, as raw MT will often be better than a lower-scoring fuzzy match.
    3. Send unmatched segments to machine translation. TMS platforms integrated with Language Studio™ can do this as a single click or automated immediately after the translation memory match.

    Option 2: Edit and Proof using your preferred editing tools

    1. Depending on the quality of the MT output, raw MT output can be corrected by a translator or, in the case of high quality MT, can often bypass this translator step and go directly to an editor.
    2. Language Studio™ confidence scores can be leveraged to determine whether segments should be reviewed by a translator or an editor.
    3. Different projects have different budgets and quality requirements. This should be taken into account when defining the workflow for a project.
    4. After editing and proofing, the high quality segments for a given project can be applied to the custom machine translation engine so that the next time it is used, the lessons learned from these edits will be applied, raising the quality of future translations. This is usually done just prior to publication.
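
    As an illustration of the pre-translate step described in Option 1, the sketch below shows the basic logic of sending only low-scoring segments to machine translation. The function names tm_match and machine_translate are hypothetical placeholders for the TMS lookup and the Language Studio™ call; they are not actual API names.

      FUZZY_THRESHOLD = 85  # segments below this fuzzy score are sent to MT, per the guidance above

      def pretranslate(segments, tm_match, machine_translate):
          """Return (segment, translation, origin) triples for a list of source segments."""
          results, mt_queue = [], []
          for segment in segments:
              match = tm_match(segment)  # e.g. {"score": 92, "target": "..."} or None
              if match and match["score"] >= FUZZY_THRESHOLD:
                  results.append((segment, match["target"], "tm"))
              else:
                  mt_queue.append(segment)
          # Only the unmatched segments are machine translated, in one batch.
          for segment, translation in zip(mt_queue, machine_translate(mt_queue)):
              results.append((segment, translation, "mt"))
          return results
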
  • Can machine translation handle complex formatting?

    The answer to this question depends on the machine translation platform being used. Language Studio™ handles complex formatting embedded in documents such as Microsoft Word, PowerPoint or standard XML interchange formats such as XLIFF or TMX.

    Many machine translation tools and platforms cannot handle complex formatting. There is a common misconception that RBMT can handle formatting, but SMT cannot. This is totally false, but this myth continues to be perpetuated throughout the market by LSPs that have a vested interest in legacy RBMT tools and have not moved to more advanced and modern platforms.

    Open source Moses does not handle complex formatting. Anyone adopting this platform needs to write their own software to manage complex formatting or manually reinsert the formatting after the translation is complete.

    In addition to handling complex formatting, Language Studio™ also supports XML markup to guide quality and control the translations at runtime. This enables sections of documents to be translated in a specific writing style, some content to be ignored, and control over reordering. Language Studio™ even provides a JavaScript layer so that you can customize and control translations at runtime.
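
    As a purely illustrative example of this kind of runtime markup (the element names below are placeholders, not the actual Language Studio™ markup vocabulary), source content might be annotated before submission along these lines:

      # Illustrative only: the tag names are placeholders, not Language Studio markup.
      from xml.sax.saxutils import escape

      def wrap(text, tag, **attrs):
          """Wrap a span of source text in an XML element with optional attributes."""
          attr_str = "".join(f' {key}="{escape(value)}"' for key, value in attrs.items())
          return f"<{tag}{attr_str}>{escape(text)}</{tag}>"

      source = ("Install the " + wrap("TurboLift 3000", "do-not-translate")
                + " as described in " + wrap("the brochure", "style", name="marketing") + ".")
      print(source)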

  • Why is customised machine translation better than generic machine translation?

    Generic machine translation engines like Google Translate or Bing Translator are designed to translate anything with no specific domain bias or focus. The resulting translation is often out of context and lacks any specific writing style. Terminology and vocabulary choice are often inconsistent. Worse still, the terminology can often be completely wrong, as it is determined by whichever data was most statistically relevant.

    To demonstrate this concept, consider the simple sentence “I went to the bank.” A generic translation engine will not know if the context is “I went to the bank (of the river)” or “I went to the bank (of the turn)” or “I went to the bank (to deposit money)” or any of a number of other possible contexts. As such, the most statistically relevant choice will be taken, which in many cases will be wrong.

    From the single sentence above, even a human translator cannot determine the context, so it is reasonable that a machine would also have difficulty. However, if a human translator were given the information that they were translating in the Banking and Finance domain, then the accuracy of the translation would be increased. Customization of a machine translation engine is similar in that an engine is trained in a specific domain so has additional contextual information that improves translation accuracy.

    There are different levels of customization and different approaches to customization. The full machine translation customization approach provided by Language Studio™ incorporates the Clean Data SMT model and is the most advanced in the industry. This approach ensures the context, preferred terminology, vocabulary choice and writing style are incorporated into the machine learning process and thus deliver optimal machine translation quality output.

  • Are human translations always better than Machine Translations (MT)?

    While generally raw machine translation is not as good as human translation, there are a number of cases where machine translation can actually be better than human translators.

    In the LexisNexis Univentio case study, 13 million patents were translated from Japanese to English. Machine translated patents resulted in a 20% increase in meaningful search results when compared to human translations of patents. Machine translation delivered a significant and meaningful amount of added value.

    Had the LexisNexis Univentio project been translated by human translators it would have taken 152,257 person years of effort and cost in excess of US$ 40 billion. The project would never have been attempted, as its time and cost requirements would have been thousands of times beyond the business value that such a task would deliver. With Language Studio™ this task was successfully completed in a fraction of this time and cost. LexisNexis Univentio has subsequently repeated the process for many other languages.

    Omnilingua Worldwide, a Language Service Provider (LSP) that used Language Studio™ to translate automotive content for a USA car manufacturer, received a call from the end client after delivery asking what had changed. The client noted that the quality of the final delivered product was notably better than historical translations. Translation had previously been performed for this client with a human-only approach and later with a competitor’s legacy machine translation engine. With the combined Language Studio™ custom machine translation engine plus human post-editing approach, terminology and writing style were more consistent than in the past.

    With human translators, when more than one translator is involved, there will always be a variation based on the individual preferences of each translator. With Language Studio™ terminology and writing style was normalized within the custom translation engine as part of the Clean Data SMT model advocated by Omniscien Technologies. This ensured consistency and reduced the amount of human editing required.

    The above examples are just two of many where machine translation, either with or without a human post-editing the content, has delivered results that exceed those of a human-only approach. The important thing to determine is what quality level is required and what can be achieved with machine only or machine plus human. Some content is better suited for this purpose than others. The appropriate level of human involvement can then be determined in order to deliver the required quality level.

  • Will Machine Translation put human translators out of work?

    No. There will always be a need for premium content that only humans can produce. However, much of the repetitive, mundane content will be well suited to MT.

    Automated translation is not designed to replace human translators; it is intended to work alongside them, improving their efficiency and allowing translation of content that could not be translated by humans alone due to cost and time constraints. A machine + human collaboration can achieve results that neither could achieve alone and it opens up new horizons for the types of content that can be translated.

  • What Translation Management Systems (TMS) are supported?

    Language Studio™ is supported by a large number of third-party Translation Management Systems (TMS) and workflow tools.

  • What document types are supported?

    Language Studio™ supports most common translation formats such as XLIFF and TMX, in addition to common working formats such as Microsoft Office, text and HTML. A comprehensive list of Supported Document Formats is available for review.

    If you use a Translation Management System (TMS), then additional formats may be supported by your TMS, with your TMS communicating with the Language Studio™ API using XLIFF. Your TMS will handle the transformations between various formats automatically. Language Studio™ is supported by a wide variety of third-party TMS platforms. See the partners section for more details.

  • What is the difference between “direct language pairs” and “language pair pivoting”?

    Language pair pivoting uses a middle language (e.g. German → English → French) to bridge the gap between language pairs. Direct language pairs go directly from one language to another (e.g. German → French). Unlike many other competing translation technologies, all of the Language Studio™ language pairs are direct language pairs. Direct language pairs will always give higher quality translations than pivoted language pairs.
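
    Conceptually, pivoting simply chains two translation calls, which is why ambiguities and errors from the first hop carry into the second. The sketch below is illustrative only, with translate() standing in for any single MT call:

      def translate(text, source, target):
          """Placeholder for a single MT call (for example, one request to an MT service)."""
          raise NotImplementedError

      def direct_de_fr(text):
          return translate(text, source="de", target="fr")

      def pivoted_de_fr(text):
          # Two separate calls: errors introduced in the German-to-English hop
          # are carried into, and often amplified by, the English-to-French hop.
          english = translate(text, source="de", target="en")
          return translate(english, source="en", target="fr")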

  • What domains are supported?

    Many MT systems will offer domains such as Automotive at a high level. With Language Studio™ we do not see this high-level focus as enough to deliver near-human quality automated translation output. A fully customized engine within Language Studio™ will support granular sub-domains that focus on writing style, preferred vocabulary and target audience. For example, Automotive Marketing is very different from Automotive Technical Service Manuals and Automotive User Manuals.

    Omniscien Technologies constantly evolves its Foundation Data (Industry Engines), improving the quality and vocabulary range supported by Language Studio™, and is progressively building and expanding Industry Engine data in 12 domains in order to make it easier for any of our customers to start building their own high quality custom machine translation engine.

    The following Industry Domains are being progressively developed and expanded upon:

    • Automotive
      Workshop manuals, training manuals, technical papers, user manuals, marketing documentation
    • Banking and Finance
      Product marketing, communications, research, regulatory documentation
    • eDiscovery
      Processing of Electronically Stored Information (ESI) from a range of ‘native’ formats for legal or data mining purposes
    • Engineering and Manufacturing
      Process documentation, product literature, research materials, regulatory documents, training manuals, technical papers and user manuals
    • General
      General purpose, not domain specific. Use this industry engine if none of the other Industry Domains match your desired content domain. This domain provides a solid base on top of which any other domain can be built.
    • Information Technology
      User manuals, technical manuals, technology marketing, software strings, product briefs, knowledge base.
    • Life Sciences
      Medical, Healthcare, Informed consent forms (ICFs), Instructions for use (IFUs), Patient-facing materials, Product literature, Research materials, Regulatory documents, Training manuals, Technical papers and User manuals
    • Military and Defense
      Aerospace, military, defense and information sources of interest to national security
    • News and Media
      News, publications, commentary, blogs and freeform text, and media such as song lyrics and subtitles
    • Patents and Legal
      Patents, intellectual property, legal contracts, commercial terms, licenses, policies, deeds
    • Politics and Government
      Policy, governmental proceedings, eGovernment documentation
    • Retail and eCommerce
      Product Catalog, Customer Support, Retail Marketing
    • Travel and Hospitality
      Travel, hotel and restaurant related descriptions, reviews, menus and bookings
  • What Languages are supported by Machine Translation?

    Different machine translation tools support different language pairs. With more than 540 language pairs supported, Language Studio™ has more language pair combinations available than any other commercial machine translation platform. New language pairs are added on a regular basis, with over 200 language pairs currently in development.

    As you would expect, Language Studio™ supports all major languages. Many of the languages supported are listed below. If the language pair you require is not already supported, talk to the Omniscien Technologies team about our language pair development schedule.

    Arabic
    Bahasa Indonesia
    Brazilian Portuguese
    Bulgarian
    Chinese
    Czech
    Danish
    Dutch
    English
    Estonian
    Finnish
    French
    German
    Greek
    Hindi
    Hungarian
    Irish
    Italian
    Japanese
    Korean
    Latvian
    Lithuanian
    Maltese
    Polish
    Portuguese
    Romanian
    Russian
    Slovak
    Slovene
    Spanish
    Swedish
    Thai
    Turkish

    Language Studio™ also supports language variants for all the above languages where appropriate. The latest information about languages currently supported by Language Studio™ and language pairs in development is always available on our Supported Languages page.

  • How fast is Machine Translation?

    The speed of machine translation varies depending on many factors, such as the size of the engine, the language pair (more complex languages such as Chinese and Japanese have additional processing requirements) and the format of the document being translated.

    With Language Studio™, machine translation speed can be configured to match your requirements and ranges from 3,000 words per minute upwards. Some of our customers are translating more than 1 billion words per day.

    While estimates vary, the general consensus is that the total number of professional translators in the whole world is around 300,000. If all of these translators were to work on the same day and translated at an average speed of 2,500 words per day, the total combined output per day would be 750 million words. Thus, all professional translators combined would deliver an output volume that is equivalent to just 75% of the daily output from one of Omniscien Technologies’ larger customers. Clearly it is not practical to edit text when it is produced in this volume, but editing is not always required, as described in the answer to the question “Who uses machine translation?”

  • How can I translate?

    Machine translation and automated translation tools take a variety of forms, such as desktop tools, mobile phone applications, Internet (cloud computing) based Software as a Service (SaaS) solutions, and server-based solutions that can be installed in-house to service an entire company’s translation needs.

    Standalone desktop tools are usually Rules-Based MT (RBMT) and have fewer languages available, with less control over quality, writing style, vocabulary choice and target audience than modern hybrid rules and statistical platforms such as Omniscien Technologies Language Studio™. Server-based solutions, whether on-site, cloud-based or SaaS-based, are available as rules-based and statistical solutions. Some are even open source but, as with most open source software, provide only a base level of functionality and are missing many of the features provided by a fully developed and leading-edge translation platform such as Language Studio™ Enterprise.

    Language Studio™ is available as a desktop tool, an on-site server-based solution or a cloud-based SaaS platform. Developers can leverage the Language Studio™ API to integrate automated translation directly into their workflow. Language Studio™ is also integrated into many third-party workflow and translation management systems (TMS) such as memoQ, XTM, MemSource, SDL and DejaVu. Click here for more information on how to translate with Language Studio™.
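
    As a hypothetical illustration of API-level integration (the endpoint URL, request fields and response shape below are assumptions for the sketch, not the documented Language Studio™ API), a developer might submit segments over HTTP like this:

      # Hypothetical sketch only: URL, fields and response shape are assumptions,
      # not the documented Language Studio API.
      import requests

      def translate_segments(segments, source_lang, target_lang, api_key):
          response = requests.post(
              "https://api.example.com/v1/translate",            # placeholder endpoint
              json={"source": source_lang, "target": target_lang, "segments": segments},
              headers={"Authorization": f"Bearer {api_key}"},
              timeout=60,
          )
          response.raise_for_status()
          return response.json()["translations"]                 # assumed response field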

  • Why do I need Machine Translation?

    According to research firm Common Sense Advisory, 72.1 percent of consumers spend most or all of their time on sites in their own language, 72.4 percent say they would be more likely to buy a product with information in their own language and 56.2 percent say that the ability to obtain information in their own language is more important than price. These are just a few of the many reasons that translation has become essential in the modern and ever more globalized world that we live in.

    Machine translation is a tool that can help businesses and individuals in many ways. While machine translation is unlikely to totally replace human beings in any application where quality is really important, there are a growing number of cases that show how effective and useful machine translation can be.

    Machine translation can be used on its own or in conjunction with human proofreaders and post-editors. Depending on the nature of content, it is not uncommon to use 3 approaches together (even on one project) – human translation only, machine translation only and a combination of machine translation with human post-editing.

    Machine translation is useful in many areas, including:

    • Highly repetitive content where productivity gains from machine translation can dramatically exceed what is possible with just using translation memories alone (e.g. automotive manuals)
    • Content that is similar to translation memories but not exactly the same (e.g. government policy documents.)
    • Content that would not get translated otherwise due to cost, scale and volume limitations of human translations (e.g. millions of Chinese, Japanese, Korean and other language patents made available in English.)
    • High value, time-sensitive content that changes every hour and every day (e.g. stock market news.)
    • Content that does not need to be perfect but just approximately understandable (e.g. any website for a quick review.)
    • Knowledge content that facilitates and enhances the global spread of critical knowledge (e.g. customer support.)
    • Content that is created to enhance and accelerate communication with global customers who prefer a self-service model (e.g. knowledgebase.)
    • Content that would normally be too expensive or too slow to translate with a human only translation approach (e.g. many projects that have insufficient budget for a human only approach.)
    • Increasing the amount of content that can be translated within a budget (e.g. Dell has doubled the amount of content they translate without increasing their translation budget.)
    • Real-time communications where it would not be practical for a human to translate (e.g. chat and email.)

    There is great business value in enabling new and existing content to be translated. When customized, machine translation provides greater efficiency and productivity, lowering costs and increasing Return On Investment (ROI).

  • Who uses Machine Translation?

    Machine translation has evolved to become an essential part of business and individual consumer life. Its users include:

    • Individuals use generic machine translation systems such as Google Translate for ad hoc translations from their mobile phones and personal computers where quality and data security are not important.
    • Enterprises use customized machine translation for customer support, data mining and sometimes even the publishing of multilingual content where the scale and volume would make it impossible to translate with human translators. Enterprises also leverage machine translation to improve productivity for professional translators.
    • The professional translation industry, known as Language Service Providers (LSPs), use machine translation to increase productivity, while ensuring quality standards are maintained via human post-editors.
    • Law firms use machine translation as part of their electronic discovery (eDiscovery) processes. Machine translation makes vast amounts of content available for analysis that would have been very difficult to process in a variety of languages.
    • Governments use machine translation as part of their intelligence gathering and monitoring activities as well as increasing productivity for human translators when translating policy and other multilingual documents.

    There are many other ways that organizations and individuals are using machine translation. Omniscien Technologies has a series of tailored solutions for different industries, organization types and business functions. Depending on your requirements, Omniscien Technologies can supply the entire solution, or work with our technology partners to deliver a tailored solution that is designed to meet your specific needs.

    Solutions

    Primary Solutions

    • Language Service Providers
    • Multinational Enterprises and Content Publishers
    • Data Mining and Data Processing Enterprises
    • Government and Intelligence Services
    • eDiscovery and Litigation Support
    • Social Media, Mobile Apps, CRM and User Generated Content
    • Consulting Services

    Specialized Solutions

    • Automotive
    • Information Technology
    • Integrated OCR and Machine Translation
    • Life Sciences
    • Patents and Legal
    • Scientific Data and Journals
    • Travel and Hospitality
    • High Volume Bulk Translators
    • Network Business Applications
    • Secure Translation Environments
  • How can Machine Translation be used?

    Machine translation is a powerful tool that has many purposes and can be used in a number of different ways:

    • Translating content quickly to get an understanding of information in another language.
    • Increasing productivity of human translators by providing a first-pass translation that can be quickly edited to human quality. Productivity can be increased from 2,000-3,000 words per day up to 24,000 words per day.
    • Translating content that needs to be translated instantly such as chat, email or customer support communications.
    • Translating content that would normally be too slow or too expensive to translate using a human only approach to translation.
    • Content that may be useful in another language, but where there is insufficient budget or the business case is not strong enough to warrant the investment of human translation.
    • Allowing language service providers to access the next several tiers of information that would not normally be translated due to time and budget constraints.

    In all cases, the quality of machine translation will always be better when a custom engine is used that has been focused on a specific domain. Learn more about the benefits of customizing a machine translation engine. Omniscien Technologies has also published a number of case studies.

  • What is Machine Translation (MT)?

    Machine Translation (MT) is automated translation from one language to another using computer software.

    Machine translation is often perceived as low quality based on an outdated perception created by older translation technologies or freely available generic translation tools from Google or Bing that have not been customized for a specific purpose. Many technology advances have been made in recent years that are changing this perception, with customized machine translation engines like those from Omniscien Technologies’ Language Studio™ delivering near-human quality output.

    The history of machine translation dates back to July 1949, when researchers at the Rockefeller Foundation put forward a proposal based on information theory and the successes of code breaking during the Second World War, fueled by Cold War fears. Claims were made that computers would replace human translators within 5 years. But solving the automated language translation challenge was a bigger challenge than was understood at the time… and many still underestimate it today.

    Many technical approaches have been developed to solve the challenge of automated language translation. Three of the more popular approaches are Rules-Based Machine Translation (RBMT), Statistical Machine Translation (SMT) and Syntax-Based Machine Translation (SBMT). All of these approaches have their strengths and weaknesses. However, the latest innovations in machine translation use a hybrid machine translation model combining all these techniques to deliver greater quality and functionality.

    SMT has become the dominant approach to machine translation, with many previous users of RBMT technology replacing their legacy deployments with SMT based solutions. The SMT approach has been progressively refined, but still has many challenges. One of the most common issues reported by users is that terminology and writing style are inconsistent and unpredictable. Omniscien Technologies identified these issues and in 2008 pioneered a new approach to SMT called Clean Data SMT that addresses and resolves a large number of the issues that are inherent in the traditional Dirty Data SMT approach.

    Today automated translation can solve a huge number of translation challenges and increase productivity of translations, while making content that previously could not be translated due to cost, time and size more accessible.

    The Translation Automation User Society (TAUS) has recently made available a comprehensive history of machine translation.

Neural Machine Translation (NMT)

  • Output words are predicted from the encoding of the full input sentence and all previously produced output words. Do previously translated phrases/segments in the same project predict future segments, or is the encoding/context confined to the segment?

    It is currently restricted to single segment translation. Additional segments could be added as additional conditioning context, yes. There have been some efforts to exploit discourse-level context, but these systems are in early stages.

  • For decoding to occur, encoding is necessary, but what is the conceptual relationship between encoding, decoding and translation?

    The goal of encoding is to take the input sentence and produce a representation of it that is, some would argue, more semantic, capturing the meaning of the words in context. Taking this encoding of the input sentence, the goal is then to decode it into an output sentence: the encoding step comes first and the decoding step afterwards.
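
    For readers who want to see the encode-then-decode idea in code, the following is a minimal, simplified sketch in PyTorch; the dimensions and architecture are toy-sized and do not represent any production NMT system.

      # Minimal toy sketch of encode-then-decode (not a production NMT architecture).
      import torch
      import torch.nn as nn

      class TinySeq2Seq(nn.Module):
          def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=64):
              super().__init__()
              self.src_embed = nn.Embedding(src_vocab, dim)
              self.tgt_embed = nn.Embedding(tgt_vocab, dim)
              self.encoder = nn.GRU(dim, dim, batch_first=True)
              self.decoder = nn.GRU(dim, dim, batch_first=True)
              self.out = nn.Linear(dim, tgt_vocab)

          def forward(self, src_ids, tgt_ids):
              # Encoding: turn the whole input sentence into a context representation.
              _, context = self.encoder(self.src_embed(src_ids))
              # Decoding: predict output words conditioned on that context and on
              # the previously produced output words (teacher-forced here).
              decoded, _ = self.decoder(self.tgt_embed(tgt_ids), context)
              return self.out(decoded)   # a score per target-vocabulary word, per position

      model = TinySeq2Seq()
      scores = model(torch.randint(0, 1000, (1, 7)), torch.randint(0, 1000, (1, 5)))
      print(scores.shape)   # torch.Size([1, 5, 1000])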

  • Typically, the input and output of SMT systems are sentences. This limits SMT systems in modeling discourse phenomena. Could we go beyond sentences with NMT systems?

    It seems straightforward to do, because we already have embeddings that can be conditioned on anything previous. So, in general the good news is yes, it is relatively straightforward to add conditioning context when making predictions.

  • Two objectives were mentioned: Adequacy and Fluency. Where does grammar fit into this and does NMT check grammar rules or only sees sequence of words?

    Grammar is part of fluency. Readers will not think a sentence is fluent when it has grammatical errors.

  • Some researchers have proposed full character-level NMT to solve the rare word problem in NMT, especially in the last two months. What is the performance when using character-level NMT?

    They seem to be a decent approach to unknown word translation. In practice, there may be better solutions, e.g., the use of additional terminology dictionaries or special handling of names.

  • Do you think that sequence-to-sequence gives better results than a character-based model?

    Currently, character-based models have not been proven to be superior.

  • You mentioned that the machine learns “It’s raining cats and dogs” as a whole phrase. How does it learn something more complex like “Sarah looked up, the sky was bleak but there shouldn’t be any cats and dogs coming out of it.”

    Well, that is a tough one. Theoretically it is possible that the concept “cats and dogs” will be learned to be similar to “rain”, but that is a lot to ask for.

  • When comparing SMT to NMT, what is the difference in the time required to train an engine?

    SMT systems can typically be trained in 1-2 days on regular CPU servers. NMT seems to require 1-2 weeks on GPUs.

  • Is monolingual data still useful in NMT as it is for an SMT language model or is parallel data the only data useful at training time?

    The current basic model only uses parallel data. There are several methods and extensions to the model to accommodate monolingual data, but consensus best practices have not yet been established. This is an area of active research.

  • Can you do the “incremental” training of NMT?

    Incremental training improves the model continuously as new training data arrives. This fits closely with the “online training” method of neural network models. So, in theory, these models can be incrementally updated relatively straightforwardly.
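
    Conceptually, incremental updating just means running a few more gradient steps on the newly arrived sentence pairs. The sketch below reuses the toy TinySeq2Seq model from the earlier encoder/decoder example and is illustrative only, not the training procedure of any particular product.

      # Illustrative sketch of incremental (online) updating with new sentence pairs.
      import torch
      import torch.nn as nn

      def incremental_update(model, new_pairs, steps=5, lr=1e-4):
          """new_pairs: list of (src_ids, tgt_ids) tensors, e.g. freshly post-edited segments."""
          optimizer = torch.optim.Adam(model.parameters(), lr=lr)
          loss_fn = nn.CrossEntropyLoss()
          model.train()
          for _ in range(steps):
              for src_ids, tgt_ids in new_pairs:
                  logits = model(src_ids, tgt_ids[:, :-1])            # predict each next word
                  loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                                 tgt_ids[:, 1:].reshape(-1))          # compare against shifted target
                  optimizer.zero_grad()
                  loss.backward()
                  optimizer.step()
          return model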

  • How much parallel text do we need when we use NMT systems compared to traditional SMT systems? Is it true that NMT systems need more data than SMT systems?

    That seems to be the consensus in the field: at least to get going, you need more data. You can build a phrase-based machine translation system from a million words fairly easily, but this does not seem to be the case for neural machine translation, where you need more data, especially if your data is slightly off-domain; having little data then is really bad.

    The difference in the type of data that goes into an NMT system is that NMT is currently trained only on translated text; for instance, NMT systems are not trained on any additional target-language-only text, which is a huge part of traditional statistical machine translation. There is no separate language model. When it comes to terminology and certain kinds of additional information, it is currently not entirely clear how to integrate that with NMT.

  • Do you think that we need more accurate (higher quality) data than is used to train SMT models?

    This is an open question at this point and requires more research.

  • How many companies use NMT on a commercial basis at this point? What’s the general feedback of them so far?

    As of today (January ’17) a number of vendors have announced or indicated offerings. We are not in a position to compare or comment on the quality of the different systems.

  • When will NMT be commercially deployable on a large scale across industries? Will it still be largely based on SMT as a base?

    There are many components in SMT that are also used in NMT models, so there will be re-use. How ideas or code will be merged between SMT and NMT remains to be seen.

  • One of the things that has made SMT commercially feasible is adaptivity: adjusting a system’s behaviour based on small bits of human input, without retraining on the entire data set (too costly). As I understand it, that also required technology innovation. Do current NMT models have this capability, or is adaptive NMT still some way off, or not feasible with current technology at all?

    The learning algorithm that is used to train all these parameters is an online learning algorithm, meaning that it learns by continuously processing new training examples. In practice that means you give it new sentence pairs and it learns from them. In theory, it should be pretty straightforward to take a model that exists, add some additional training data that might be different in nature (different domain, different style), and then adapt the existing model to the new domain. So, there is definitely some hope for that. Theoretically this could work on a per-sentence level: you have an existing system, a new sentence comes in, you adapt to it, and your system knows about that sentence. It is, however, still in the research phase.

  • Is self-learning from post-editing easier with NMT than with SMT?

    Theoretically, it can be done for both systems. There is proven technology for SMT, but the idea has not been fully explored for NMT.

  • How are the known terminology issues with SMT being addressed with NMT? Is deep learning supposed to solve them or make them more complicated?

    This is currently not fully explored and is still in the research stage.

  • Is predictability supposed to work even with source text? Would it be possible, using the same approach, to assess the suitability of a source text for MT?

    A neural language model (trained on the source language) may be used as an indicator of how similar the source text is to the training data, and hence its likelihood of being translated correctly. Such confidence measures are typically trained with this feature, among others.

  • What do the black spots represent in the Neural Language Model Slide?

    In the webinar on “Introduction to Neural Machine Translation” the circles stand for individual nodes in the layer. The color indicates their value – the darker the higher. In the mapping from the 1-hot-vector to the embedding you can see how white/black nodes get mapped to continuous values.

  • Are there any experiments taking into account the training data (e.g. simple sentences / few words first etc.)?

    There is a method called “curriculum learning”, where training starts with simpler, shorter sentences. This has not yet been extensively studied in the context of neural machine translation.

  • Do words have fixed embeddings? How is the information in the embedding decided on?

    The embeddings are learned along with everything else. Here, only the big picture of how everything is connected was drawn, but once you have drawn that picture, there is a machine learning method called back-propagation where you take this model as basically a layout, add data, run it for a few weeks, and it learns all the parameters of the model, including the embeddings. So, the embeddings are not provided separately. There is no fixed embedding for a language that you build once and use forever.

  • Could you please explain in more detail the concept of “Embedding”?

    “Embedding” refers to the representation of a word in a high-dimensional continuous vector space. The hope is that words that are similar get similar representations (i.e., the difference between their vectors is small). There are various ways such embeddings can be learned. In the talk “Introduction to Neural Machine Translation”, we discussed both word2vec and embeddings within the context of a neural machine translation system. See here for recordings of these webinars.
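
    As a small illustration (toy vocabulary and dimensions), an embedding is simply a learned lookup from a word index – the position of the 1 in the one-hot vector – to a dense, continuous vector:

      # Toy illustration: an embedding is a learned lookup table from word index to vector.
      import torch
      import torch.nn as nn

      vocab = {"bank": 0, "river": 1, "money": 2}
      embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

      vector = embedding(torch.tensor([vocab["bank"]]))   # a 4-dimensional continuous representation
      print(vector.shape)                                 # torch.Size([1, 4])

      # The vectors start out random and are adjusted by back-propagation together with
      # all other model parameters; words used in similar contexts end up with similar vectors.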

  • Do NMT systems have problems with long sentences, as SMT systems do?

    Yes, that still appears to be the case. The attention mechanism in neural models is currently not reliable enough.

  • How much attention is being paid to translations into highly inflected languages?

    There have been few efforts to tailor neural machine translation specifically to morphologically rich languages, but the general model has been successfully applied to languages such as Russian and Czech.

  • Could you recommend a brief bibliography on Neural Machine Translation (NMT)?

    A fairly comprehensive bibliography for neural machine translation can be found here.

  • In December ’16, Google (GNMT) also proposed multilingual NMT, training several language pairs at the same time and using the same encoder-decoder. What do you think about the issues in multilingual NMT?

    From our understanding, this may be successful for low resource languages, but building direct translation models for each language pair generally works better.

  • Can NMT be applied when the source and target languages are the same? For example, English to English. And will it perform better than translating between two different languages?

    It depends on what the task is. Models similar to the neural machine translation model have been applied to chat bots, where the “translation” is between question and answer. For tasks such as text simplification, grammar checking, or automatic post-editing, neural machine translation systems have been successfully applied.

  • Are there specific languages that NMT is better suited for than others?

    It is too early to tell. We have seen relatively better results with morphologically rich output languages and distant language pairs.

  • Are there cases where Rules-Based Machine Translation (RBMT) or SMT is better than NMT?

    While the NMT technology is evolving quickly, at this moment SMT still offers more control, both in terms of writing style for specific domains using the Language Model (LM) and in capabilities such as the ability to override and control terminology, which is key in many professional translations – both at run-time and by training the engine accordingly. Finally, SMT systems still offer more control through their rules, for example for the markup and handling of complex content such as patents or eCommerce product catalogues, but also for conversions such as “on-the-fly” conversion of measurements.

  • Do you think linguists can be involved and work together with the developers in training the machines in order to achieve better results? Is this something that’s happening already perhaps?

    There is benefit to linguistic knowledge in preprocessing and preparing language data. We have seen benefits from linguistically motivated preprocessing rules for morphology or reordering in SMT systems. It is not yet clear if these or similar inputs would also benefit NMT systems. Since NMT systems are data-driven and typically use unannotated text, there is no obvious place where linguistic insight would be helpful. However, as with SMT, post-edited feedback can be used to improve translation quality in NMT.

  • What will be required to integrate NMT with translation software, CAT and TMS systems such as SDL Studio and Kilgray memoQ?

    While we cannot speak for other vendors, integrating with our Language Studio platform will be no different than it is today. We have a well-established API that is fully integrated with a wide range of CAT and TMS systems and is also in use by many enterprises directly.

    An engine in Language Studio will be configured for an NMT-specific or SMT-specific workflow, and all the appropriate pre- and post-processing then takes place automatically. From an external integration perspective, the technology used to perform the translation is transparent.

  • Do you see room for curated language/knowledge bases (such as FrameNet, for example) in the future of Neural MT? If so, which path do you think is a promising one?

    This is a bit difficult. There has been some success in enriching the input to NMT systems with added linguistic annotation, so FrameNet annotations may be useful as well. But this is very superficial integration of these knowledge sources. Deeper integration has not been studied yet.

  • What are some of the common search (decode) algorithms used to find the translation in neural machine translation? (i.e. Does it still employ Beam Search like Phrase-based SMT or is this substituted by the output word prediction model)

    Search is greedy and predicts one word at a time. There are beam search algorithms that give some improvements with small beam sizes. These consider the second-best word translation, third best translation, etc. and proceed from there.
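
    A simplified sketch of beam search is shown below: at each step the k best partial hypotheses are kept and extended. Here next_word_scores is a placeholder for the neural model’s prediction of log-probabilities for the next output word; a real decoder would call the network itself.

      # Simplified beam search: keep the k best partial hypotheses at each step.
      import heapq

      def beam_search(next_word_scores, beam_size=3, max_len=20, eos="</s>"):
          """next_word_scores(words_so_far) -> iterable of (word, log_probability) pairs."""
          beams = [(0.0, [])]                              # (cumulative log-probability, words so far)
          for _ in range(max_len):
              candidates = []
              for score, words in beams:
                  if words and words[-1] == eos:           # finished hypotheses carry over unchanged
                      candidates.append((score, words))
                      continue
                  for word, logprob in next_word_scores(words):
                      candidates.append((score + logprob, words + [word]))
              beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
              if all(words and words[-1] == eos for _, words in beams):
                  break
          return beams[0][1]                               # the best-scoring hypothesis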

  • I am wondering who will be absorbing the cost of implementing NMT? Will it be the end client, the LSPs or translators?

    Putting this question into perspective, an SMT engine takes about 1 day to train on a multiple-CPU system with about 256GB RAM. Training an NMT system such as this with CPU technology is not viable, as it would take many months. Training using GPU technology still takes about 1 month. As such, the costs of training are notably higher.

    Similarly, at translation runtime (decoding), GPU technology is faster than CPU, but costlier. Using a CPU to decode NMT delivers about 600 words per minute, while the same CPU for SMT would deliver 3,000 words per minute, 5 times faster. Using GPUs requires new hardware investments, which at this time are quite costly, and depending on the GPU technology used, can be several times faster than SMT.

    On the one hand, the cost will likely come down over the coming years as GPU capacity increases and the systems become more efficient; on the other hand, cost/benefit will depend largely on the use cases and the industry. In some cases, the LSP might absorb some of the cost in return for improved efficiency; in other use cases the customer might need to absorb the added cost for the improved quality.

  • Is human/BLEU scoring the preferred/only way of performing LQA for NMT output? Are models like DQF and MQM used as part of the research or is the NMT output quality above what these systems can show?

    There are many different frameworks for measurement, but BLEU is still the dominant one. Often this is cross-correlated with human evaluation. DQF and MQM require a lot more time and investment. One of the preferred metrics in the LSP space is productivity measurement as a core metric, as this goes straight to the business’s bottom line. In this model, it is not so much the number of errors made, but how quickly the errors can be fixed and the sentence edited to production quality.
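
    For reference, BLEU can be computed with standard open-source tooling; the snippet below uses the sacrebleu Python package (assumed to be installed) with made-up example sentences.

      # Example of scoring MT output with BLEU using the sacrebleu package.
      import sacrebleu

      hypotheses = ["The cat sits on the mat .", "He went to the bank ."]
      references = [["The cat sat on the mat .", "He went to the bank ."]]   # one reference stream

      bleu = sacrebleu.corpus_bleu(hypotheses, references)
      print(f"BLEU: {bleu.score:.1f}")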

  • In the German-English context, how would an NMT system handle a sentence such as “Ich stehe um 8 Uhr auf.”, where the two parts of a “trennbares Verb” should be identified as belonging together, although distantly separated within the sentence?

    The attention mechanism in NMT systems may align an output word to several input words, and they do not have to be next to each other. Given what we have seen so far, NMT systems handle verbs with separable prefixes comparably well, in both the input and the output.

  • What is the added value of Neural MT, compared to SMT and what are their drawbacks?

    The main benefits are generalization (the ability to learn about the translation of a word from occurrences of similar words) and use of larger context (using the entire input sentence to determine a word’s meaning and hence correct translation), which results in better readability. The main drawbacks are higher computational cost, occasional unpredictable translations, less transparency and no control beyond adding bilingual training data. For the moment, NMT systems are also less customizable, but this will be remedied in the near future as many researchers, including Omniscien researchers, are working to address such issues.

  • Have the word alignment algorithms been successfully transported to the publicly available cloud engines? What problems exist in creating a proper many-to-many word alignment? (i.e. 0, 1, 2 or more words in the source language can be mapped to 0, 1, 2 or more words in the target language).

    NMT does not use word alignment in the same manner that SMT does. NMT translates entire sentences, while SMT translates phrases. As such, the public NMT systems that we have seen do not have word alignment available and only show the sentence alignment.

    In Language Studio, our upcoming NMT offerings will include word alignment information in the same manner as our SMT offerings, available as part of the programmatically accessible workflow and log files. As with our current core SMT platform, which is a hybrid rules, syntax and SMT engine, our NMT offering will also be hybrid to ensure that features such as word alignment are backwards compatible.

  • Is the ideal training data the same as for SMT?

    The training data is basically the same, in that you use the bilingual data for both SMT and NMT. However, you don’t use the language model at all, as NMT only learns from bilingual data. There is some research into using monolingual data, but there are no substantial results that show improvement yet. As a result, if you only have a small amount of data then you may get better results from SMT using a combination of bilingual data and target-language data for writing style, in conjunction with data manufacturing such as that provided by Language Studio Professional custom engines. In addition, unlike with SMT, you do not yet have the ability to override terminology either during engine training or at runtime, which in the case of SMT is often another key factor that assists with improved translation quality. All these areas are still areas of research for NMT.

  • How do you think the reliability/accuracy of NMT will evolve? As with high-frequency trading, might we get to a situation in which humans cannot check the reliability/accuracy of NMT output?

    This is a multi-dimensional question. Firstly, as discussed in the webinar, the core NMT systems, while generally producing very good results in terms of understandability and fluency, still lack the level of control available in Statistical Machine Translation (SMT) systems, and their errors are difficult to trace and more unpredictable. At present SMT is easier to manage and provides support for features such as glossaries, non-translatable terms, rules and format handling that are needed for higher levels of customization and suited to more complex applications.

    In the near term, NMT does indeed deliver higher quality general fluency, readability and accuracy – this is important. However, a major feature available in Language Studio Professional SMT custom engines is the ability to manage and control terminology and writing style. This means NMT output may be more understandable, but may not match your style, resulting in greater levels of editing prior to publishing. To put this in context, the writing style of an automotive engineering manual and that of marketing content for the same automotive company are very different. If your marketing content reads like an engineering manual, you would have to rewrite large portions of the sentences. At this time, both approaches to MT have benefits and disadvantages. We are working to resolve this and the other outstanding limitations of NMT for our own upcoming NMT product offerings.
    Hence at this time, depending on the application, the technology choice might vary.

    As to applications where “raw” MT is used (as opposed to post-edited MT), again the application requirements will drive the choice of technology. For example, where terminology normalization is required, for example for analytics applications, SMT is far better suited since it allows full control over terminology, even at runtime. If fluency is required on the output some NMT engines will do better than SMT, so again detailed understanding of the requirements is needed to assist with the choice of technology.

Types of Machine Translation

  • What is custom Machine Translation?

    Custom machine translation is the adaptation of a machine translation system to be specialized towards a specific domain or topic.

    RBMT and SMT systems can both be customized. However, RBMT customization involves working a lot more with dictionaries and glossaries whereas SMT customization involves the gathering and preparation of bilingual and monolingual data that are used for machine learning. Some Hybrid RBMT systems will also offer some statistical customization, but this is typically a statistical smoothing approach that attempts to repair a lower quality translation.

    SMT systems are typically customized using either the Dirty Data SMT model or the Clean Data SMT model. It is much easier to build a custom machine translation engine with the Dirty Data SMT model than the Clean Data SMT model, as the Dirty Data SMT model simply mixes data together with little human cognition, understanding or skill required. The resulting custom engine is typically unpredictable and lower quality than a fully customized machine translation engine using the Clean Data SMT model. While a custom engine built on the Dirty Data SMT model may be higher quality than generic machine translation output, it will still have many issues that limit its practical use. There are many machine translation vendors offering “me too” machine translation solutions based on the Dirty Data SMT model where data can be uploaded for customization, but little real control is provided. Some of these vendors even claim to be using the Clean Data SMT model, but mislead their customers: in reality they only do the most basic of cleaning and usually do not comply with even one of the four mandatory criteria in the Clean Data SMT model.

    Custom machine translation, when performed correctly, will deliver notably higher quality translation output than generic machine translation. However, many who try custom machine translation fail due to underestimating the effort and skills required. Some machine translation vendors make it sound easy with unrealistic promises. The reality is that it is a complex task to fully customize a machine translation engine and every customization will be different. Any machine translation vendor that tries to say otherwise is either incompetent or deliberately trying to mislead you in order to sell their product or service.

    Omniscien Technologies offers our customers a full customization with expert Language Studio™ Linguist guidance throughout the process. This empowers our users to have complete control over the customization process and desired outcome while experts that have built thousands of successful custom engines execute the complex tasks and guide the process to deliver optimal results.

  • What is generic Machine Translation?

    When a machine translation system has not been customized and is not specialized in a specific domain, this is commonly referred to as “Generic” machine translation. Google Translate, Microsoft Translator and other free public translation systems fall into this category. Generic machine translation can often be useful for getting a general meaning of a piece of text, but is prone to grammar, syntax and other errors.

    Generic machine translation is not suitable for republishing content without considerable amounts of human editing. Many translators have been given generic translation output and asked to edit it for lower rates than normal translation. This has caused a number of issues in the translation industry, including giving machine translation as a whole a bad name. Some translators will even refuse to edit generic machine translation as it is often more work to edit than to fully translate the same text.

    Because giant companies like Google and Microsoft offer generic machine translation, many mistakenly believe that the quality of translation is the best possible and state-of-the-art. A customized machine translation engine that has been expertly adapted for a purpose will deliver much greater quality translations and in many cases translated sentences will require little or no editing at all. Professional translators and enterprises should consider a customized machine translation solution in order to deliver quality translations.

  • What is Do-It-Yourself (DIY) Machine Translation?

    Do-It-Yourself (DIY) Machine Translation, also known as “Self Service Machine Translation”, has recently emerged as a popular model for customizing a machine translation engine. In the last 2-3 years, the DIY machine translation space has been populated by many “me too” vendors that offer little additional value over the open source Moses tools and are difficult to differentiate from each other in terms of actual value added.

    DIY machine translation provides only a subset of the customization capabilities of a full customization process. For a more detailed comparison see the article “Understanding the Difference between Full Customization and Do It Yourself Customization”.

    The appeal of the DIY model is that it is perceived to give the user complete control over the customization process. Many DIY MT vendors have marketed heavily saying that they empower their users as they have the ability to upload their translation memories into a DIY machine translation system. However, in reality users are being stripped of the power to control the engine in these systems. Simplicity often hides reality.

    Ironically, the DIY model is frequently referred to as “upload and pray”: the user has no real control over their engine's quality beyond the data they are uploading, and in most cases does not even know the contents or quality of the translation memories being uploaded. Frequently the data comes from many sources, even third parties. Many Language Service Providers (LSPs) upload translation memories that were passed on to them by their client or another LSP, or even data acquired from organizations such as TAUS. The net result is a mixing pot of inconsistent translation memories that are difficult to manage, or remain unmanaged, and that in turn produce inconsistent machine translation output. This issue is similar to what LSPs face when they use a third-party translation memory or mix translation memories to pre-translate segments as part of a translation project.

    The inherent issue with DIY machine translation is that it implies that the user knows how to do it themselves. Case studies like IOLAR's demonstrate clearly that high quality machine translation requires considerably more effort, knowledge and skill than simply loading data into a system for training. Achieving a quality level that was usable for efficient post-editing was clearly not the simple task that TAUS and third-party DIY proponents had conveyed. The article “Understanding the Difference between Full Customization and Do It Yourself Customization” has a section that lists some of the skills necessary to build a high quality machine translation engine.

    Simply being able to upload data is not user empowerment. Such DIY systems provide a “pretty” user interface that makes it very easy and fast to create low quality custom machine translation engines with only minimal levels of control of the data upon which the custom engine is built. User empowerment comes from having total control of the data and being able to manipulate and refine the data to get the optimal result.

    There are many Natural Language Processing (NLP) specialists and computational linguists who have worked extensively with the Moses tools and also have an extensive research background. However, for most the focus has been on algorithmic improvements that offer a small overall gain. While the algorithms are certainly important, the essence of a high quality translation system that requires the least post-editing effort and delivers the highest Return On Investment (ROI) lies in the data upon which the engine is customized. Data quality and appropriateness are areas into which academia has put far less effort compared to algorithms. One of the essential factors in delivering near human quality machine translation is the correct and most optimal data, which is the focus of the Clean Data SMT model.

    There are three common forms of DIY MT:

    1. Downloading the open source Moses SMT system either directly from the site or with a third party installer.
    2. Using a third party Software-as-a-Service (SaaS) platform that has packaged Moses behind a web portal, making it easy to use; such services typically allow users to upload data through the portal.
    3. Licensing a packaged version of #2 above and installing it on your own computer systems.

    All of these solutions are based on the Moses open source machine translation system, a project that was created by and is maintained under the guidance of Professor Philipp Koehn. Professor Koehn is also the Chief Scientist and one of the shareholders in Omniscien Technologies.

    Professor Koehn worked with many other contributors to develop the Moses decoder, including Hieu Hoang, Chris Dyer, Josh Schroeder, Marcello Federico, Richard Zens, and Wade Shen. The Moses decoder has progressively matured and is the de facto benchmark for research in the field. However, as Professor Koehn explains:

     

    When conceiving the idea of Moses, the primary goal was to foster research and advance the state of MT in academia by providing a de facto base from which to innovate.
    Currently the vast majority of interesting MT research and advancements still takes place in academia. Without open source toolkits such as Moses, all the exciting work would be done by the Googles and Microsofts of the world, as is the case in related fields such as information retrieval or automatic speech recognition.
    As a platform for academic research, Moses provides a strong foundation. However, Moses was not intended to be a commercial MT offering. There are considerable amounts of additional functionality, beyond providing a web based user interface for Moses, that are not included in Moses that are essential in order to offer a strong and innovative commercial MT platform.

    Professor Philipp Koehn,
    Chief Scientist, Omniscien Technologies

     

    To develop a full featured enterprise class translation platform that addresses the shortcomings of the DIY machine translation approach, Professor Koehn has worked with the Omniscien Technologies team of linguists, researchers and developers for many years to build a translation and language processing solution that is truly innovative – Language Studio™. The innovation goes beyond software into the entire process around the customization of a high quality translation engine. This process has been captured in the Clean Data SMT model and is not available as part of any of the DIY machine translation solutions.

    Some DIY vendors are copying some of the features that Language Studio™ has had for many years, but still only a subset of the overall functionality is provided by any DIY solution in the market today. Omniscien Technologies continues to innovate and add new features and functionality on a regular basis. Working with Professor Philipp Koehn and other world leaders in the field of machine translation research ensures that Language Studio™ has the latest advances in commercial and enterprise class machine translation technologies.

  • What is the difference between “Clean Data SMT” and “Dirty Data SMT”?

    There are two common approaches to SMT corpora gathering: the Dirty Data SMT Model and the Clean Data SMT Model.

    Dirty Data SMT Model

    Theory: Good data will be more statistically relevant and bad data will be statistically irrelevant.

    Data:

    • Gathered from as many sources as possible.
    • Domain of knowledge does not matter.
    • Data quality is not important.
    • Data quantity is important.

    Benefits:

    • Broad general coverage of vocabulary in many domains.
    • Easy to find corpora.
    • Useful for gist-level translations.
    • No real skill required, simply upload data.

    Disadvantages:

    • Little or no control over translations, vocabulary and writing style.
    • Inconsistent terminology.
    • Shallow depth in any one domain, frequently resulting in mixed context for ambiguous terms.
    • Domain bias towards where most data was available rather than the desired domain.
    • Less useful for professional translation due to the editing effort required.
    • Very hard to fix any issues. The primary resolution path is to add more dirty data and hope that it fixes the issue.

    Clean Data SMT Model

    Theory: Bad or undesirable patterns cannot be learned if they don’t exist in the data.

    Data:

    • Gathered from a small number of trusted quality sources.
    • Domain of knowledge must match the target.
    • Data quality is VERY important.
    • Data quantity is less important.

    Benefits:

    • Higher quality translation.
    • Control over translations, vocabulary and writing style.
    • Deep coverage in a specific domain.
    • More suited to the professional translation community.
    • Issues can be understood and fixed with small amounts of focused corrective data.
    • Consistent terminology.

    Disadvantages:

    • More unknown words in early versions of the engines (but these can be quickly fixed).
    • More difficult to find corpora.
    • Low quality translations on out-of-domain content.
    • High level of skill and effort required in understanding and preparing data.

    Omniscien Technologies (formerly Asia Online) pioneered the Clean Data SMT Model and has built thousands of custom translation engines with this approach. Even some of the largest companies in the world, with large volumes of corpora collected over many years, still do not have enough corpora to deliver high quality translations, especially when a variety of writing styles are required (i.e. marketing and technical manuals). Language Studio™ Advanced Data Generation and Manufacturing Technologies (ADM) were developed to address this data shortfall. A customized quality development plan is executed for every custom translation engine. While a custom translation engine will always be of higher quality when your own data is provided, ADM has been proven to work well even when little or no corpora are available, so an engine can still be customized.

    Clarifying the Clean Data SMT Model


    To better understand the Clean Data SMT Model, it is helpful to explain what Clean Data SMT is not. Some MT technology vendors will offer rapid customization – “simply upload your corpora and that is it!” – we refer to this model as “upload and pray”. Any customization process that does not involve deep hands-on linguistic knowledge and a deep understanding of the data, target audience and preferred writing style cannot be considered a Clean Data SMT Model. High quality translations can never be achieved by simply uploading your existing data. The Clean Data SMT Model cannot be achieved without a deep understanding of the data and the goals of the translation engine that only a human specialist can provide.

    Many have experimented with these vendors or with open source tools such as Moses and found that the quality of their translations is frequently lower than Google's. Google follows a Dirty Data SMT Model, but has vast volumes of corpora that are probably unmatched by any other organization. A Dirty Data SMT Model is the correct model for Google, as Google's goal is to translate anything, for anyone, anytime. But the significant limitations of the Dirty Data SMT Model mean that the output translation quality is not generally suitable for professional translation due to the lack of control and the inability to address issues.

    Other MT vendors frequently claim that they offer Clean Data SMT, but in reality they are only removing formatting tags and bad data (“cleaning”). Clean Data SMT requires linguistic skills that only human specialists possess to develop and execute a data strategy that is specific to your custom engine, target audience, preferred writing style and preferred terminology. This is the approach taken with every full customization when creating your own custom engine with Language Studio™.
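
    To make the distinction concrete, the Python sketch below shows roughly what such basic mechanical “cleaning” amounts to: stripping markup and discarding obviously bad segment pairs. It is only an illustration of the baseline filtering described above; it does not represent the human-driven data strategy, domain matching and terminology work of the Clean Data SMT Model, and the thresholds used are arbitrary examples.

        # Illustrative sketch of basic mechanical "cleaning" of a bilingual corpus.
        # Thresholds and rules are arbitrary examples, not a vendor's actual process.
        import re

        TAG_RE = re.compile(r"<[^>]+>")

        def basic_clean(pairs):
            """pairs: iterable of (source, target) segment strings."""
            seen = set()
            for src, tgt in pairs:
                src = TAG_RE.sub("", src).strip()   # strip formatting tags
                tgt = TAG_RE.sub("", tgt).strip()
                if not src or not tgt:
                    continue                        # drop empty segments
                if (src, tgt) in seen:
                    continue                        # drop exact duplicates
                ratio = len(src) / max(len(tgt), 1)
                if ratio > 3 or ratio < 1.0 / 3:
                    continue                        # drop badly misaligned pairs
                seen.add((src, tgt))
                yield src, tgt

        corpus = [("<b>Press the button.</b>", "Druecken Sie die Taste."),
                  ("Press the button.", "")]
        print(list(basic_clean(corpus)))            # keeps only the first pair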

    Clean Data SMT, and SMT in general, require very large bilingual corpora. One of the major advantages of the Language Studio™ platform is its extensive collection of bilingual corpora, known as Foundation Data, covering more than 540 language pairs, with a set of 12 Foundation Domains progressively being added to each language pair. When combined with ADM, the data requirements are greatly reduced and it is even possible to create custom engines when no bilingual corpora are available.

    Language Studio™ has the Clean Data SMT Model as its data approach and SMT technologies at its core, but is built around a Hybrid Machine Translation Model to take advantage of all the best features in SMT, RBMT and Syntax approaches.


  • What is Hybrid Machine Translation?

    Hybrid machine translation is a method of machine translation that is characterized by the use of multiple machine translation approaches within a single machine translation system. The motivation for developing hybrid machine translation systems stems from the failure of any single technique to achieve a satisfactory level of accuracy.

    Although there are several forms of hybrid machine translation such as Multi-Engine, statistical rule generation and multi-pass, the most common forms are:

    • Rules Post-Processed by Statistics: Translations are performed using a rules based engine. Statistics are then used in an attempt to adjust/correct the output from the rules engine. This is also known as statistical smoothing and automatic post editing. This is more of a “Band-Aid” approach to machine translation where there is an attempt to improve lower quality output from an RBMT engine rather than addressing the root cause of issues.
    • Statistics Guided by Rules: Rules are used to pre-process data in an attempt to better guide the statistical engine. Rules are also used to post-process the statistical output to perform functions such as normalization. This approach has much more power, flexibility and control when translating. Many issues can be addressed at their root causes through rules that go beyond the capabilities of a statistical-only approach.

    Language Studio™ uses a more advanced form of statistics guided by rules that also incorporates syntax based and multi-pass machine translation. The specific approach will vary depending on the language pair and domain complexity. Language Studio™ supports the ability to enable or disable the various hybrid components at an individual engine level.
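
    The Python sketch below illustrates the “statistics guided by rules” pattern described above: deterministic rules pre-process the input, a statistical decoder (stubbed out here) performs the translation, and further rules post-process the output, for example normalizing dates into the target locale format. The function names and the date rule are invented for illustration and do not describe the internal design of Language Studio™.

        # Minimal sketch of "statistics guided by rules": rule-based pre-processing,
        # a stubbed statistical decoding step, and rule-based post-processing.
        import re

        def rule_preprocess(text):
            """Protect dates so the decoder treats them as single units."""
            return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", r"__DATE_\1_\2_\3__", text)

        def statistical_decode(text):
            """Placeholder for the SMT decoder (e.g. a Moses invocation)."""
            return text  # stub: a real system would return the decoded translation

        def rule_postprocess(text):
            """Normalize protected dates into the target locale format."""
            return re.sub(r"__DATE_(\d+)_(\d+)_(\d+)__", r"\2.\1.\3", text)

        def hybrid_translate(text):
            return rule_postprocess(statistical_decode(rule_preprocess(text)))

        print(hybrid_translate("Delivered on 4/28/2017."))   # -> "Delivered on 28.4.2017."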

  • What is Syntax Based Machine Translation (SBMT)?

    The goal of syntax-based machine translation techniques is to incorporate an explicit representation of syntax into the statistical machine translation systems. Syntax-based translation is based on the idea of translating syntactic units, rather than single words or strings of words (as in phrase-based MT), i.e. (partial) parse trees of sentences/utterances.

    The idea of syntax-based translation is quite old in MT, though its statistical counterpart did not take off until the advent of strong stochastic parsers in the 1990s. Examples of this approach include DOP-based MT and, more recently, synchronous context-free grammars. One of the challenges of the syntax-based approach is translation speed: improvements in translation quality have been noted with the use of syntax-based translation, but the speed of translation is significantly lower than with other approaches.
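
    As a toy illustration of the synchronous grammar idea, the Python sketch below pairs a source-side pattern with a target-side pattern so that a whole syntactic unit is translated and reordered together, rather than word by word. The rule and the two-word lexicon are invented examples, not an actual SBMT grammar.

        # Toy synchronous rule: English "ADJ N" corresponds to French "N ADJ",
        # so the noun phrase is translated and reordered as one unit.
        RULE_NP = {"source": ["ADJ", "N"], "target": ["N", "ADJ"]}
        LEXICON = {"ADJ": {"red": "rouge"}, "N": {"car": "voiture"}}

        def translate_np(adj, noun):
            translations = {"ADJ": LEXICON["ADJ"][adj], "N": LEXICON["N"][noun]}
            # Generate the target side in the order given by the synchronous rule.
            return " ".join(translations[symbol] for symbol in RULE_NP["target"])

        print(translate_np("red", "car"))   # -> "voiture rouge"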

    Amr Ahmed and Greg Hanneman of the Language Technologies Institute at Carnegie Mellon University wrote a very good overview of Syntax-Based Statistical Machine Translation.

  • What is Statistical Machine Translation (SMT)?

    Statistical machine translation (SMT) learns how to translate by analyzing existing human translations (known as bilingual text corpora). In contrast to the Rules Based Machine Translation (RBMT) approach, which is usually word based, most modern SMT systems are phrase-based and assemble translations using overlapping phrases. In phrase-based translation, the aim is to reduce the restrictions of word-based translation by translating whole sequences of words, where the lengths may differ. These sequences of words are called phrases, but they are typically not linguistic phrases; rather, they are phrases found using statistical methods from bilingual text corpora.

    Analysis of bilingual text corpora (source and target languages) and monolingual corpora (target language) generates statistical models that transform text from one language to another; statistical weights are then used to decide the most likely translation.
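
    The following Python sketch shows, in miniature, how such statistical weights decide the most likely translation using the standard log-linear formulation of phrase-based SMT: each candidate is scored by a weighted sum of feature scores (for example a translation model and a language model score) and the highest scoring candidate wins. The candidates, feature values and weights are invented examples.

        # Minimal sketch of weighted (log-linear) scoring over translation candidates.
        # Feature values and weights are invented for illustration.
        import math

        WEIGHTS = {"translation_model": 1.0, "language_model": 0.7}

        candidates = [
            ("the house is small",
             {"translation_model": math.log(0.4), "language_model": math.log(0.30)}),
            ("the house is little",
             {"translation_model": math.log(0.5), "language_model": math.log(0.05)}),
        ]

        def score(features):
            return sum(WEIGHTS[name] * value for name, value in features.items())

        best = max(candidates, key=lambda c: score(c[1]))
        print(best[0])   # the stronger language model score wins despite a weaker phrase score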

    Clean Data SMT, and SMT in general, require very large bilingual corpora. One of the major advantages of the Language Studio™ platform is its extensive collection of bilingual corpora, known as Foundation Data, covering more than 540 language pairs, with a set of 12 Foundation Domains progressively being added to each language pair. When combined with ADM, the data requirements are greatly reduced and it is even possible to create custom engines when no bilingual corpora are available.

    Language Studio™ has the Clean Data SMT Model as its data approach and SMT technologies at its core, but is built around a Hybrid Machine Translation Model to take advantage of all the best features in SMT, RBMT and Syntax approaches.

  • What is Rules Based Machine Translation (RBMT)?

    Rule Based Machine Translation (RBMT) systems were the first commercial machine translation systems and are based on linguistic rules that allow words to be put in different places and to have different meanings depending on context. RBMT technology applies large collections of linguistic rules in three different phases: analysis, transfer and generation. Rules are developed by human language experts and programmers who have invested extensive effort to understand and map the rules between two languages. RBMT relies on manually built translation lexicons, some of which can be edited and refined by users to improve translation.
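
    As a toy illustration of these three phases, the Python sketch below analyzes an English noun phrase, transfers it into Spanish with a dictionary lookup plus a single adjective-noun reordering rule, and then generates the target string. The dictionary and the rule are invented examples of how rules map structure between two languages, not an actual RBMT rule set.

        # Toy RBMT pipeline: analysis -> transfer -> generation.
        # The dictionary and the ADJ/N reordering rule are invented examples.
        DICTIONARY = {"white": ("ADJ", "blanca"), "house": ("N", "casa"), "the": ("DET", "la")}

        def analyze(sentence):
            """Analysis: tag each source word with its part of speech."""
            return [(word, DICTIONARY[word][0]) for word in sentence.lower().split()]

        def transfer(analyzed):
            """Transfer: translate words and reorder adjective-noun pairs."""
            words = [(DICTIONARY[word][1], pos) for word, pos in analyzed]
            reordered, i = [], 0
            while i < len(words):
                if i + 1 < len(words) and words[i][1] == "ADJ" and words[i + 1][1] == "N":
                    reordered += [words[i + 1], words[i]]   # noun before adjective
                    i += 2
                else:
                    reordered.append(words[i])
                    i += 1
            return reordered

        def generate(transferred):
            """Generation: produce the target-language surface string."""
            return " ".join(word for word, _ in transferred)

        print(generate(transfer(analyze("the white house"))))   # -> "la casa blanca"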

    RBMT allows for some control via lexicon and user dictionary refinements and is relatively predictable in its output. However, these refinements can be very time consuming to implement and maintain, and in some cases can lead to lower quality translations due to the ambiguity of terms. As translations are produced by rules, the output can often read more “machine like” in writing style, and although translations can be understandable, they are often not fluent. This is frequently referred to as “gist” quality, where the substance of the translation is reasonably clear, but considerable post-editing work is required to adapt the translation output to a specific target audience and writing style.

    RBMT developers have recently attempted to address some of the limitations of RBMT by augmenting their core RBMT technology with some of the techniques of Statistical Machine Translation (SMT) and have been marketing their products as a Hybrid MT model. There are a number of different “Hybrid MT” models and each should be understood for their benefits and limitations.

    A more technical summary of RBMT can be found on Wikipedia: http://en.wikipedia.org/wiki/Rule-based_machine_translation