No Data – No Problem!

Want to build a fabulous new custom machine translation engine for a new client of yours?

There is, however, one problem: you do not have enough data to give Omniscien Technologies to build this engine. What do you do? Is there a solution?

Fear not – there is a solution!

At Omniscien Technologies we often encounter customers who have little or no data at all, so we have developed a set of processes to address such cases. This scenario usually arises when a client is new to the LSP and no historical linguistic assets are available. It also occurs in most Enterprise use cases.

With a Language Studio PROFESSIONAL custom machine translation engine, we provide a process that addresses this situation, as well as cases where other amounts of data are available.



We start the process by understanding the exact requirement and then, based on that understanding, tailor an Engine Production Plan that is specifically designed to meet our customer’s goals. This is a collaborative effort: the Omniscien Technologies team uses Language Studio tools for 90-95% of the work and returns to the customer’s team when there are language-specific and client-specific tasks to be performed.

The details of the plan depend on the customer’s project and requirements, but below are some examples of the kind of work the Omniscien Technologies experts may perform for our customers as part of the engine development:

  1. Terminology Analysis: If our customers are able to provide the source text, or other similar text that is expected to be translated, then we can perform a number of terminology-related tasks:
    1. Term Extraction: Extract terminology and count how often each term occurs, so that we know how much impact it has on the document.
    2. Unknown Terms Analysis: Cross-reference the extracted terms with the customer’s translation memories and the translation memories used to train Language Studio Industry engines to determine which terms are unknown.
    3. Term Candidate Generation: Generate three possible translations for each unknown term, analyze their likelihood and recommend the most likely one. A term review tool is provided that allows a translator to increase the typical term resolution rate from 20-30 terms per hour to around 160 terms per hour. The data and tools are prepared by Language Studio experts and then sent to the customer for human review.
    4. Term Confidence Analysis: While the engine may be able to translate a term, it may not have high confidence in the quality of that translation. A list of such terms and their translations can be provided for a linguist to review.
    5. Term Variance Analysis: Produce a list of terms that have multiple translations in the customer’s translation memories. This gives the customer the opportunity to normalize terminology.


  2. Data Manufacturing: The creation of data for style, grammar and vocabulary choice.
    1. Bilingual Data Alignment: If customers are able to obtain documents in two languages from their client, we can align them and create translation memories for our customers.
    2. Term Generation: As noted above under Term Candidate Generation, depending on how much effort the customer’s team is willing to contribute to term validation, many of the lower-frequency (less important) terms can be auto-generated, with those that are not correct being edited in the post-editing stage. The engine will progressively learn the correct translations from post-editing feedback.
    3. Language Model Generation: The creation of large volumes of target-language monolingual data for writing style guidance.
    4. Crawling: We are able to mine data from the web that has a similar writing style and add it to the customer’s engine data.
    5. Data Mining: Omniscien Technologies has gathered and prepared billions of words in every language that we support and has indexed them in a variety of ways. We are able to mine and extract data in the target language that represents the writing style and context of the customer’s engine.


  3. Industry Engines as a Base: The data used to build our Industry engines provides a solid foundation on which to build the customer’s engine. We layer the data that customers provide, together with the manufactured data, on top of this foundation to produce high-quality translation output.




  4. Post-Editing Feedback: Even with a small amount of initial data, if customers provide their post-edited data back to the engine on a regular basis, they can achieve notable improvements in translation quality very quickly.
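
To make the Terminology Analysis steps more concrete, here is a minimal Python sketch of three of them: Term Extraction with a frequency counter, Unknown Terms Analysis and Term Variance Analysis. It is an illustration only and not how Language Studio implements these steps. The function names, the tiny stopword list and the simple n-gram approach are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny illustrative stopword list, not a production resource.
STOPWORDS = frozenset({"the", "if", "a", "an", "of", "and", "to"})


def extract_terms(text, max_len=2, stopwords=STOPWORDS):
    """Count candidate terms: word n-grams of up to max_len words that
    neither start nor end with a stopword."""
    tokens = re.findall(r"[a-z][a-z\-]*", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            counts[" ".join(gram)] += 1
    return counts


def unknown_terms(term_counts, known_terms):
    """Return (term, frequency) pairs for terms missing from the known
    set, most frequent first, ties broken alphabetically."""
    missing = [(t, c) for t, c in term_counts.items() if t not in known_terms]
    return sorted(missing, key=lambda tc: (-tc[1], tc[0]))


def term_variance(term_pairs):
    """Given (source term, target term) pairs, return the source terms
    that have more than one distinct translation."""
    seen = defaultdict(set)
    for source, target in term_pairs:
        seen[source.lower()].add(target)
    return {s: sorted(t) for s, t in seen.items() if len(t) > 1}


if __name__ == "__main__":
    sample = ("The hydraulic pump drives the actuator. "
              "Replace the hydraulic pump if the actuator stalls.")
    counts = extract_terms(sample)
    print(counts.most_common(3))
    print(unknown_terms(counts, known_terms={"pump", "actuator"})[:2])
    print(term_variance([("pump", "Pumpe"), ("pump", "Förderpumpe"),
                         ("actuator", "Aktuator")]))
```

In this sketch, the frequency ranking stands in for term importance, the known-terms set stands in for terms already covered by existing translation memories, and the variance report lists the terms a customer may want to normalize.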


Want to know more? Contact us at sales@omniscien.com and our Language Studio experts will guide you through our tools and products.