In the process of building thousands of custom machine translation engines for clients, the Omniscien team found many manual tasks to be repetitive and often too large and too error-prone to be performed by humans alone. Over many years, a large number of internal tools were developed to help automate these processes and improve data quality. With the release of Language Studio V6.0, these tools are now available as part of the standard Language Studio platform.

Language is a thing of beauty. Mastering a language, even your mother tongue, can be a challenge. The ability of machines to understand language has improved rapidly in recent years, and artificial intelligence has played a major role. Natural Language Processing (NLP) has seen major leaps forward, as have the core technologies used for machine translation.

The one thing that all these new technologies have in common is their ever-growing need for high-quality data. Language Studio provides a wide range of tools that are designed with two core goals:

  • Organize, classify, and clean existing language assets.
  • Create new high-quality bilingual data.

Existing language assets are extremely valuable but can be in forms that are not easily usable, or may have become disorganized over time as different platforms and technologies were deployed in the organization. Language Studio quickly prepares data in clean and normalized formats and analyzes quality so that the maximum value can be obtained.

Few organizations have the data volumes needed to create a high-quality machine translation engine, and many are starting out with no data at all. Language Studio provides a series of tools that mine bilingual data, synthesize artificial data, and match bilingual data. Millions of high-quality bilingual sentences can be created in a short period of time. These tools are all available as part of the Professional Guided Custom Machine Translation Engine training process.

While others try to provide understandable generic translations, our goal is to allow you to publish content with the least amount of human effort and editing possible. Specialized Industry Domains complemented with synthesized data achieve exactly that. Omniscien provides the data, tools, and expertise for success.

There are no shortcuts to building a high-quality MT engine, but there is a process and a set of tools that make it easier and faster. With more than 10,000 custom engines developed, the Omniscien team provides a proven path to optimal MT quality.

– Dion Wiggins, CTO, Omniscien Technologies

Extract and Create Bilingual Glossaries and Terminology

Language Studio can import and process industry-standard glossary formats such as TBX. Extending a TBX file or creating a new glossary can take a considerable amount of time. To quickly create your own bilingual glossaries, Language Studio provides a comprehensive set of tools that extract and resolve bilingual glossary terms. Glossaries can be used as training data for custom machine translation engines or as runtime rules when processing a translation.

Reduce the time to create a glossary for a body of text from hours to minutes. Language professionals can quickly validate and fine-tune the automated results. When creating bilingual glossaries, language professionals need only check and adjust suggested glossary phrases, increasing their productivity from an average of 20 to more than 160 terms per hour.

Glossary and Term Extraction

Analyze a body of text to automatically extract glossary terminology in minutes. After reviewing the suggested glossary terms, selected terms can be sent to automated Bilingual Glossary Creation.

Bilingual Glossary Creation

Automatically resolve glossary terminology across languages. Recall existing user-defined terminology translations and suggest new terminology translations.

Create and Match Bilingual Data

Turn your legacy language data assets into “data gold”. Many organizations and Language Service Providers (LSPs) have large numbers of files (sometimes tens of thousands) that have been built using multiple platforms, sometimes with less-than-optimal management controls and filing systems.

Simply upload files and automatically identify and match their bilingual document pairs and extract high-quality bilingual sentences. Even monolingual data can be leveraged as a driver for synthetic data. These are just a few of the many ways that Language Studio creates new data assets that bolster your own custom machine translation engine.

Bilingual Document Pair Matching

Upload thousands of documents in different languages and formats. Artificial Intelligence is used to automatically determine pairs of documents that are translations of each other. Paired documents can then be processed with automated Bilingual Sentence Matching to create bilingual sentences.

Bilingual Sentence Matching

Automatically extract matching bilingual sentence pairs from bilingual documents. Output is produced in common translation memory formats such as TMX, XLIFF, CSV, and tab-pair. Create translation memories and bilingual sentences that can be used in a wide variety of localization tools and as training data for customizing your own machine translation engine.
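
As an illustration of what TMX output looks like, the following is a minimal Python sketch that writes matched sentence pairs to a TMX 1.4 document using only the standard library. It is not Language Studio's implementation: the function name `build_tmx`, the language codes, and the header values are placeholders chosen for this example.

```python
import xml.etree.ElementTree as ET

# Qualified attribute name that ElementTree serializes as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="fr"):
    """Return a TMX 1.4 document (as a string) for (source, target) pairs.

    Hypothetical helper for illustration; header values are placeholders.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        # One translation unit (tu) per sentence pair, one variant (tuv) per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx([("Hello.", "Bonjour."), ("Thank you.", "Merci.")]))
```

Each sentence pair becomes one translation unit, so the resulting file can be loaded into tools that read TMX.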

Data Synthesis

Human-translated data is the best data to teach a machine how to translate, but often there is only a small amount of such data, or none at all, that can be used. Synthesized bilingual sentences are a fast way to build the data volume needed to customize a machine translation engine to your style, genre, and vocabulary.

Bilingual Data Mining

Point Language Studio tools to a bilingual website to automatically download all its contents, match pages that are translations of each other, and extract the bilingual sentences from the matched pages.

Extract, Analyze, Organize and Normalize Data Assets

Basic organization, preparation, and normalization of data files are often a challenge, especially for legacy or acquired third-party data. Automatically analyze, convert, extract, compare, measure, normalize, bulk rename, and classify.

Language Identification

Automatically detect the language of a file and organize the files into folders grouped by language.

Bulk Renaming

Leverage data that you have on file to rename all your files so that they are named consistently.

Bulk Encoding Conversion

Automatically detect the encoding of a file and convert it to a specified encoding such as UTF-8.
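
The idea can be sketched in a few lines of Python. This is a simplified illustration, not how Language Studio detects encodings: it assumes a short, caller-supplied list of candidate encodings and picks the first one that decodes without error, whereas production detectors typically rely on statistical analysis. The function names are hypothetical.

```python
def decode_with_fallback(data, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    The candidate list is an assumption for this sketch; order matters,
    because permissive single-byte encodings accept almost any input.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


def to_utf8(data, candidates=("utf-8", "cp1252")):
    """Re-encode raw bytes as UTF-8 after guessing the source encoding."""
    text, _ = decode_with_fallback(data, candidates)
    return text.encode("utf-8")


if __name__ == "__main__":
    legacy_bytes = "café".encode("cp1252")
    print(decode_with_fallback(legacy_bytes))
    print(to_utf8(legacy_bytes))
```

Trying UTF-8 first is a deliberate choice: valid UTF-8 is rarely produced by accident, so a clean UTF-8 decode is a strong signal, while a single-byte encoding such as cp1252 is only a fallback.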

Bulk Format Conversion and Text Extraction

Convert between, and extract text from, a wide variety of formats. These include caption and subtitle formats such as SRT, TTML, DFXP, and WebVTT, as well as Adobe PDF, image, Word, PowerPoint, Excel, email, Open Office, CSV, HTML, XML, plain text, and more.

Advanced Data Cleaning

Deeply analyze data to determine encoding and formatting and to separate noise and “junk” from quality data. Fine-tune data, fix broken/split sentences, and rate grammar and text quality. Once analysis is complete, use powerful user-defined filters to extract just the content you want. Clean up your legacy translation memories in a flash.

Domain Classification and Similarity

Automatically analyze and classify language assets by domain. Use pre-defined Language Studio domains or specify your own domains for custom domain classification.

Sentence Segmentation and Split Sentence Joining

Accurately break text at sentence boundaries. Repair content that has sentences broken/split across lines. Sentences extracted from PDFs that have a carriage return splitting each line are automatically repaired back to full sentences. A combination of machine learning technologies and rules enables a high level of accuracy. Supports a variety of languages, including Asian languages such as Chinese, Japanese, Korean and Thai that do not have spaces between words or do not have end-of-sentence boundary markers.
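
A purely rule-based version of this repair can be sketched in Python as follows. It is a simplified illustration under stated assumptions, not the combination of machine learning and rules described above: blank lines are treated as paragraph breaks, and a sentence boundary is assumed wherever ".", "!" or "?" is followed by whitespace and a capital letter. It will therefore mis-split abbreviations such as "Dr. Smith" and does not handle languages without spaces or end-of-sentence markers. The function name is hypothetical.

```python
import re

# Assumed boundary rule: terminator, whitespace, then a capital letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def repair_and_segment(text):
    """Join hard-wrapped lines within each paragraph, then split into sentences."""
    sentences = []
    for paragraph in re.split(r"\n\s*\n", text):
        # Undo line breaks that split sentences inside the paragraph.
        joined = " ".join(
            part.strip() for part in paragraph.splitlines() if part.strip()
        )
        if joined:
            sentences.extend(BOUNDARY.split(joined))
    return sentences


if __name__ == "__main__":
    extracted = (
        "The report was\nextracted from a PDF. Each line\n"
        "ends with a hard break.\n\nA new paragraph starts here."
    )
    for sentence in repair_and_segment(extracted):
        print(sentence)
```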

Tokenization

For machines to understand text, the text must first be broken into its individual parts, a process known as tokenization. Tokenization adds spaces between tokens, where a token is a word or other logical part of a sentence, such as a punctuation mark or a number. This is needed for all languages, including those that normally have no spaces between words, such as Chinese, Japanese, and Thai.
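
For a language that already separates words with spaces, the idea can be sketched with a single regular expression in Python. This is a minimal illustration, not Language Studio's tokenizer: the pattern and the function name are assumptions made for this example, and languages written without spaces need dictionary-based or statistical word segmentation instead.

```python
import re

# Assumed token pattern, tried in order:
#   1. numbers, keeping internal separators (1,250.50)
#   2. words, keeping internal apostrophes (don't)
#   3. any single character that is neither a word character nor whitespace
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:['’]\w+)*|[^\w\s]")


def tokenize(sentence):
    """Return the sentence with a single space between each token."""
    return " ".join(TOKEN.findall(sentence))


if __name__ == "__main__":
    print(tokenize("Hello, world! It costs 1,250.50 dollars."))
    # prints: Hello , world ! It costs 1,250.50 dollars .
```

Punctuation is split away from the neighboring words, while the number keeps its internal separators and remains a single token.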

Entity Extraction

Extract named entities from large bodies of text to better understand and process data. Entities include people names, organization names, events, dates, time, product names, and many more.

Sentiment Analysis

Sentiment Analysis tools identify and extract the opinions, feelings, attitudes, and emotions of the author of a piece of text. Sentiment can be based on individual sentences, paragraphs, or entire documents.

Compare Multiple Files

Compare two files to determine their differences and then accept the changes that you prefer.

Measure Change Effort

Compare and measure multiple file versions with scored metrics. Start with a source language file then add the machine translation, post-edited, and quality assurance versions for a full set of comprehensive metrics.

Automated Quality Metrics

Measure the quality of machine translations against human translations with a variety of metrics including BLEU, F-Measure, Word Error Rate (WER), Translation Edit Rate (TER), Translation Edit Rate Plus (TERP), METEOR, ROUGE, NIST, RIBES, and BEER.
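
To make one of these metrics concrete, the following Python sketch computes Word Error Rate: the word-level edit distance (substitutions, deletions, and insertions) between a machine translation and a human reference, divided by the number of words in the reference. It is a minimal illustration with simple whitespace tokenization and a hypothetical function name, not the scoring code used by Language Studio.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    human = "the cat sat on the mat"
    machine = "the cat sat on mat"
    print(round(word_error_rate(human, machine), 3))
```

Lower is better: a score of 0.0 means the machine output matches the reference word for word, and the score can exceed 1.0 when the output is much longer than the reference.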

Integrated Workflow Studio Features

Workflow Studio provides a wide range of Natural Language Processing (NLP) and advanced text analytics tools that are tightly integrated into workflow automation and scripting tools.

