Powerful Tools for Data Creation, Preparation, and Analysis
Convert legacy files into “data gold”. Create new data quickly and efficiently.
In the process of building thousands of custom machine translation engines for clients, the Omniscien team found many manual tasks to be repetitive, too large, and too error-prone to be performed by humans alone. Over many years, a large number of internal tools were developed to help automate these processes and improve data quality. With the release of Language Studio V6.0, these tools are now available as part of the standard Language Studio platform.
Language is a thing of beauty. Mastering a language, even your mother tongue, can be a challenge. The ability of machines to understand language has improved rapidly in recent years. Artificial intelligence has played a major role. Natural Language Processing (NLP) has seen major leaps forward, as have the core technologies used for machine translation.
The one thing all these new technologies have in common is an ever-growing need for high-quality data. Language Studio provides a wide range of tools designed with two core goals:
- Organize, classify, and clean existing language assets.
- Create new high-quality bilingual data.
Existing language assets are extremely valuable, but they can be in forms that are not easily usable, or they may have become disorganized over time as different platforms and technologies were deployed across the organization. Language Studio quickly prepares data in clean, normalized formats and analyzes its quality so that the maximum value can be obtained.
Few organizations have the data volumes needed to create a high-quality machine translation engine, and many are starting out with no data at all. Language Studio provides a series of tools that mine data, synthesize artificial data, and match bilingual data. Millions of high-quality bilingual sentences can be created in a short period of time. These tools are all available as part of the Professional Guided Custom Machine Translation Engine training process.
Talk to the Omniscien team to discuss how best to get value out of your existing language assets and the best strategy to create new high-quality bilingual data.
“While others try to provide understandable generic translations, our goal is to allow you to publish content with the least amount of human effort and edits possible. Specialized Industry Domains complemented with synthesized data achieve exactly that. Omniscien provides the data, tools, and expertise for success.
There are no shortcuts to building a high-quality MT engine, but there is a process and a set of tools that make it easier and faster. With more than 10,000 custom engines developed, the Omniscien team provides a proven path to optimal MT quality.”
– Dion Wiggins, CTO, Omniscien Technologies
Available as two Platform Editions specifically designed to match different business needs.
- Language Studio Secure Cloud: hosted by Omniscien
- Language Studio Enterprise: on-premises or private cloud
Extract and Create Bilingual Glossaries and Terminology
Language Studio can import and process industry-standard glossary formats such as TBX. Extending a TBX file or creating a new glossary by hand can take a considerable amount of time. To quickly create your own bilingual glossaries, Language Studio provides a comprehensive set of tools that extract and resolve bilingual glossary terms. Glossaries can be used as training data for custom machine translation engines or as runtime rules when processing a translation.
Reduce the time to create a glossary for a body of text from hours to minutes. Language professionals can quickly validate and fine-tune the automated results. When creating bilingual glossaries, language professionals need only check and adjust suggested glossary phrases, increasing their productivity from an average of 20 to more than 160 terms per hour.
Glossary and Term Extraction
Analyze a body of text to automatically extract glossary terminology in minutes. After reviewing the suggested glossary terms, selected terms can be sent to automated Bilingual Glossary Creation.
Terminology Extraction, Glossary Extraction, NLP
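As a rough illustration of what automated term extraction involves, the sketch below counts repeated word sequences (n-grams) in a body of text and ranks them as candidate glossary terms for human review. It is a deliberately simple frequency-based approach, not Language Studio's actual method, and the function name `candidate_terms`, the stopword list, and the thresholds are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, very small stopword list used only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for",
             "on", "with", "by", "be", "are", "or", "that", "must", "as"}

def candidate_terms(text, max_len=3, min_count=2):
    """Return (term, count) pairs for repeated n-grams of up to max_len words."""
    counts = Counter()
    # Split on punctuation so that no n-gram spans a sentence or clause boundary.
    for chunk in re.split(r"[.!?;:,\n]+", text.lower()):
        words = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", chunk)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Skip candidates that start or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    ranked = [(term, c) for term, c in counts.items() if c >= min_count]
    # Most frequent first; among ties, longer terms first, then alphabetical.
    ranked.sort(key=lambda tc: (-tc[1], -len(tc[0].split()), tc[0]))
    return ranked

sample = (
    "The translation memory stores approved segments. "
    "A translation memory must be cleaned before training. "
    "Export the translation memory as TMX."
)
terms = candidate_terms(sample)
print(terms)
```

A reviewer would then accept or reject each suggestion, which mirrors the check-and-adjust workflow described above. A production system would typically add linguistic filtering and statistical scoring on top of raw counts.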
Bilingual Glossary Creation
Automatically resolve glossary terminology across languages. Recall existing user-defined terminology translations and suggest new terminology translations.
Terminology Resolution, Glossary Resolution, Bilingual Glossary
Create and Match Bilingual Data
Turn your legacy language data assets into “data gold”. Many organizations and Language Service Providers (LSPs) have large numbers of files (sometimes tens of thousands) that were built on multiple platforms, sometimes with less-than-optimal management controls and filing systems.
Simply upload your files to automatically identify and match bilingual document pairs and extract high-quality bilingual sentences. Even monolingual data can be leveraged as a driver for synthetic data. These are just a few of the many ways that Language Studio creates new data assets that bolster your own custom machine translation engine.
Bilingual Document Pair Matching
Upload thousands of documents in different languages and formats. Artificial Intelligence is used to automatically determine pairs of documents that are translations of each other. Paired documents can then be processed with automated Bilingual Sentence Matching to create bilingual sentences.
Document Alignment, Document Matching
Bilingual Sentence Matching
Automatically extract matching bilingual sentence pairs from bilingual documents. Output is produced in common translation memory and data formats such as TMX, XLIFF, CSV, and tab-delimited pairs. Create translation memories and bilingual sentences that can be used in a wide variety of localization tools and as training data for customizing your own machine translation engine.
Sentence Alignment, Sentence Matching
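To make the TMX output format concrete, here is a minimal sketch that writes a list of already-matched sentence pairs as a TMX 1.4 document using only Python's standard library. The function name `build_tmx`, the header values, and the sample pairs are assumptions for illustration; this is not Language Studio's exporter, and the matching step itself is not shown.

```python
import xml.etree.ElementTree as ET

# Qualified name for the xml:lang attribute (the "xml" prefix is predefined).
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_tmx(pairs, src_lang="en", tgt_lang="fr"):
    """Serialize (source, target) sentence pairs as a TMX 1.4 string."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX; the values here are placeholders.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per sentence pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

xml_text = build_tmx([("Hello.", "Bonjour."), ("Thank you.", "Merci.")])
print(xml_text)
```

Each `tu` element holds one sentence pair, with one `tuv` per language, which is the structure most translation memory tools expect when importing TMX.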
Data Synthesis
Human-translated data is the best data for teaching a machine how to translate. Often, however, there is only a small amount of such data available, or none at all. Synthesized bilingual sentences are a fast way to build the data volume needed to customize a machine translation engine to your style, genre, and vocabulary.
Data Synthesis, Synthetic Data, Manufactured Data
Bilingual Data Mining
Point Language Studio at a bilingual website to automatically download all its contents, match pages that are translations of each other, and extract the bilingual sentences from the matched pages.
Manufactured Data, Data Manufacturing, Data Mining
Extract, Analyze, Organize and Normalize Data Assets
Language Identification
Language ID, Language Classification, Language Detection, NLP
Bulk Renaming
File Classification, NLP
Bulk Encoding Conversion
Encoding Detection, Encoding Conversion, NLP
Bulk Format Conversion and Text Extraction
Convert between a wide variety of formats, including caption and subtitle formats such as SRT, TTML, DFXP, and WebVTT, as well as document formats such as Adobe PDF, images, Word, PowerPoint, Excel, email, OpenOffice, CSV, HTML, XML, plain text, and more.
Document Conversion, Text Extraction, NLP
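As one small example of a subtitle format conversion, the sketch below turns SRT text into WebVTT. The two formats are close relatives: WebVTT requires a `WEBVTT` header line and uses a period rather than a comma before the milliseconds in cue timings. The function name `srt_to_webvtt` is an assumption for this example, styling and positioning are ignored, and this is not how Language Studio itself performs conversion.

```python
import re

# SRT writes 00:00:01,000 while WebVTT writes 00:00:01.000.
_TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")

def srt_to_webvtt(srt_text):
    """Convert plain SRT subtitle text to a basic WebVTT document."""
    lines = ["WEBVTT", ""]  # mandatory header followed by a blank line
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Only timing lines are rewritten; commas in cue text are kept.
            line = _TIMING.sub(r"\1.\2", line)
        lines.append(line)
    return "\n".join(lines) + "\n"

srt = """1
00:00:01,000 --> 00:00:04,000
Hello, world.

2
00:00:05,500 --> 00:00:07,250
Goodbye.
"""
vtt = srt_to_webvtt(srt)
print(vtt)
```

The numeric SRT counters are left in place because WebVTT accepts them as optional cue identifiers.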
Advanced Data Cleaning
Difference Analysis, Change Analysis, File Compare, NLP
Domain Classification and Similarity
Data Classification, Domain Identification, NLP
Sentence Segmentation and Split Sentence Joining
Accurately break text at sentence boundaries. Repair content in which sentences have been split across lines. For example, sentences extracted from PDFs that have a carriage return at the end of each line are automatically repaired back into full sentences. A combination of machine learning technologies and rules enables a high level of accuracy. Supports a variety of languages, including Asian languages such as Chinese, Japanese, Korean, and Thai that do not have spaces between words or do not have end-of-sentence boundary markers.
Sentence Joining, Sentence Splitting, Sentence Segmentation, NLP
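The idea of joining split sentences and then re-segmenting can be sketched with a few rules. The example below is a naive, rules-only illustration for English-like text: it treats blank lines as paragraph breaks, joins hard-wrapped lines within a paragraph, and then splits after sentence-final punctuation that is followed by a capital letter. The function name `repair_and_segment` is hypothetical. It does not handle abbreviations such as "Dr.", nor languages without spaces or end-of-sentence markers, which is where the machine learning component described above would be needed.

```python
import re

# Split after ., ! or ? when followed by whitespace and an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def repair_and_segment(text):
    """Join hard-wrapped lines into paragraphs, then split into sentences."""
    paragraphs, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))

    sentences = []
    for paragraph in paragraphs:
        sentences.extend(_BOUNDARY.split(paragraph))
    return sentences

pdf_text = (
    "Language Studio repairs sentences that\n"
    "were split across lines. It also breaks\n"
    "text at sentence boundaries.\n"
    "\n"
    "A new paragraph starts here."
)
sentences = repair_and_segment(pdf_text)
for s in sentences:
    print(s)
```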
Tokenization
Tokenize, NLP
Entity Extraction
Extract named entities from large bodies of text to better understand and process data. Entities include names of people, organization names, events, dates, times, product names, and many more.
Named Entity Recognition, NER, Term Extraction, NLP
Sentiment Analysis
Sentiment Analysis tools identify and extract the opinions, feelings, attitudes, and emotions of the author of a piece of text. Sentiment can be assessed at the level of individual sentences, paragraphs, or entire documents.
Sentiment Analysis, Opinion Mining, NLP
Compare Multiple Files
Difference Analysis, Change Analysis, File Compare, NLP
Measure Change Effort
Metrics, Change Analysis, Difference Analysis, NLP
Automated Quality Metrics
Measure the quality of machine translations against human reference translations with a variety of metrics including BLEU, F-Measure, Word Error Rate (WER), Translation Edit Rate (TER), Translation Edit Rate Plus (TERP), METEOR, ROUGE, NIST, RIBES, and BEER.
Automated Metrics, Machine Translation Metrics, NLP
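Of the metrics listed, Word Error Rate is the simplest to show in full: it is the word-level edit distance between a machine translation and a human reference, divided by the number of words in the reference. The sketch below is a standard textbook formulation with whitespace tokenization; the function name `word_error_rate` and the sample sentences are assumptions for this example, and Language Studio's own implementation may tokenize and normalize differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))
```

In this sample, one word is substituted and one is missing, so two edits over six reference words give a WER of about 0.333. Lower is better, and identical sentences score 0.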
Integrated Workflow Studio Features
Workflow Automation, Text Analytics, Human Language Technology, NLP, Scripting