The broad set of technologies used in Omniscien products covers a wide range of use cases and purposes. Many of the technologies are used in multiple products or exposed within product features for users to integrate into their workflows via scripting. In combination, these products can offer novel solutions to many of today’s most advanced data processing challenges.
The information on this page provides a summary of a subset of the technologies developed and used by Omniscien. At the heart of all the processes and technologies is the focus on reducing human effort while increasing accuracy.
Technologies fall into the following categories:
- Artificial Intelligence and Machine Learning Technologies
- State-of-the-Art Machine Translation Technologies
- Machine Translation Engine Training Technologies
- Data Gathering, Data Synthesis, and Data Creation Technologies
- Data Cleaning Technologies
- Measurement Technologies
- Natural Language Processing and Data Processing Technologies
- Workflow Automation Technologies
- Subtitle, Video, Audio, and Media Processing Technologies
- Data Source and Integration Technologies
- Data Conversion Technologies
Artificial Intelligence and Machine Learning Technologies
All Omniscien products are built on a core base of Artificial Intelligence, Machine Learning, and Natural Language Processing. Machine learning enables machines to work more like humans so that humans don't have to work more like machines. Each feature is designed to augment human intelligence, enhance productivity, increase quality, and reduce cost. Artificial intelligence enables the processing and organization of data that simply would not be cost-effective or feasible with a human-only approach.
Examples of artificial intelligence and machine learning-based technologies include:
- Neural Machine Translation
- Video Transcription
- Language Identification
- Document matching
- Dialog Extraction
- Quality Analysis
And many more...
State-of-the-Art Machine Translation Technologies
Machine translation technology has made many advances in recent years and continues to change at a rapid pace. While the latest Neural Machine Translation (NMT) technology produces more fluent and natural-sounding sentences, it still has some weaknesses. Many of those weaknesses are better handled by the legacy Statistical Machine Translation (SMT) technology. The Omniscien team has developed a unique hybrid approach that integrates the benefits of both technologies to produce a higher-quality translation output.
- Statistical Machine Translation (SMT): A mature and stable machine translation technology. The outputs from SMT training have many purposes beyond translation and are used to build dictionaries, subtitle splitting positions, sentence joining determination, confidence scoring and more.
- Neural Machine Translation (NMT): Deep Recurrent Neural Networks (RNN) with Deep Transition Cells, Google Transformer models, and Multi-source models.
- Hybrid Machine Translation: Combines multiple machine translation technologies into a single, higher quality translation output.
- Confidence Scoring: Machine-translated output is automatically scored to determine the quality of the translation. Confidence scores can be used as part of Hybrid Machine Translation processing and output in XLIFF data for use in Translation Management Systems (TMS).
- Runtime Glossaries and Do-Not-Translate lists: Apply runtime glossaries to each job to lock in terminology. Specify which terms should not be translated such as brand names and important abbreviations.
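One common way to protect do-not-translate terms is to mask them with placeholders before translation and restore them afterwards. The sketch below illustrates that general idea only; it is not a description of how Omniscien products implement the feature. The function names `mask_terms` and `unmask_terms` and the `__DNTn__` placeholder format are hypothetical.

```python
import re


def mask_terms(text, dnt_terms):
    """Replace each do-not-translate term with a numbered placeholder.

    Returns the masked text and a placeholder -> term mapping.
    The placeholder format is an assumption made for this example.
    """
    mapping = {}
    # Process longer terms first so a long term is not broken up by a shorter one.
    for i, term in enumerate(sorted(dnt_terms, key=len, reverse=True)):
        placeholder = f"__DNT{i}__"
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        if pattern.search(text):
            text = pattern.sub(placeholder, text)
            mapping[placeholder] = term
    return text, mapping


def unmask_terms(text, mapping):
    """Put the original terms back after translation."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text


source = "Open Language Studio and press OK."
masked, mapping = mask_terms(source, ["Language Studio", "OK"])
# The masked text would be sent to the translation engine here.
restored = unmask_terms(masked, mapping)
```

In this sketch the brand name and the abbreviation pass through translation untouched because the engine only ever sees the placeholders.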
Machine Translation Engine Training Technologies
Language Studio provides a wide range of unique and powerful tools for data creation and analysis. Unlike other machine translation providers, we offer a Professional Guided Customization option that enables features such as custom writing style and vocabulary. Language Studio offers a choice of Do-It-Yourself (DIY) and Professional Guided Customization so that our customers can quickly and easily get the highest possible machine translation quality, even when they have little or no data. When data is available from customers, advanced Data Cleaning Technologies and Measurement Technologies ensure that only the highest quality data is used to customize an engine.
- Tuning and Test Set Extraction: Extract high-quality tuning and test set data for use in automated quality metrics and quality comparisons.
- Fully Automated Training: Automatically process and train data via SMT, NMT and other machine learning technologies to create custom machine translation engines.
- Incremental Training: Quickly improve existing machine translation engines with your edited or newly acquired data.
- Multiple Domains in One MT Engine: Professional Custom MT Engines can be built with multiple domains embedded in a single MT engine. At runtime, switch dynamically between different domains based on the input type.
Data Gathering, Data Synthesis, and Data Creation Technologies
Data gathering technologies are essential for modern machine translation engines, which require millions of bilingual sentences in the correct domain and writing style to produce high-quality output. When creating a Professionally Guided Custom Machine Translation Engine, a project manager is assigned who determines the best approach to achieve the highest possible translation quality. Different tools and approaches are chosen depending on the data available, the volume of your data, and your goals.
- Domain Matching: Automatically identify and classify bodies of text by domain. Detect domains such as medical, information technology, news, politics, etc.
- File Pair Matching: Analyze large volumes of files in 2 or more languages and pair them as sets of translated files.
- Sentence Pair Matching: Also known as Sentence Alignment, match bilingual sentences in a pair of files.
- Monolingual Website Crawling: Download entire websites. Mine monolingual data from the web that is in domain and in a similar writing style to the desired translations.
- Bilingual Data Mining and Matching: Mine websites and automatically match bilingual web pages, and match and extract bilingual sentences within each document.
- Data Synthesis: Enhance your data with millions of synthesized bilingual sentences that are in domain, relevant, and high-quality. Leverage your existing translation memories and glossaries to create large volumes of bilingual sentences that specifically match your purpose.
- Data Manufacturing: Create millions of sentences of bilingual data from only in-domain monolingual data.
- Glossary and Terminology Extraction: Automatically analyze monolingual data to extract glossary terms, people names, and organization names.
- Bilingual Glossary Creation: Automatically generate translation suggestions or draw from existing translation memories and termbases. Review the terms and their translations at speeds in excess of 160 terms per hour with advanced AI-powered editing tools.
Data Cleaning Technologies
The old saying “Garbage in, garbage out” has never been more important than it is today. Omniscien pioneered the concept of “Clean Data SMT” back in 2006 when the industry and academics were all claiming that more data is better than clean data. The Omniscien team has always had the mindset that learning technologies cannot learn bad patterns if you do not give them bad data to learn from.
All Omniscien data is processed via a series of analysis and cleaning tools to ensure that only clean data is used for training artificial intelligence and machine learning systems.
- Quality Confidence Scores: Uses a variety of approaches to produce metrics that determine the accuracy, fluency, and quality of bilingual sentence pairs.
- Parameter Based Quality Extraction: Extract data from larger bodies of corpora based on quality metrics, scores and other user specified parameters.
- Language ID: Determine the language of each sentence, entire files in various formats, and breakdowns of different languages within each file.
- Encoding: Detect the text encodings of files and convert them to UTF-8 or other desired encodings.
- Noise Removal: Remove noise and junk data from text strings and files.
- Text Extraction: Automatically extract clean sentences from formats such as XML, HTML, Adobe PDF, Microsoft Word, PowerPoint, Excel, Emails and more.
- Sentence Repair: Repair broken and split sentence fragments. Join fragments based on artificial intelligence and rules to make complete sentences. This is especially complex and important for Asian languages such as Thai, Chinese, Japanese, and Korean.
- Domain Identification: Automatically classify data by their domain topic such as medical, information technology, or automotive.
- Anonymization: Detect sensitive information and remove associated sentences or replace the sensitive information with placeholders.
- OCR, Machine Translation, and Pornography Detection: Detect content that is generated from other technologies or otherwise undesirable in nature for training machine learning and artificial intelligence systems.
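As an illustration of the anonymization step listed above, the following sketch shows the two behaviors it describes: replacing sensitive spans with placeholders, and dropping sentences that contain them. It is a minimal example with two deliberately simple regular expressions; the pattern set, the `<EMAIL>`/`<PHONE>` placeholder names, and the function names are assumptions for this example, and production systems typically combine many more detectors.

```python
import re

# Hypothetical, intentionally simple detectors for this example.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def anonymize(sentence):
    """Replace each detected sensitive span with a <LABEL> placeholder."""
    for label, pattern in PATTERNS.items():
        sentence = pattern.sub(f"<{label}>", sentence)
    return sentence


def drop_sensitive(sentences):
    """Keep only the sentences in which no detector fires."""
    return [
        s for s in sentences
        if not any(p.search(s) for p in PATTERNS.values())
    ]


print(anonymize("Contact jane.doe@example.com or +1 555 123 4567."))
print(drop_sensitive(["Mail bob@example.org today.", "The engine is trained."]))
```

Replacing with placeholders keeps the sentence usable as training data, while dropping is the safer choice when the surrounding text may itself be identifying.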
Measurement Technologies
Measurement is required in a variety of use cases. It usually relates to quality but can also relate to time and performance. As yet, there is no perfect way to measure translated content. Omniscien tools use a variety of technologies that, when used together, give a clearer picture of bilingual content quality.
- Machine Translation Quality Analysis: Scores the quality of a machine translation in real-time. This is useful as scores in XLIFF files or in decision trees for Hybrid Machine Translation.
- Bilingual Data Cleaning Quality Analysis: Determines the quality of bilingual data from translation memories and sentence pairs. This is useful when ingesting and cleaning sentence pairs from external sources.
- Automated Source + Reference Quality Metrics: Automated metrics for Bilingual Evaluation Understudy (BLEU), F-Measure, Levenshtein, Edit Distance, Translation Edit Rate (TER), Word Error Rate (WER), Metric for Evaluation of Translation with Explicit ORdering (METEOR), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Rank-based Intuitive Bilingual Evaluation Score (RIBES), BEtter Evaluation as Ranking (BEER), etc.
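To make two of the listed metrics concrete, the sketch below computes Levenshtein edit distance and uses it to derive Word Error Rate (WER), defined as the word-level edit distance between a reference and a hypothesis divided by the number of reference words. This is the textbook formulation, shown for illustration; the function names are chosen for this example, and whitespace tokenization is a simplifying assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    Works on strings (character level) or lists of words (word level).
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


print(edit_distance("kitten", "sitting"))
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower values indicate a closer match to the reference. Because WER is normalized by reference length, it can exceed 1.0 when the hypothesis is much longer than the reference.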
Natural Language Processing and Data Processing Technologies
Natural Language Processing (NLP) tools enable workflows to process data to better understand language and language features more easily. These tools analyze and automatically process text for a wide variety of tasks. The list below is only a partial list of the types of NLP-related tools that are exposed via LSScript and LSTools in Workflow Studio, Media Studio, and Language Studio.
- Language Identification: Identify the language of sentences and files.
- Sentiment Analysis: Determine if the text is happy, sad, or neutral.
- Syntax Parsing: Grammatically analyze the structure of sentences.
- Part of Speech: Determine the part of speech of each word in a sentence.
- Named Entity Recognition: Extract people names, organization names, locations, dates, currencies, and other entities.
- Term Extraction and Generation: Extract predefined and relevant terms from a given text.
- Domain Identification: Identify the sphere and domain of a given text.
- Document Alignment: Match bilingual documents in different languages as document pairs.
- Sentence Alignment: Match bilingual sentences from pairs of documents.
- Word Stemming: Reduce words back to their stem form.
- Optical Character Recognition (OCR): Analyze images and PDF files and convert them into text or Microsoft Office Documents.
- Automated Speech Recognition (ASR): Recognize speech and convert it to text.
- Document Conversion: Convert documents between a variety of different formats.
- Web Crawling: Download entire websites.
- Data Mining: Analyze large bodies of content and extract useful data.
- Machine Translation: Translate text across 600+ language pairs.
And many more...
Workflow Automation Technologies
Reliable workflow automation and coordination are an essential part of all Omniscien products. LSScript ties together complex workflows while providing the flexibility and customization capabilities for customers. Workflow Studio's Job Management and Scheduling provides a robust infrastructure to deploy powerful and scalable language processing solutions for almost any language processing need.
- Job Management and Scheduling: Workflow automation is provided in all Omniscien products by Workflow Studio features. Jobs are managed, tracked, scheduled, and executed with full visibility, progress, load balancing, and load distribution. Hierarchical job relationships enable job dependencies and sequencing.
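Job dependencies and sequencing of the kind described above are commonly modeled as a directed acyclic graph and resolved with a topological sort. The sketch below shows that general technique using Python's standard `graphlib` module (Python 3.9+). It does not represent Workflow Studio's scheduler or LSScript syntax, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return job names ordered so every job runs after its prerequisites.

    `dependencies` maps each job to the set of jobs it depends on.
    A graphlib.CycleError is raised if the dependencies form a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: each job lists the jobs that must finish first.
jobs = {
    "extract_text": set(),
    "identify_language": {"extract_text"},
    "translate": {"identify_language"},
    "score_quality": {"translate"},
}

order = execution_order(jobs)
print(order)
```

A real scheduler would also run independent jobs in parallel; `TopologicalSorter` supports that through its `prepare()`, `get_ready()`, and `done()` methods.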
Subtitle, Video, Audio, and Media Processing Technologies
Media Studio leverages a wide range of technologies. The list below gives examples of technologies specific to Media Studio that are used in conjunction with automated workflows, managed jobs, and user-driven tasks.
- Automated Speech Recognition (ASR): Also known as Voice Recognition and Transcription, converts audio into text form.
- Speaker Detection: Also known as Speaker diarization, detects and partitions audio data into homogeneous segments based on speaker identity.
- Voice Activity Detection: Determines the boundaries of when a person starts and stops speaking.
- Scene Change Detection: Determines the points in time within a video where a scene changes.
- File Format Conversion: Converts between a range of different media formats such as SRT, WebVTT, TTML, DFXP, etc.
- Video Format Conversion and Manipulation: Convert a wide variety of formats, decode, encode, transcode, mux, demux, stream, and filter.
- Dialog Extraction: Automatically extract dialog, speaker, and other information from directors’ scripts, screenplays, and storyboards.
- Sentence Joining: To translate properly with full context, and to allow words to be reordered across languages, sentence fragments must first be joined back into complete, well-formed sentences. Artificial intelligence and rules produce complete sentences for translation.
- Sentence Splitting: After translation, sentences need to be split in the correct positions that map to a similar structure to the source text. Sentences are split into fragments and distributed across multiple subtitles in natural positions in a manner that mimics how a human editor would split them.
- Real-Time Editing Tools: Web-based subtitle editing and creation tools with support for multiple concurrent editors and quality assurance team members working collaboratively on the same file. Changes are synced in real time across all users' browsers.
- Project Management Tools: Comprehensive project management, tracking, and scheduling. Projects are driven by Gantt charts with real-time views on task progress. Gantt charts become the basis for workflows and provide a set of machine and human tasks that together deliver considerable productivity gains.
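As a small illustration of the subtitle file format conversion mentioned in the list above, the sketch below converts SubRip (SRT) text to WebVTT. The two formats are close: WebVTT requires a `WEBVTT` header line and uses a period rather than a comma before the milliseconds in cue timings. This is a minimal example that ignores styling, positioning, and malformed input; the function name is chosen for this example and is not a product API.

```python
import re

# Matches the hh:mm:ss,mmm timestamps used on SRT timing lines.
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text):
    """Convert SRT subtitle text to WebVTT.

    Numeric SRT counters are kept; WebVTT accepts them as cue identifiers.
    """
    lines = []
    for line in srt_text.replace("\r\n", "\n").strip().split("\n"):
        if "-->" in line:
            # Only timing lines are rewritten, so commas in dialog survive.
            line = TIMING.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


srt = "1\n00:00:01,000 --> 00:00:02,500\nHello, world\n"
print(srt_to_webvtt(srt))
```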
Data Source and Integration Technologies
Many jobs use data that is sourced from a variety of file and data storage locations. Omniscien products are able to read files from a vast array of remote and local data storage locations. A simple REST API can upload and download data, or use callbacks to move data to and from an external REST API. Folders on a wide variety of file storage systems can also be monitored, with jobs starting automatically and files moving to and from file storage locations.
Data sources and integrations include:
- Translation Management Systems: memoQ, MemSource, XTM, Atril DejaVu, Globalsight, SDL Studio, SDL WorldServer, Across, Alchemy Catalyst, STAR Transit, Transperfect GlobalLink Project Director, XTRF, etc.
- Integrated Solution Partners: ABBYY, Basis Technology, KeenCorp, Catalyst Systems, Clay Tablet, Crosslang, Acrolinx, Okapi Framework, Fluency, Plunet, Veda Semantics, etc.
- REST API: A clean and easy to use REST API for all core functionality.
- File Storage Systems: FTP and SFTP, Local Disk, REST API (Call Out), REST API (Call In), AWS S3, Google Drive, Dropbox, Box.com, OneDrive Business, and more.
- Email Systems: SMTP, POP, IMAP, GMAIL, MAPI, etc.
Data Conversion Technologies
With the wide range of file formats used in the translation, media, and data processing industries, it is often necessary to convert and normalize file formats.
Supported data conversion formats include:
- Document Format Conversions: Compiled HTML Format (CHM), Electronic Publisher Format (EPUB), Hypertext Markup Language (HTML), Microsoft Word (DOC/DOCX), Microsoft Excel (XLS, XLSX), Microsoft PowerPoint (PPT/PPTX), Open Office (ODF), Adobe Portable Document Format (PDF), Rich Text Format (RTF), Word Perfect Format (WP/WP6/WP7).
- Data Format Conversions: Comma Separated Values (CSV), Tab Separated Values (TSV), Plain Text (TXT), Tab Delimited (TAB), Extensible Markup Language (XML).
- Translation Memory and Termbase Format Conversions: Term Base Exchange (TBX), Translation Memory Exchange (TMX), XML Localization Interchange Format (XLIFF/XLF/MQXLIFF/SDLXLIFF).
- Mail Format Conversions: Microsoft Outlook MSG Format (MSG), Microsoft Outlook PST Format (PST), Microsoft Transport Neutral Encoding Format (TNEF), RFC822 Mail (RFC822)
- Director Script, Screenplay and Storyboard Conversions: Adobe PDF, Microsoft Word, Microsoft Excel, Images
- Subtitle Format Conversions: Timed Text Markup Language (TTML), SubRip Text (SRT), Screen Subtitling Systems PAC Format (PAC), Distribution Format Exchange Profile (DFXP), Web Video Text Tracks (WebVTT), Spruce subtitle (STL), Ogg Writ (OGG), Synchronized Accessible Media Interchange (SAMI), Presentation Graphics Stream (PGS), DVB Subtitling (ETSI 300 743), MPEG-4 Timed Text (3GPP), Consumer Technology Association 708 (CEA-708), Consumer Technology Association 608 (EIA-608/CEA-608).
- Metric Format Conversions: Acceleration, Amount of substance, Angle, Angular Acceleration, Angular Velocity, Area, Density, Electrical Current, Energy, Force, Frequency, Impulse, Length, Luminous intensity, Mass, Moment of a Force, Power, Pressure, Stress, Temperature, Time, Velocity, Volume, Work.
- Feed / Syndication Conversions: Atom Syndication Format (ATOM), IPTC ANPA News Wire feed format (ANPA), Rich Site Summary (RSS)
- Image Format Conversions: Bitmap (BMP/DIB/RLE), Joint Photographic Experts Group 2000 (JP2/J2K/JPG/JPEG), Joint Bi-level Image Experts Group (JB2/JBIG2), PC Paintbrush Format (PCX), Multipage PCX (DCX), Portable Network Graphics (PNG), Tagged Image File Format (TIF/TIFF), Adobe Portable Document Format (PDF), DjVu (DJVU/DJV), Graphics Interchange Format (GIF), Windows Media Photo (WDP).
- Video Format Conversion: MPEG-1 Part 2, H.261 (Px64), H.262/MPEG-2 Part 2, H.263, MPEG-4 Part 2, H.264/MPEG-4 AVC, HEVC/H.265 (MPEG-H Part 2), Motion JPEG, IEC DV video and CD+G, AV1, SMPTE 314M (a.k.a. DVCAM and DVCPRO), SMPTE 370M (a.k.a. DVCPRO HD), VC-1 (a.k.a. WMV3), VC-2 (a.k.a. Dirac Pro), VC-3 (a.k.a. AVID DNxHD), Animated GIF, AVS video, Microsoft RLE, Microsoft Video 1, Cinepak, Indeo (v2, v3 and v5), Microsoft MPEG-4 v1, v2 and v3, Windows Media Video (WMV1, WMV2, WMV3/VC-1), WMV Screen and Mimic codec, RTV 2.1 (Intel Indeo 2), RealVideo Fractal Codec (a.k.a. Iterated Systems ClearVideo), 1, 2, 3 and 4, Cinepak (Apple Compact Video), ProRes, Sorenson 3 Codec, QuickTime Animation (Apple Animation), QuickTime Graphics (Apple Graphics), Apple Video, Apple Intermediate Codec and Pixlet, Screen video, Screen video 2, Sorenson Spark and VP6, Theora, Duck TrueMotion 1, Duck TrueMotion 2, Duck TrueMotion 2.0 Real Time, VP3, VP4, VP5, VP6, VP7, VP8, VP9 and animated WebP, Smacker video and Bink video, TXD, Silicon Graphics RLE 8-bit video, Silicon Graphics MVC1/2, IBM UltiMotion, Avid 1:1x, Avid Meridien, Avid DNxHD and DNxHR, Autodesk Animator Studio Codec and FLIC, HQ, HQA, HQX and Lossless, SpeedHQ, APNG, Matrox Uncompressed SD (M101) / HD (M102), ATI VCR1/VCR2, ASUS V1/V2 codec.
- Audio Format Conversions: MP1, MP2, MP3, AAC, HE-AAC, MPEG-4 ALS, G.711 μ-law, G.711 A-law, G.721 (a.k.a. G.726 32k), G.722, G.722.2 (a.k.a. AMR-WB), G.723 (a.k.a. G.726 24k and 40k), G.723.1, G.726, G.729, G.729D, IEC DV audio and Direct Stream Transfer, SMPTE 302M, Full Rate (GSM 06.10), AC-3 (Dolby Digital), Enhanced AC-3 (Dolby Digital Plus) and DTS Coherent Acoustics (a.k.a. DTS or DCA), MLP / Dolby TrueHD, DTS Coherent Acoustics (a.k.a. DTS or DCA), DTS Extended Surround (a.k.a. DTS-ES), DTS 96/24, DTS-HD High Resolution Audio, DTS Express (a.k.a. DTS-HD LBR), DTS-HD Master Audio, QDesign Music Codec 1 and 2, AMR-NB, AMR-WB (a.k.a. G.722.2), QCELP-8 (a.k.a. SmartRate or IS-96C), QCELP-13 (a.k.a. PureVoice or IS-733) and Enhanced Variable Rate Codec (EVRC. a.k.a. IS-127), iLBC (via libilbc), Opus and Comfort noise, DSS-SP, Windows Media Audio (WMA1, WMA2, WMA Pro and WMA Lossless), XMA (XMA1 and XMA2), MS-GSM and MS-ADPCM, IMA ADPCM, DVI4 audio codec, RealAudio v1 – v10, ALAC, Adobe SWF ADPCM and Nellymoser Asao, Speex (via libspeex), Vorbis, Opus and FLAC, Adaptive Transform Acoustic Coding (ATRAC1, ATRAC3, ATRAC3Plus and ATRAC9) and PSX ADPCM, TwinVQ, DK ADPCM Audio 3/4, On2 AVC and iLBC (via libilbc), Truespeech.
- Encoding Format Conversions: Big5, Big5-HKSCS, Big5-Solaris, EUC-JP, euc-jp-linux, eucJP-Open, EUC-KR, EUC-TW, GB18030, GB2312, GBK, IBM00858, IBM01140, IBM01141, IBM01142, IBM01143, IBM01144, IBM01145, IBM01146, IBM01147, IBM01148, IBM01149, IBM037, IBM1006, IBM1025, IBM1026, IBM1046, IBM1047, IBM1097, IBM1098, IBM1112, IBM1122, IBM1123, IBM1124, IBM1381, IBM1383, IBM273, IBM277, IBM278, IBM280, IBM284, IBM285, IBM297, IBM33722, IBM420, IBM424, IBM437, IBM500, IBM737, IBM775, IBM834, IBM850, IBM852, IBM855, IBM856, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM868, IBM869, IBM870, IBM871, IBM874, IBM875, IBM918, IBM921, IBM922, IBM930, IBM933, IBM935, IBM937, IBM939, IBM942, IBM942C, IBM943, IBM943C, IBM948, IBM949, IBM949C, IBM950, IBM964, IBM970, IBM-Thai, ISCII91, ISO-2022-CN, ISO2022-CN-CNS, ISO2022-CN-GB, ISO-2022-JP, ISO-2022-KR, ISO-8859-1, iso-8859-11, ISO-8859-13, ISO-8859-15, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, JIS_X0201, JIS_X0212-1990, JIS0208, JISAutoDetect, Johab, KOI8-R, KOI8-U, MacArabic, MacCentralEurope, MacCroatian, MacCyrillic, MacDingbat, MacGreek, MacHebrew, MacIceland, MacRoman, MacRomania, MacSymbol, MacThai, MacTurkish, MacUkraine, MS950-HKSCS, mswin-936, PCK, Shift_JIS, SJIS_0213, TIS-620, US-ASCII, UTF-16, UTF-16BE, UTF-16LE, UTF-16LE-BOM, UTF-32, UTF-32BE, UTF-32BE-BOM, UTF-32LE, UTF-32LE-BOM, UTF-8, Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257, Windows-1258, Windows-31j, Windows-50220, Windows-50221, Windows-874, Windows-949, Windows-950, Windows-iso2022jp.
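Converting a file from a legacy encoding to UTF-8 amounts to decoding the bytes with the source encoding and re-encoding the resulting text. The sketch below shows this with Python's built-in codecs and a trial-decoding fallback over a caller-supplied list of candidates. Trial decoding is a simple heuristic assumed for this example, not a statistical encoding detector, and the function name and default candidate list are hypothetical.

```python
def convert_to_utf8(data, candidate_encodings=("utf-8", "shift_jis", "windows-1252")):
    """Decode bytes with the first candidate encoding that succeeds.

    Returns the text re-encoded as UTF-8 and the encoding that was used.
    Order matters: permissive single-byte encodings such as windows-1252
    accept almost any input, so they belong at the end of the list.
    """
    for encoding in candidate_encodings:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return text.encode("utf-8"), encoding
    raise ValueError("none of the candidate encodings could decode the data")


legacy_bytes = "日本語".encode("shift_jis")
utf8_bytes, used = convert_to_utf8(legacy_bytes)
print(used, utf8_bytes.decode("utf-8"))
```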