Language Studio
Automatic Speech Recognition / ASR
Technical Features

Language Studio provides best-in-class, AI-based voice recognition technologies that are tightly integrated with Machine Translation (MT) and Natural Language Processing (NLP) tools.

Access all the features via our secure portal, integrated applications or REST APIs.

Speak to Me! Accuracy & Customization. Only the Best!

Best-in-class, AI-driven voice recognition and machine translation deliver real-time transcriptions, live captions and subtitles, professionally formatted meeting minutes, connections for the hard of hearing, broadcast content across language barriers, video file transcriptions, and as-spoken directors' scripts.

Technical Features Overview

Real-Time & Batch Processing

  • Transcribe and translate in real-time.
  • Receive partial transcriptions and completed sentences in real-time.
  • Process multiple files concurrently in batch mode.
  • Produce multiple outputs from the same audio.
  • Mix multiple streams and channels to create a final transcript.
  • Submit via REST API (see the request sketch below), our easy-to-use web interface, or integrated applications.
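As a rough illustration of batch submission over the REST API, the Python sketch below uploads a media file and reads back a job identifier. The endpoint URL, headers, and field names are placeholder assumptions for illustration only, not the documented Language Studio API; consult the API reference for the actual request format.

    import requests

    # Hypothetical endpoint and parameters for illustration only;
    # the real Language Studio API paths and field names may differ.
    API_BASE = "https://api.example.com/languagestudio/v1"
    API_KEY = "YOUR_API_KEY"

    def submit_batch_transcription(audio_path: str, language: str = "en") -> str:
        """Upload a pre-recorded media file for batch transcription
        and return the job identifier assigned by the server."""
        with open(audio_path, "rb") as audio_file:
            response = requests.post(
                f"{API_BASE}/transcriptions",
                headers={"Authorization": f"Bearer {API_KEY}"},
                files={"media": audio_file},
                data={"language": language, "diarization": "speaker"},
                timeout=60,
            )
        response.raise_for_status()
        return response.json()["job_id"]

    if __name__ == "__main__":
        job_id = submit_batch_transcription("meeting.wav", language="en")
        print("Submitted job:", job_id)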

48 Spoken Languages

  • Arabic, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Vietnamese, and more…
  • Many more languages are in development.
Click to see the full list of supported languages.

Speed and Accuracy

  • Consistently outperforms Google and Microsoft for accuracy.
  • ASR latency is as little as 1 second.
  • Translation latency is as little as 50 milliseconds.
  • Configurable latency for the right balance between speed and accuracy.
  • Add a custom glossary and quickly train with your own custom data for even greater accuracy.
  • Support for a wide range of accents in a single language pack.
  • Select a specialized industry domain for even higher accuracy. Domains include finance, medical, information technology, and more.
Learn More >>

Output Formats

  • Transcription text with accurate punctuation, paragraphs and sentence boundaries.
  • Rich JSON data with words, parts of speech, start and end time, refined punctuation, sentiment, named entities, speaker, channel, confidence scores, and more.
  • Subtitles and Closed Captions are formatted based on custom configuration to produce splits and timing that feel natural.
  • Microsoft Word formatted templates for meeting minutes, interview transcriptions, etc.
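To give a feel for how the rich JSON output might be consumed, the Python sketch below walks a result document and prints speakers, timings, and text. The field names ("sentences", "words", "speaker", "start", "end") are assumptions for illustration; the real schema is described in the API documentation.

    import json

    # A minimal, assumed shape for the enriched JSON output.
    # Field names are illustrative only; the real schema may differ.
    sample = """
    {
      "sentences": [
        {
          "speaker": "S1",
          "sentiment": "neutral",
          "words": [
            {"text": "Hello", "start": 0.12, "end": 0.48, "confidence": 0.97},
            {"text": "everyone", "start": 0.50, "end": 0.94, "confidence": 0.95}
          ]
        }
      ]
    }
    """

    def summarize(transcript: dict) -> None:
        """Print one line per sentence: speaker, time span, and text."""
        for sentence in transcript["sentences"]:
            words = sentence["words"]
            text = " ".join(w["text"] for w in words)
            start, end = words[0]["start"], words[-1]["end"]
            print(f'{sentence["speaker"]} [{start:.2f}-{end:.2f}s]: {text}')

    summarize(json.loads(sample))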

Speaker/Channel Diarization

  • Automatically detect and label different speakers on up to six streams or channels.
  • Detect and label different speakers within the same channel.
  • Easily identify a change of speaker within your transcript and improve its readability.
  • Fine tune speaker detection algorithms with configurable settings.

Powerful Server and API for Integration

  • Enterprise class scalability and processing features.
  • Scale to thousands of audio streams for real-time processing.
  • Automatically identify the language being spoken.
  • Submit batch files for processing via API.
  • Integrate your own voice applications seamlessly.
  • Dynamically scale server resources up and down based on demand.

Understanding Core Technical Features

Transcription Modes

Batch File Transcription

Transcribe pre-recorded media files via our secure document portal or application user interfaces, or integrate via RESTful API.
  • Control and Customization: Custom glossary/dictionary, Industry Domains, Customized Domain Adaptation – Learn More
  • Times: Start and end times for each word and sentence.
  • Accuracy: Latency vs. accuracy control, confidence scores – Learn More
  • Formatting: Punctuation, Entity Formatting, and Sentence Boundary detection.
  • Diarization: Speaker and channel detection, speaker change detection.
  • NLP: Sentiment analysis, terminology extraction, syntax, etc.
  • Machine Translation: Support for 600+ language pairs.

Real-Time Transcription

Transcribe from any audio source in real-time.
  • All the same features as Batch File mode, delivered in real-time, plus some great additional options.
  • Subtitles: Generate live subtitles in multiple languages – all beautifully formatted and in real-time.
  • Share: Share real-time output events with multiple client applications concurrently from a single audio stream.
  • Partial Transcriptions: Show partial sentences even before a sentence is complete.
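A minimal sketch of a real-time streaming client is shown below, using the Python websockets package to send audio and print partial and final transcripts as they arrive. The URL and message format are assumptions for illustration, not the actual Language Studio streaming protocol, and a production client would send and receive concurrently rather than sequentially.

    import asyncio
    import json
    import websockets  # pip install websockets

    # Hypothetical streaming endpoint and event format for illustration.
    WS_URL = "wss://api.example.com/languagestudio/v1/stream?lang=en"

    async def stream_audio(audio_chunks):
        """Send raw audio chunks, then print partial and final transcripts."""
        async with websockets.connect(WS_URL) as ws:
            for chunk in audio_chunks:
                await ws.send(chunk)                      # binary audio frame
            await ws.send(json.dumps({"event": "end_of_stream"}))

            async for message in ws:
                event = json.loads(message)
                if event.get("type") == "partial":
                    print("partial:", event["text"], end="\r")
                elif event.get("type") == "final":
                    print("final:  ", event["text"])

    # Example usage, where `chunks` is an iterable of raw PCM byte strings:
    # asyncio.run(stream_audio(chunks))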

Translation and Natural Language Processing (NLP)

Subtitle Optimized Machine Translation

Get more value from voice data with NLP and translation

Make your voice apps smarter. Use Natural Language Processing tools to get more from your data. Easily enable applications to extract context, syntax, parts of speech, key terms, sentiment, and meaning; summarize voice content; and even translate your voice content into other languages.

  • Translate each sentence as it is transcribed in about 50 milliseconds.
  • Configure your voice workflows to enrich your voice data with more information.
  • Enable secondary processes to be smarter and more useful by adding intelligent metadata to the voice text.
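As a small sketch of how a secondary process becomes more useful when metadata rides along with the text, the Python example below routes negative-sentiment sentences for review. The metadata fields shown are assumptions about the enriched output, used for illustration only.

    # Assumed sentence-level metadata attached by the NLP pipeline.
    transcript = [
        {"text": "The quarterly numbers look strong.", "sentiment": 0.8,
         "entities": [{"type": "DATE", "text": "quarterly"}]},
        {"text": "Customers are unhappy with the delays.", "sentiment": -0.6,
         "entities": [{"type": "ORG", "text": "Customers"}]},
    ]

    def flag_for_review(sentences, threshold=-0.3):
        """Return sentences whose sentiment score falls below the threshold,
        so a downstream workflow can escalate them automatically."""
        return [s for s in sentences if s["sentiment"] < threshold]

    for s in flag_for_review(transcript):
        print("Needs attention:", s["text"])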

Output Formats

Convert Voice to a Variety of Output Formats

Language Studio produces voice documents to match a wide variety of use cases.

  • Plain Text Transcription: A simple, clean text file containing the words and sentences spoken.
  • Enriched JSON Data: Includes times, speaker information, channel information, start time, end time, refined punctuation, named entities, confidence scores, translations, sentiment and more.
  • Subtitles and Closed Captions: Create well formatted SRT, TTML, DFXP and VTT subtitle and caption files.
    • Format voice data based on custom configuration for basic rules such as minimum and maximum display time, characters per line, lines per subtitle, reading speed, gaps between subtitles, speaker change formatting, etc.
    • Split subtitles and lines using artificial intelligence in natural positions that make the output feel as if a human had structured the subtitle.
  • Microsoft Office: Merge variables and Voice JSON data into Microsoft Word and Excel custom templates to produce beautiful meeting minutes, interview transcriptions, notes, etc.
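To make the subtitle formatting idea concrete, the Python sketch below groups word timings into SRT cues under one simplified rule (a maximum number of characters per subtitle). It is a self-contained illustration, not the Language Studio formatter, which applies many more rules such as reading speed, minimum gaps, line splits, and speaker changes.

    def srt_timestamp(seconds: float) -> str:
        """Format seconds as an SRT timestamp, e.g. 00:00:01,250."""
        total_ms = int(round(seconds * 1000))
        h, rem = divmod(total_ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    def words_to_srt(words, max_chars=42):
        """Group word timings into numbered SRT cues of at most max_chars."""
        cues, current, start = [], [], None
        for w in words:
            if start is None:
                start = w["start"]
            candidate = " ".join(t["text"] for t in current + [w])
            if current and len(candidate) > max_chars:
                cues.append((start, current[-1]["end"], current))
                current, start = [w], w["start"]
            else:
                current.append(w)
        if current:
            cues.append((start, current[-1]["end"], current))
        lines = []
        for i, (s, e, ws) in enumerate(cues, 1):
            text = " ".join(t["text"] for t in ws)
            lines += [str(i), f"{srt_timestamp(s)} --> {srt_timestamp(e)}", text, ""]
        return "\n".join(lines)

    words = [{"text": "Welcome", "start": 0.0, "end": 0.5},
             {"text": "to", "start": 0.5, "end": 0.6},
             {"text": "the", "start": 0.6, "end": 0.7},
             {"text": "demo", "start": 0.7, "end": 1.1}]
    print(words_to_srt(words))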

Scalability

Scale to Thousands of Streams and Users

Seamlessly scale to large numbers of concurrent streams.

  • Each audio stream is an independent Docker container.
  • Multiple Docker containers within a single host can be started and stopped as needed.
  • Multiple users/clients can listen to events on the same stream concurrently, saving resources.
  • Secured stream output can be accessed only with authentication.
  • Compatible with AWS, Google Cloud, Microsoft Azure, and other virtual environments.
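The one-container-per-stream model can be illustrated with the Docker SDK for Python, as in the sketch below. The image name and environment variables are placeholders, not the actual Language Studio deployment artifacts.

    import docker  # pip install docker (the Docker SDK for Python)

    client = docker.from_env()

    def start_stream_worker(stream_id: str):
        """Start an independent ASR container for a single audio stream."""
        return client.containers.run(
            "example/asr-stream-worker:latest",   # placeholder image name
            detach=True,
            name=f"asr-stream-{stream_id}",
            environment={"STREAM_ID": stream_id},
        )

    def stop_stream_worker(stream_id: str):
        """Stop and remove the container when the stream ends."""
        container = client.containers.get(f"asr-stream-{stream_id}")
        container.stop()
        container.remove()

    # worker = start_stream_worker("broadcast-01")
    # ... stream audio ...
    # stop_stream_worker("broadcast-01")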

Accurate Punctuation, Sentence Boundaries, Entity Formatting and Locales

A great balance between control and artificial intelligence

There is no one-size-fits-all in voice. Spoken audio also has no inherent punctuation or boundaries between sentences. Artificial intelligence combined with user-defined rules ensures that you get voice data as text in the format you need.

  • Sentence Boundaries: Use AI to automatically detect sentence boundaries / the end of a sentence within spoken audio.
  • Punctuation: Use AI to insert punctuation such as commas, periods, question marks and exclamation marks. Override the automated punctuation with your own punctuation rules.
  • Locale Formatting: Control the locale formatting style, such as EN-GB or EN-US, and produce different spelling and formatting based on a specific locale.
  • Entity Formatting: Control entity formats by specifying written or spoken form (“one million dollars” vs. “$1 million”).
  • Latency: Flexible and fixed end points, combined with configurable parameters that control the maximum duration before returning data, provide a means to balance speed and accuracy.
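As a loose illustration of how these controls might be expressed, the Python fragment below shows one possible configuration shape. All key names and values are assumptions for illustration, not actual Language Studio configuration keys.

    # Hypothetical configuration sketch; parameter names are illustrative.
    transcription_config = {
        "locale": "en-GB",                  # spelling/formatting per locale
        "punctuation": {
            "enabled": True,
            "allowed_marks": [",", ".", "?", "!"],
        },
        "entity_formatting": "written",     # "written" -> "$1 million"
                                            # "spoken"  -> "one million dollars"
        "sentence_boundaries": "auto",      # AI end-of-sentence detection
        "max_segment_seconds": 8,           # cap latency before returning data
    }

    print(transcription_config["entity_formatting"])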

REST and WebSockets API

Integrate ASR into your Applications

Use the RESTful and WebSockets APIs to power your applications with Language Studio’s artificial intelligence based tools.

APIs include:
  • Voice Transcription / Speech Recognition
  • Voice Language Identification
  • Voice to Document Alignment
  • Machine Translation
  • Sentiment Analysis
  • Term Extraction
  • Text Language Identification
  • Many more Natural Language Processing (NLP) features.

Accuracy and Customization

Autonomous Speech Recognition is leading the way

Autonomous Speech Recognition (ASR) is a new approach to speech recognition development that is delivering unprecedented accuracy and inclusion. It uses the latest techniques in artificial intelligence and machine learning, including self-supervised learning models, allowing much larger datasets from a broader set of voices to be used. Previously untapped voice recordings from thousands of speakers, including those with different accents, ages, and genders, can now be leveraged, reducing bias and improving inclusion while delivering unprecedented accuracy, control, and customization.
  • Accuracy: Unmatched accuracy. See how we compare against Microsoft, Google and others >>
  • Bias and Coverage: Because our ASR engines can utilize more than 32 times the amount of voice learning data compared to traditional approaches, they consistently outperform all the other industry players for gender bias, accent bias, age bias, and handling of background noise. This translates to significantly better inclusion of all voices.
  • Customize your ASR Engine: Take control of your ASR engine and make it your own.  Learn More >>
    • Custom Dictionary / Glossary: Specify up to 1,000 custom dictionary/glossary terms per language to fine tune your ASR engine on specific words.
    • Industry Domain ASR: Select a Pre-Trained Industry Specific ASR Engine. Domains include Automotive, Banking and Finance, E-Discovery, Engineering and Manufacturing, General Purpose, Information Technology, Medicine and Life Sciences, Military Intelligence and Defense, News and Media, Patents and Legal, Politics and Government, Retail and E-Commerce, Subtitles and Dialog, and Travel and Hospitality.
    • Customize your Own Specific Domain: Omniscien offers a full customization service. We can have a customized domain ready for you in hours or days, depending on the scope.
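A custom dictionary/glossary might be supplied alongside transcription requests along the lines of the sketch below. The endpoint, payload structure, and field names are assumptions for illustration only, not the documented Language Studio API.

    import requests

    API_BASE = "https://api.example.com/languagestudio/v1"   # placeholder URL
    glossary = {
        "language": "en",
        "terms": [
            {"written": "Omniscien", "sounds_like": ["om nish ee en"]},
            {"written": "ASR", "sounds_like": ["a s r"]},
        ],
    }

    response = requests.post(
        f"{API_BASE}/glossaries",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json=glossary,
        timeout=30,
    )
    response.raise_for_status()
    print("Glossary id:", response.json().get("id"))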

Speaker and Channel Diarization

Automatically Add Speaker and Channel Labels

It is often useful to know not just what was said but also who said it. Language Studio provides four different modes for separating out different speakers in the audio to support different use cases.
  • Speaker Diarization: Aggregates all audio channels into a single stream for processing and picks out unique speakers based on acoustic matching. Use case: multiple speakers are embedded in the same audio recording and you need to understand what each unique speaker said.
  • Channel Diarization: Transcribes each audio channel separately and treats each channel as a unique speaker. Use case: each speaker can be recorded on a separate audio channel.
  • Speaker Change: Provides the point in the transcription at which a new speaker is believed to have started. Use case: you only need to know that the speaker has changed, usually in a real-time application.
  • Channel Diarization & Speaker Change: Transcribes each audio channel separately and, within each channel, provides the point at which a new speaker is believed to have started. Use case: some speakers can be recorded on separate audio channels, but some channels contain multiple speakers.
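Choosing between these modes could be expressed in a request as in the short Python sketch below. The mode identifiers mirror the modes listed above, but the configuration key and values are assumptions for illustration only.

    def diarization_settings(mode: str) -> dict:
        """Return a request fragment for the chosen diarization mode."""
        valid = {
            "speaker",                     # Speaker Diarization
            "channel",                     # Channel Diarization
            "speaker_change",              # Speaker Change
            "channel_and_speaker_change",  # Channel Diarization & Speaker Change
        }
        if mode not in valid:
            raise ValueError(f"Unknown diarization mode: {mode}")
        return {"diarization": mode}

    print(diarization_settings("channel"))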