6 Proven Methods for Customizing Speech Data Collection

Apr 13, 2022

There are several different types of clients — some have a clear idea of how their speech data should be structured, and some are more flexible with their approach.

As a service provider, we have to make sure the requirements of both types of client are fulfilled. However, a client who is flexible about requirements may not have thought speech data collection through completely.

This is where the contribution of the speech dataset provider comes into play.

It is our responsibility to point out what should be kept in mind before starting an audio data collection project, so that AI organizations can identify a feasible, efficient, and cost-effective solution.

The global voice recognition market is expected to grow from $10.7 billion in 2020 to $27.16 billion in 2026, at a CAGR of 16.8%.

Let's look at the points to keep in mind before customizing a speech data collection project.

Languages and demographics
Collection size
Script structure
Audio requirements and formats
Delivery and Processing Requirements
Other Crucial Points to Note

Languages and demographics

The project should first specify the target languages and target demographic.

Languages and Dialect

Start by keeping the project requirement in mind — the languages for which the speech dataset is being collected and customized. Also, understand the specific proficiency requirement. For instance, should the participant be a native speaker or a non-native speaker?

For example — Native English Speakers

Running close on the heels of language is dialect. To make sure the dataset doesn't suffer from biases, it is advisable to intentionally include a range of dialects so that the participants are diverse.

For example: Australian English-accented speakers

Countries

Before customizing, it is important to know whether the participants must come from specific countries, and whether they must currently live in a specific country.

For example — Punjabi is spoken differently in India and Pakistan.

Demographics

Besides language and geography, the collection can also be customized by demographics. A target distribution of participants can be set by age, sex, educational qualification, and more.

For example — Adults Vs Children or Educated vs Uneducated
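
One way to track a target demographic distribution during recruitment is to compare the participants recruited so far against per-group quotas. The sketch below is only an illustration: the group labels, the quota numbers, and the `remaining_quota` helper are hypothetical and not part of any specific collection tool.

```python
from collections import Counter

def remaining_quota(targets, recruited):
    """Return how many participants are still needed per group.

    targets:   dict mapping a group label to the number of participants wanted
    recruited: iterable of group labels, one per participant already recruited
    """
    counts = Counter(recruited)
    # Clamp at zero so an over-recruited group is never reported as negative.
    return {group: max(0, wanted - counts[group]) for group, wanted in targets.items()}

# Hypothetical targets for an adults vs children split.
targets = {"adult": 30, "child": 20}
recruited = ["adult"] * 12 + ["child"] * 20
print(remaining_quota(targets, recruited))
```

With these made-up numbers, the child quota is already full and 18 adults are still needed.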

Collection size

The size of your dataset will impact the performance of your project. In turn, the amount of data you need determines how many participants are required.

The Total Number of Respondents

Determine the total number of participants that will be required for the project. If the project covers more than one language, determine the number of participants required per target language.

For example — 50% American English and 50% Australian English Speakers

The Total Number of Utterances

To size the speech data collection, determine the number of utterances or repetitions per participant, or the total number of repetitions needed.

For example — 50 participants with 25 utterances per participant = 1250 repetitions
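
The sizing arithmetic above can be sketched in a few lines. This is a minimal illustration that combines the 50/50 language split and the 50 x 25 example from this section; the `plan_collection` function and its parameter names are assumptions made for the example.

```python
def plan_collection(total_participants, utterances_per_participant, language_shares):
    """Split participants across languages and compute utterance totals.

    language_shares maps a language label to its share of participants
    (the shares are expected to sum to 1).
    """
    plan = {}
    for language, share in language_shares.items():
        participants = round(total_participants * share)
        plan[language] = {
            "participants": participants,
            "utterances": participants * utterances_per_participant,
        }
    return plan

plan = plan_collection(50, 25, {"American English": 0.5, "Australian English": 0.5})
total = sum(entry["utterances"] for entry in plan.values())
print(plan)
print(total)  # 1250
```

Because each share is rounded independently, uneven splits may not add up exactly to the total number of participants, so the result is worth checking by hand for such cases.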

Script structure

The script can also be customized to meet the needs of the project, so it is advisable to seek the help of speech therapists to design the flow of the text. If the ML model is to be trained on well-structured data, the script and workflow must be designed with that in mind.

Scripted vs Unscripted

You can choose between using a scripted text or a natural or unscripted text to be read by the participants.

In scripted speech, the participants read what is displayed on the screen. This method is mostly used to record commands or instructions.

For example — ‘Turn off the music,’ ‘Press 1 to record.’

In unscripted speech, the participants are given scenarios and asked to frame their own sentences and speak as naturally as possible.

For example — ‘Can you please tell me where the next gas station is?’

Utterance Collection / Wake Words

If scripted text is used, you have to decide how many scripts will be used and whether each participant will read a unique script or a group of scripts. Also determine whether the script contains a collection of wake words and commands.

For example –

Command 1:

“Alexa, what is the recipe for a chocolate cupcake?”

“Ok Google, what is the recipe for a chocolate cupcake?”

“Siri, what is the recipe for a chocolate cupcake?”

Command 2:

“Alexa, when is the flight to New York?”

“Ok Google, when is the flight to New York?”

“Siri, when is the flight to New York?”
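
A script like the one above can be generated by pairing every wake word with every command. The following is a minimal sketch based on the examples in this section; the `build_prompts` helper is a hypothetical name, and a real script would usually come from a reviewed prompt list.

```python
from itertools import product

wake_words = ["Alexa", "Ok Google", "Siri"]
commands = [
    "what is the recipe for a chocolate cupcake?",
    "when is the flight to New York?",
]

def build_prompts(wake_words, commands):
    """Return one prompt per (command, wake word) pair, grouped by command."""
    # Commands vary slowest so that all wake words for one command stay together.
    return [f"{wake}, {command}" for command, wake in product(commands, wake_words)]

for prompt in build_prompts(wake_words, commands):
    print(prompt)
```

Three wake words and two commands give six prompts, which makes it easy to see how quickly the script grows as either list gets longer.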

Audio requirements and formats

Audio quality plays a crucial role in the speech recognition data collection process. Distracting background noises can negatively impact the quality of collected voice notes. This might also decrease the effectiveness of the voice recognition algorithm.

Audio Quality

The quality of the recordings and the presence of background noise can affect the outcome of the project, although some speech data collections accept the presence of noise. Either way, it is advisable to understand the requirements for bit rate, signal-to-noise ratio, amplitude, and more.
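
To make two of these terms concrete, the sketch below computes the bit rate of uncompressed PCM audio and a signal-to-noise ratio in decibels. It assumes that separate signal and noise samples are available, and the 16 kHz, 16-bit, mono values are illustrative assumptions rather than recommendations from this article.

```python
import math

def pcm_bit_rate(sample_rate_hz, bit_depth, channels):
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bit_depth * channels

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB, from separate signal and noise samples."""
    return 20 * math.log10(rms(signal) / rms(noise))

print(pcm_bit_rate(16_000, 16, 1))  # 256000 bits per second

# Toy samples: the signal amplitude is ten times the noise amplitude.
signal = [0.5, -0.5, 0.5, -0.5]
noise = [0.05, -0.05, 0.05, -0.05]
print(round(snr_db(signal, noise), 1))  # 20.0
```

In practice the noise level is usually estimated from silent stretches of the same recording, which this sketch does not attempt.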

Format

The file format, data points, content structure, compression, and post-processing requirements also determine the quality of speech recordings.

File formats matter because the model has to read the output files and is trained to recognize that particular sound quality.
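
A format requirement can be checked automatically before delivery. The sketch below uses Python's standard `wave` module to read the properties of a WAV file and compare them with a required specification. The spec values and the helper names `wav_properties` and `spec_mismatches` are hypothetical, and the in-memory file exists only to keep the example self-contained.

```python
import io
import wave

def wav_properties(file_obj):
    """Read sample rate, sample width (in bytes), and channel count from a WAV file."""
    with wave.open(file_obj, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }

def spec_mismatches(properties, spec):
    """Return the names of the fields whose value differs from the required spec."""
    return [key for key, required in spec.items() if properties.get(key) != required]

# Build a tiny in-memory WAV (one second of silence) purely for demonstration.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

# Hypothetical client requirement: 16 kHz, 16-bit, mono.
spec = {"sample_rate": 16000, "sample_width_bytes": 2, "channels": 1}
print(spec_mismatches(wav_properties(buffer), spec))  # []
```

An empty list means the file matches the spec; any field names returned point at what needs to be re-recorded or converted.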

Define Custom Audio Requirement

Custom audio requirements should be stated before the collection process begins. Clients can, for example, ask for custom audio files in which specific recordings are grouped together.

Delivery and Processing Requirements

Once the speech data is gathered, the clients can choose to have it delivered according to their requirements.

Transcription and Annotation requirement

Some clients require the data to be transcribed and labeled before delivery. They might also require specific forms of labeling and segmentation.

Sometimes it is better to engage speech-language pathologists and other experts to help transcribe speech in various languages, so that the authenticity of the target language is maintained.

File naming conventions

The data collection forms should specify any file naming convention to be followed. If the naming convention is complex or beyond the standard scope of the process, it could incur extra development costs.
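
A naming convention is easiest to enforce when the same rule both builds and validates file names. The sketch below assumes a hypothetical convention of language code, speaker ID, and utterance ID (for example `en-AU_spk0007_utt0012.wav`); the pattern and the helper names are illustrative and would be replaced by whatever the client specifies.

```python
import re

# Hypothetical convention: <language>_spk<4 digits>_utt<4 digits>.wav
FILENAME_PATTERN = re.compile(
    r"^(?P<language>[a-z]{2}-[A-Z]{2})_spk(?P<speaker>\d{4})_utt(?P<utterance>\d{4})\.wav$"
)

def build_filename(language, speaker, utterance):
    """Build a file name that follows the assumed convention."""
    return f"{language}_spk{speaker:04d}_utt{utterance:04d}.wav"

def parse_filename(name):
    """Return the parts of a valid file name, or None if it breaks the convention."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return {
        "language": match["language"],
        "speaker": int(match["speaker"]),
        "utterance": int(match["utterance"]),
    }

name = build_filename("en-AU", 7, 12)
print(name)
print(parse_filename(name))
print(parse_filename("recording_final.wav"))  # None
```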

Delivery Guidelines

Security and delivery guidelines should be followed as specified in the project requirements. The requirements should also specify whether the data is to be delivered in small milestones or as a complete package. Clients also tend to prefer timely progress updates so that they can keep track of the project status.

Other Crucial Points to Note

The customizations will impact:

The data collection methods used
The recruitment of participants
The timeline for delivery
The tentative cost of the project

When selecting a vendor, make sure you go with someone who has both the experience to provide customization choices and the flexibility to scale the project effortlessly. Speech data collection evolves and its complexities change over time, and the right provider should be able to keep pace.

When all you need is flexibility and scalability, Shaip is the right choice. We offer customizable services based on your specific project requirements, with scalable and flexible data collection solutions for multilingual projects at competitive prices. Talk to our experts to learn how our speech data collection and customization techniques work in developing conversational AI.

Originally published at https://www.shaip.com.