Friday, April 19, 2024

Spoken language recognition on Mozilla Common Voice: Audio Transformations. | by Sergey Vilov | Aug, 2023


Towards Data Science
Photo by Kelly Sikkema on Unsplash

This is the third article on spoken language recognition based on the Mozilla Common Voice dataset. In Part I, we discussed data selection and data preprocessing, and in Part II we analysed the performance of several neural network classifiers.

The final model achieved 92% accuracy and 97% pairwise accuracy. Since this model suffers from somewhat high variance, the accuracy could potentially be improved by adding more data. One very common way to get extra data is to synthesize it by performing various transformations on the available dataset.

In this article, we will consider 5 popular transformations for audio data augmentation: adding noise, changing speed, changing pitch, time masking, and cut & splice.

The tutorial notebook can be found here.

For illustration purposes, we will use the sample common_voice_en_100040 from the Mozilla Common Voice (MCV) dataset. This is the sentence “The burning fire had been extinguished.”

import numpy as np
import librosa as lr
import IPython.display

signal, sr = lr.load('./transformed/common_voice_en_100040.wav', res_type='kaiser_fast') #load signal

IPython.display.Audio(signal, rate=sr) #play the signal

Original sample common_voice_en_100040 from MCV.
Original signal waveform (image by the author)

Adding noise is the simplest audio augmentation. The amount of noise is characterised by the signal-to-noise ratio (SNR), defined here as the ratio between the maximal signal amplitude and the standard deviation of the noise. We will generate several noise levels, defined by their SNR, and see how they change the signal.

SNRs = (5,10,100,1000) #signal-to-noise ratio: max amplitude over noise std

noisy_signal = {}

for snr in SNRs:

    noise_std = max(abs(signal))/snr #get noise std
    noise = noise_std*np.random.randn(len(signal),) #generate noise with given std

    noisy_signal[snr] = signal+noise

IPython.display.display(IPython.display.Audio(noisy_signal[5], rate=sr))
IPython.display.display(IPython.display.Audio(noisy_signal[1000], rate=sr))

Signals obtained by superimposing noise with SNR=5 and SNR=1000 on the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform for several noise levels (image by the author)

So, SNR=1000 sounds almost like the unperturbed audio, while at SNR=5 one can only distinguish the strongest parts of the signal. In practice, the SNR level is a hyperparameter that depends on the dataset and the chosen classifier.
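
For use inside a training pipeline, the noise step can be wrapped into a small function. The following is a minimal sketch, not code from the tutorial notebook: the function name add_noise, the rng argument, and the synthetic test tone are assumptions made for illustration. It keeps the SNR definition used in this article (maximal amplitude over noise standard deviation).

```python
import numpy as np

def add_noise(signal, snr, rng=None):
    """Return the signal plus white Gaussian noise.

    snr follows the article's definition: max |signal| divided by the noise std.
    (add_noise and rng are illustrative names, not part of the original notebook.)
    """
    rng = np.random.default_rng() if rng is None else rng
    noise_std = np.max(np.abs(signal)) / snr
    noise = noise_std * rng.standard_normal(len(signal))
    return signal + noise

# Synthetic 1-second 440 Hz tone standing in for a real recording
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

noisy = add_noise(tone, snr=10, rng=np.random.default_rng(0))
measured_std = np.std(noisy - tone)  # should be close to 0.5 / 10 = 0.05
```

Passing a seeded generator makes the augmentation reproducible, which can help when comparing classifiers trained with different SNR settings.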

The simplest way to change the speed is just to pretend that the signal has a different sample rate. However, this will also change the pitch (how low/high in frequency the audio sounds). Increasing the sampling rate will make the voice sound higher. To illustrate this, we will “increase” the sampling rate for our example by a factor of 1.5:

IPython.display.Audio(signal, rate=sr*1.5)

Signal obtained by using a false sampling rate for the original MCV sample common_voice_en_100040 (generated by the author).
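
The effect of a false sampling rate on both duration and pitch can be checked numerically. The sketch below uses a synthetic 200 Hz tone instead of the MCV sample; the helper dominant_frequency is a hypothetical name introduced here for illustration. The samples are left untouched and only the claimed rate changes.

```python
import numpy as np

def dominant_frequency(signal, sr):
    """Frequency (Hz) of the largest FFT peak, interpreted at sample rate sr."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return freqs[np.argmax(spectrum)]

sr = 16000
t = np.arange(2 * sr) / sr             # 2 seconds of audio
tone = np.sin(2 * np.pi * 200 * t)     # 200 Hz test tone (stand-in for speech)

factor = 1.5
fake_sr = sr * factor                  # same samples, different claimed rate

duration_true = len(tone) / sr         # 2.0 s
duration_fake = len(tone) / fake_sr    # shorter by the factor 1.5
f_true = dominant_frequency(tone, sr)        # about 200 Hz
f_fake = dominant_frequency(tone, fake_sr)   # about 300 Hz, i.e. 1.5 times higher
```

Both the duration and the frequency scale by the same factor, which is why this trick cannot change the speed independently of the pitch.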

Changing the speed without affecting the pitch is more challenging. One needs to use the Phase Vocoder (PV) algorithm. In brief, the input signal is first split into overlapping frames. Then, the spectrum within each frame is computed by applying the Fast Fourier Transform (FFT). The playing speed is then modified by resynthesizing frames at a different rate. Since the frequency content of each frame is not affected, the pitch remains the same. The PV interpolates between the frames and uses the phase information to achieve smoothness.
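
To make these steps concrete, here is a compact NumPy sketch of a phase vocoder. It is not the stretch_wo_loop implementation used in this article; the helper names (stft, istft, phase_vocoder), the frame size n_fft=1024 and the hop of 256 samples are assumptions chosen for illustration. Magnitudes are linearly interpolated between neighbouring frames, and the phase is accumulated from the measured phase advance so that consecutive output frames stay coherent.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Hann-windowed short-time Fourier transform, shape (n_frames, n_fft//2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(S, n_fft=1024, hop=256):
    """Overlap-add inverse of stft() with window-sum normalisation."""
    window = np.hanning(n_fft)
    out = np.zeros(n_fft + hop * (S.shape[0] - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(np.fft.irfft(S, n=n_fft, axis=1)):
        out[i * hop:i * hop + n_fft] += frame * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

def phase_vocoder(S, rate, n_fft=1024, hop=256):
    """Resample STFT frames in time by `rate` (rate > 1 speeds up the audio)."""
    n_frames, n_bins = S.shape
    steps = np.arange(0, n_frames - 1, rate)                 # fractional frame positions
    expected = 2 * np.pi * hop * np.arange(n_bins) / n_fft   # nominal phase advance per hop
    phase = np.angle(S[0])
    out = np.empty((len(steps), n_bins), dtype=complex)
    for k, step in enumerate(steps):
        i = min(int(step), n_frames - 2)                     # guard against rounding at the end
        frac = min(step - i, 1.0)
        # interpolate the magnitude between neighbouring frames
        mag = (1 - frac) * np.abs(S[i]) + frac * np.abs(S[i + 1])
        out[k] = mag * np.exp(1j * phase)
        # measured phase advance between frames i and i+1, wrapped to [-pi, pi]
        dphase = np.angle(S[i + 1]) - np.angle(S[i]) - expected
        dphase -= 2 * np.pi * np.round(dphase / (2 * np.pi))
        phase = phase + expected + dphase
    return out

# Synthetic 2-second 440 Hz tone standing in for a real recording
sr = 16000
t = np.arange(2 * sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

S_fast = phase_vocoder(stft(tone), rate=1.3)   # stretched spectrum (frequency domain)
stretched = istft(S_fast)                      # back to the time domain, about 1.3 times shorter
```

Note that phase_vocoder returns the stretched spectrum itself, so the inverse transform can be skipped when the next processing step works on spectra anyway.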

For our experiments, we will use the stretch_wo_loop time stretching function from this PV implementation.

stretching_factor = 1.3

signal_stretched = stretch_wo_loop(signal, stretching_factor)

IPython.display.Audio(signal_stretched, rate=sr)

Signal obtained by varying the speed of the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after speed increase (image by the author)

So, the duration of the signal decreased since we increased the speed. However, one can hear that the pitch has not changed. Note that when the stretching factor is substantial, the phase interpolation between frames might not work well. As a result, echo artefacts may appear in the transformed audio.

To alter the pitch without affecting the speed, we can use the same PV time stretch but pretend that the signal has a different sampling rate such that the total duration of the signal stays the same:

IPython.display.Audio(signal_stretched, rate=sr/stretching_factor)

Signal obtained by varying pitch of the original MCV sample common_voice_en_100040 (generated by the author).
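
The size of the resulting pitch change follows directly from the stretching factor: playing the stretched signal at sr/stretching_factor scales all frequencies by 1/stretching_factor. The sketch below converts this ratio to semitones and, as an assumed alternative to changing the playback rate, resamples the signal with simple linear interpolation so that it can be stored at the original sampling rate. The function names and the synthetic tone are illustrative, and a production pipeline would normally use a proper band-limited resampler.

```python
import numpy as np

def pitch_shift_semitones(stretching_factor):
    """Pitch change in semitones when a signal time-stretched by
    stretching_factor is played back at sr / stretching_factor."""
    return 12 * np.log2(1.0 / stretching_factor)

def resample_linear(signal, factor):
    """Crude linear-interpolation resampler producing about len(signal)*factor samples.
    Playing the result at sr sounds like playing `signal` at sr / factor."""
    n_out = int(round(len(signal) * factor))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)

stretching_factor = 1.3
shift = pitch_shift_semitones(stretching_factor)   # negative value: the pitch goes down

# Stand-in for a 1-second clip after a 1.3x speed-up: shorter, pitch still 440 Hz
sr = 16000
n_stretched = int(sr / stretching_factor)
t = np.arange(n_stretched) / sr
stretched_tone = np.sin(2 * np.pi * 440 * t)

# Back to about 1 second at the original sr, with the pitch lowered to about 440 / 1.3 Hz
shifted = resample_linear(stretched_tone, stretching_factor)
```

A stretching factor below 1 slows the audio down first and therefore raises the pitch after the rate correction.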

Why do we ever bother with this PV while librosa already has time_stretch and pitch_shift functions? Well, these functions transform the signal back to the time domain. When you need to compute embeddings afterwards, you will lose time on redundant Fourier transforms. On the other hand, it is easy to modify the stretch_wo_loop function such that it yields Fourier output without taking the inverse transform. One could probably also try to dig into the librosa code to achieve similar results.

The next two transformations, time masking and cut & splice, were originally proposed in the frequency domain (Park et al. 2019). The idea was to save time on FFT by using precomputed spectra for audio augmentations. For simplicity, we will demonstrate how these transformations work in the time domain. The listed operations can be easily transferred to the frequency domain by replacing the time axis with frame indices.

Time masking

The idea of time masking is to cover up a random region in the signal. The neural network then has fewer chances to learn signal-specific temporal variations that are not generalizable.

max_mask_length = 0.3 #maximum mask duration, proportion of signal length

L = len(signal)

mask_length = int(L*np.random.rand()*max_mask_length) #randomly choose mask length
mask_start = int((L-mask_length)*np.random.rand()) #randomly choose mask position

masked_signal = signal.copy()
masked_signal[mask_start:mask_start+mask_length] = 0

IPython.display.Audio(masked_signal, rate=sr)

Signal obtained by applying time mask transformation on the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after time masking (the masked region is indicated with orange) (image by the author)
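
The same masking logic carries over to the frequency domain by indexing frames instead of samples. The following is a minimal sketch under the assumption that the spectrogram is stored as an array of shape (n_freq, n_frames); the function name time_mask_frames and the toy all-ones spectrogram are illustrative and not taken from the notebook.

```python
import numpy as np

def time_mask_frames(spec, max_mask_fraction=0.3, rng=None):
    """Zero out a random run of consecutive frames in a (n_freq, n_frames) array.

    Returns the masked copy and the (start, length) of the mask in frames.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = spec.shape[1]
    mask_length = int(n_frames * rng.random() * max_mask_fraction)  # random mask length
    mask_start = int((n_frames - mask_length) * rng.random())       # random mask position
    masked = spec.copy()
    masked[:, mask_start:mask_start + mask_length] = 0
    return masked, (mask_start, mask_length)

# Toy spectrogram: 64 frequency bins, 200 frames, all ones
spec = np.ones((64, 200))
masked, (start, length) = time_mask_frames(spec, rng=np.random.default_rng(0))
```

Because the input is copied, the precomputed spectrogram can be reused to draw a different mask in every training epoch.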

Cut & splice

The idea is to replace a randomly chosen region of the signal with a random fragment from another signal having the same label. The implementation is almost the same as for time masking, except that a piece of another signal is placed instead of the mask.

other_signal, sr = lr.load('./common_voice_en_100038.wav', res_type='kaiser_fast') #load second signal

max_fragment_length = 0.3 #maximum fragment duration, proportion of signal length

L = min(len(signal), len(other_signal))

mask_length = int(L*np.random.rand()*max_fragment_length) #randomly choose fragment length
mask_start = int((L-mask_length)*np.random.rand()) #randomly choose fragment position

synth_signal = signal.copy()
synth_signal[mask_start:mask_start+mask_length] = other_signal[mask_start:mask_start+mask_length]

IPython.display.Audio(synth_signal, rate=sr)

Synthetic signal obtained by applying cut&splice transformation on the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after cut&splice transformation (the inserted fragment from the other signal is indicated with orange) (image by the author)
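
The snippet above takes the fragment from the same position in the second signal. A possible variant, sketched below as an assumption rather than the author's method, draws the source position independently, so the inserted fragment can come from anywhere in the other signal. The function name cut_and_splice and the toy arrays are illustrative.

```python
import numpy as np

def cut_and_splice(signal, other_signal, max_fragment_fraction=0.3, rng=None):
    """Replace a random region of `signal` with a random fragment of `other_signal`.

    Returns the new signal and (destination start, source start, fragment length).
    """
    rng = np.random.default_rng() if rng is None else rng
    L = min(len(signal), len(other_signal))
    frag_length = int(L * rng.random() * max_fragment_fraction)        # random fragment length
    dst_start = int((len(signal) - frag_length) * rng.random())        # where to insert
    src_start = int((len(other_signal) - frag_length) * rng.random())  # where to cut from
    out = signal.copy()
    out[dst_start:dst_start + frag_length] = other_signal[src_start:src_start + frag_length]
    return out, (dst_start, src_start, frag_length)

# Toy signals: zeros as the target, ones as the donor, so the splice is easy to see
a = np.zeros(1000)
b = np.ones(800)
synth, (dst, src, n) = cut_and_splice(a, b, rng=np.random.default_rng(1))
```

As in the original snippet, both signals are expected to carry the same label, so the synthetic sample keeps that label.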
