VAD + resampling | High resolution spectrogram
VAD + Resampling
Although the spoken words are short, the recordings contain a lot of silence. A decent VAD (voice activity detector) can shrink the training data considerably, which speeds up training. Here we trim the silence from the beginning and the end of each file.
Most speech-related frequencies lie in the lower bands (up to ~8000 Hz), so the recordings can be resampled to a lower rate without losing much speech content.
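The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the function names `trim_silence` and `downsample`, the 20 ms frame length, the -40 dB threshold, and the integer decimation factor are all assumptions chosen for the example, and the VAD is a plain energy threshold rather than any specific library.

```python
import numpy as np


def trim_silence(signal, sample_rate, frame_ms=20, threshold_db=-40.0):
    """Drop leading and trailing frames that are much quieter than the loudest frame.

    Hypothetical energy-based VAD: frame_ms and threshold_db are illustrative values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return signal
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
    peak = rms.max()
    if peak == 0:
        return signal[:0]  # the whole clip is silent
    # Frame level relative to the loudest frame, in dB.
    level_db = 20 * np.log10(np.maximum(rms, 1e-12) / peak)
    active = np.flatnonzero(level_db > threshold_db)
    start = active[0] * frame_len
    end = (active[-1] + 1) * frame_len
    return signal[start:end]


def downsample(signal, factor, taps=101):
    """Low-pass with a windowed-sinc FIR filter, then keep every factor-th sample."""
    n = np.arange(taps) - (taps - 1) / 2
    # Cutoff at the new Nyquist frequency (old sample rate / (2 * factor)).
    kernel = np.sinc(n / factor) * np.hamming(taps)
    kernel /= kernel.sum()
    filtered = np.convolve(signal, kernel, mode="same")
    return filtered[::factor]
```

For example, `downsample(trim_silence(clip, 16000), 2)` would turn a 16 kHz clip into a trimmed 8 kHz one. The low-pass filter before decimation is there to avoid aliasing; only interior silence is left untouched, since the trim looks at the first and last active frames.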
Move main.py to where the .wav files are located.
python3 main.py [--opt OPT] [--path PATH]
Preprocessing of Speech
optional arguments:
--opt OPT preprocessing mode: vad=1, resampling=2, vad+resampling=3 (default: 3)
--path PATH wav file location (default: current directory)
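For readers who want to reproduce this interface, one way such options could be declared with argparse is sketched below. It is an assumption based only on the help text above, not the actual contents of main.py; in particular, `build_parser` is a hypothetical name and using "." for the current directory is a guess.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the help text shown above."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="Preprocessing of Speech"
    )
    parser.add_argument(
        "--opt",
        type=int,
        default=3,
        help="preprocessing mode: vad=1, resampling=2, vad+resampling=3 (default: 3)",
    )
    parser.add_argument(
        "--path",
        default=".",  # assumed spelling of "current directory"
        help="wav file location (default: current directory)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.opt, args.path)
```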
High resolution spectrogram
high_resolution_mel_spectrogram.py runs FFTs with several window sizes, aligns their centers, and then applies mel weighting to combine them.
A single FFT forces a trade-off. Short windows give good time resolution but lack frequency breadth (they cannot resolve the lower frequencies), whereas long windows give good frequency breadth but lack time precision (at higher frequencies a window spans many wavelengths). Here we combine FFTs of varying window length to tackle this.
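The idea can be illustrated with a small NumPy sketch. It is not the code in high_resolution_mel_spectrogram.py: the function names, the window sizes (256, 512, 1024), the hop of 128 samples, and above all the final step (a plain average of the per-window mel spectrograms) are assumptions made for the example, since the script's actual weighting is not described here.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def centered_mel_spectrogram(signal, sample_rate, win, hop, n_mels):
    """Mel power spectrogram whose frame t is centered on sample t * hop."""
    # Zero-pad so that every window size shares the same frame centers.
    padded = np.pad(signal.astype(np.float64), (win // 2, win - win // 2))
    n_frames = 1 + len(signal) // hop
    window = np.hanning(win)
    frames = np.stack([padded[t * hop : t * hop + win] for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    power /= np.sum(window ** 2)  # make different window sizes comparable
    return power @ mel_filterbank(n_mels, win, sample_rate).T


def multi_window_mel(signal, sample_rate, wins=(256, 512, 1024), hop=128, n_mels=40):
    """Average center-aligned mel spectrograms from several window sizes.

    Averaging is an assumed combination rule, used only for illustration.
    """
    specs = [
        centered_mel_spectrogram(signal, sample_rate, w, hop, n_mels) for w in wins
    ]
    return np.mean(specs, axis=0)  # shape (n_frames, n_mels)
```

Because the hop and the frame centers are shared, every window size yields the same number of frames, so the spectrograms can be stacked directly. Note that with the shortest window some of the narrow low-frequency mel filters may contain no FFT bin at all, which is the "no lower frequencies" limitation mentioned above; a per-band weighting that favors long windows at low frequencies would be a natural refinement.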
python3 high_resolution_mel_spectrogram.py [--path PATH]
Preprocessing of Speech
optional arguments:
--path PATH preprocessed(VAD/resampling) wav file location (default: current directory)