Speaker Recognition

Project Definition:

This project identifies which celebrity is speaking using deep neural networks. One application is letting machines match spoken commands to individuals by their voices, so that in the future the machines can anticipate each person's commands.

Process:

1) Collected data from VoxCeleb [1], a database of celebrities’ voices and images.
2) Extracted the following audio features, using Aaqib Saeed’s code as a reference:

MFCC: Mel-frequency cepstral coefficients.
Melspectrogram: A Mel-scaled power spectrogram.
Chroma-stft: A chromagram computed from a waveform or power spectrogram. “In music, the term chroma feature or chromagram closely relates to the twelve different pitch classes. Chroma-based features, which are also referred to as ‘pitch class profiles’, are a powerful tool for analyzing music whose pitches can be meaningfully categorized and whose tuning approximates to the equal-tempered scale.” (Wikipedia)
Spectral Contrast: The level difference between peaks and valleys in the spectrum.
Tonnetz: Tonal centroid features.
3) Stored the features in a pandas DataFrame and prepared the data for modelling.
4) Ran the models.

Visualization examples of mel and tonnetz features:

Miranda Cosgrove’s Mel	Smokey Robinson’s Mel

Miranda Cosgrove’s Tonnetz	Smokey Robinson’s Tonnetz

Best Model:

The best model was an 18-layer CNN using the SELU activation function, which achieved an F1-score of 0.73.
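
For readers unfamiliar with the two terms above, here is a small sketch of the SELU activation and of a multi-class F1-score. The averaging method behind the 0.73 figure is not stated, so macro-averaging is an assumption here, and the speaker labels in the usage note are made up.

```python
import math

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit for a single float."""
    if x > 0:
        return SELU_SCALE * x
    return SELU_SCALE * SELU_ALPHA * (math.exp(x) - 1.0)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, macro_f1(["a", "a", "b", "b", "c", "c"], ["a", "b", "b", "b", "c", "a"]) averages per-class scores of 0.5, 0.8 and about 0.67.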

Error and Validation Plots:

ROC Curve for all celebrities:
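
A per-celebrity ROC curve is usually built one-vs-rest: the model's score for one celebrity is compared against whether each clip truly belongs to that celebrity. The sketch below shows the idea in plain Python; the function names and the example numbers are illustrative assumptions, not the code or data behind the plot.

```python
def roc_points(scores, is_positive):
    """Return (false positive rate, true positive rate) points for one class."""
    pairs = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, positive in pairs if positive)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative clip")
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every clip tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Calling roc_points once per celebrity, with is_positive marking that celebrity's clips, gives one curve per class for a combined plot.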

Demo:

I challenged my audience to guess the celebrity from an audio clip I played, then ran my model to identify the speaker as well. My model won, with 3 more correct guesses than the audience.

Citation:
[1] A. Nagrani, J. S. Chung, and A. Zisserman. "VoxCeleb: a large-scale speaker identification dataset." INTERSPEECH, 2017.