When we speak, we create sinusoidal vibrations in the air.

Sound -> audio signal: sound vibrations cause pressure waves in the air that can be detected with a microphone and transduced into a signal.

Figure: audio signal for "Hello World"

The waveform shows two blobs, one per word ("hello" and "world").

Amplitude: how loud the sound is, i.e. how much energy it carries.

Speech is made up of multiple frequencies.

Use the Fast Fourier Transform (FFT) to convert a sound into its component frequencies.

Fourier analysis is the study of decomposing mathematical functions into sums of simpler trigonometric functions. Since sound consists of oscillating vibrations, we can use Fourier analysis and Fourier transforms to decompose an audio signal into component sinusoidal functions at varying frequencies.

A spectrogram is the frequency-domain representation of the audio signal through time. It's created by splitting the audio signal into short time frames, decomposing each frame into its component frequencies, and plotting the result with respect to time. The intensity of color in the spectrogram at any given point indicates the amplitude of the signal at that time and frequency. The following reference includes interesting slides showing how sounds in spectrograms can be "read" by experts.
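The framing idea behind a spectrogram can be sketched in plain Python. This is only an illustration under stated assumptions: the test signal (a 1000 Hz tone followed by a 2000 Hz tone), the 8000 Hz sample rate, the frame size of 64, the hop of 32, and the Hann window are all choices made for this example, and the naive DFT used here is far slower than what a real library would do.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Naive DFT: magnitude of each non-negative frequency bin of one frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def spectrogram(signal, frame_size=64, hop=32):
    """Split the signal into overlapping frames and take the DFT of each.

    Returns one row per time frame; each row holds the magnitude per
    frequency bin (bin k corresponds to k * sample_rate / frame_size Hz).
    """
    rows = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # A Hann window tapers the frame edges to reduce spectral leakage.
        windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1)))
                    for i, x in enumerate(frame)]
        rows.append(dft_magnitudes(windowed))
    return rows

# Assumed test signal: 0.1 s of a 1000 Hz tone, then 0.1 s of a 2000 Hz tone.
sample_rate = 8000
half = sample_rate // 10
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(half)]
signal += [math.sin(2 * math.pi * 2000 * t / sample_rate) for t in range(half)]

spec = spectrogram(signal)
bin_hz = sample_rate / 64  # width of one frequency bin in Hz

# Strongest frequency in the first and last time frames.
first_peak = max(range(len(spec[0])), key=lambda k: spec[0][k]) * bin_hz
last_peak = max(range(len(spec[-1])), key=lambda k: spec[-1][k]) * bin_hz
print(first_peak, last_peak)
```

Each row of `spec` is one vertical slice of the spectrogram. Plotting the rows side by side, with color mapped to magnitude, gives the familiar picture in which the dominant frequency moves as the sound changes.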

The Fast Fourier Transform (FFT) is an efficient algorithm for computing the Discrete Fourier Transform (DFT). It transforms a signal made up of a sum of sinusoids into its pure frequency components.
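As a rough sketch of how this works, the block below implements the textbook recursive radix-2 (Cooley-Tukey) variant of the FFT and applies it to a synthetic signal. The signal (a 50 Hz sinusoid of amplitude 1.0 plus a 120 Hz sinusoid of amplitude 0.5), the sample rate, and the length are assumptions chosen so that each frequency bin is exactly 1 Hz wide. Production libraries use more optimized variants.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # DFT of the even-indexed samples
    odd = fft(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

# Assumed test signal: one second sampled at 1024 Hz, so bin k is k Hz.
sample_rate = 1024
n = 1024
signal = [
    1.0 * math.sin(2 * math.pi * 50 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 120 * t / sample_rate)
    for t in range(n)
]

spectrum = fft(signal)

# Scale magnitudes so a pure sinusoid of amplitude A shows up as A.
amps = [2 * abs(c) / n for c in spectrum[: n // 2]]

# The two strongest bins recover the two component frequencies.
peaks = sorted(range(len(amps)), key=lambda k: amps[k], reverse=True)[:2]
for k in peaks:
    print(f"{k} Hz, amplitude {amps[k]:.2f}")
```

The recursion splits the input into even and odd samples at every level, which is what brings the cost down from the O(N^2) of a direct DFT to O(N log N).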

# Mel Frequency Cepstral Coefficients (MFCCs)

Mel Frequency Cepstrum Coefficient Analysis is the reduction of an audio signal to essential speech component features using both mel frequency analysis and cepstral analysis.

The range of frequencies is reduced and binned into groups of frequencies that humans can distinguish. The signal is further separated into source and filter so that variations between speakers unrelated to articulation can be filtered away.
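The binning step can be illustrated with a small sketch. It assumes one commonly used Hz-to-mel formula, `mel = 2595 * log10(1 + hz / 700)`; other conventions exist. The 0 to 8000 Hz range and the 10 bands are arbitrary choices for this example, and the helper names are hypothetical. It covers only the mel binning, not the source/filter (cepstral) separation, and real filterbanks typically use overlapping triangular filters rather than hard band edges.

```python
import math

def hz_to_mel(hz):
    """Convert Hz to mels (one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(low_hz, high_hz, n_bands):
    """Edges, in Hz, of n_bands bands equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / n_bands
    return [mel_to_hz(low_mel + i * step) for i in range(n_bands + 1)]

edges = mel_band_edges(0.0, 8000.0, 10)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:8.1f} Hz to {hi:8.1f} Hz")
```

The bands are equally wide in mels but get progressively wider in Hz, which mirrors the fact that human hearing resolves low frequencies more finely than high ones.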

There are typically 12 to 13 MFCC features per frame, and up to 39 in total when the optional delta and delta-delta features are included (13 + 13 + 13).
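The arithmetic behind the 39 can be sketched as follows. The input values are made-up placeholders standing in for 13 MFCCs per frame, not real features, and the delta here is a simple central difference between neighboring frames; speech toolkits usually compute deltas with a regression over a wider window, so treat this as an illustration only. The function names are hypothetical.

```python
def deltas(frames):
    """Central-difference rate of change of each coefficient across frames.

    Edge frames reuse themselves as the missing neighbor.
    """
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def add_deltas(frames):
    """Append delta and delta-delta features to each frame."""
    d1 = deltas(frames)  # first-order change (delta)
    d2 = deltas(d1)      # change of the change (delta-delta)
    return [c + v + a for c, v, a in zip(frames, d1, d2)]

# Placeholder data: 5 frames of 13 made-up coefficient values each.
mfcc_frames = [[float(t * (i + 1)) for i in range(13)] for t in range(5)]

features = add_deltas(mfcc_frames)
print(len(features), len(features[0]))
```

Each frame keeps its 13 static coefficients and gains 13 deltas and 13 delta-deltas, which describe how the coefficients are changing over time.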
