Computer Audition Toolbox documentation 

The CATbox is maintained by Shlomo Dubnov and Mehrdad Yazdani.
WAVREAD/WRITE MP3READ/WRITE MIDIREAD/WRITE 2. Representation and some applications STFT Const Q Inst freq AR models (LPC) Sinusoidal models (YASA) MFCC ERB filters Onset Detection Spectral Envelop

Timewarping Factor Oracle Suffix Trees Euclidean KullbackLeibler 6.Selfsimilarity and clustering Recurrence/similarity matrix Kmeans GMM HMM SVM 
1. Data i/o
This section deals with reading/writing digital audio in MATLAB. The file types that we consider are WAV, MP3, and MIDI.
WAVREAD
Syntax:
y =
WAVREAD(filename)
[y,Fs,bits] = WAVREAD (filename)
[...] = WAVREAD
(filename,N)
[...] = WAVREAD (filename,[N1 N2])
siz = WAVREAD
(filename,'size')
Comments:
wavread supports multichannel data, with up to 32 bits per sample, and supports reading 24 and 32bit .wav files.
y = wavread(filename) loads a WAVE file specified by filename, returning the sampled data in y. The filename input is a string enclosed in single quotes. The .wav extension is appended if no extension is given. Amplitude values are in the range [1,+1].
[y,Fs,bits] = wavread(filename) returns the sample rate (Fs) in Hertz and the number of bits per sample (bits) used to encode the data in the file.
[...] = wavread(filename,N) returns only the first N samples from each channel in the file.
[...] = wavread(filename,[N1 N2]) returns only samples N1 through N2 from each channel in the file.
siz = wavread(filename,'size') returns the size of the audio data contained in the file in place of the actual audio data, returning the vector siz = [samples channels].
WAVWRITE
Syntax:
WAVWRITE(y,filename)
WAVWRITE(y,Fs,filename)
WAVWRITE
(y,Fs,N,filename)
Comments:
wavwrite writes data to 8, 16, 24, and 32bit .wav files.
wavwrite(y,filename) writes the data stored in the variable y to a WAVE file called filename. The filename input is a string enclosed in single quotes. The data has a sample rate of 8000 Hz and is assumed to be 16bit. Each column of the data represents a separate channel. Therefore, stereo data should be specified as a matrix with two columns. Amplitude values outside the range [1,+1] are clipped prior to writing.
wavwrite(y,Fs,filename) writes the data stored in the variable y to a WAVE file called filename. The data has a sample rate of Fs Hz and is assumed to be 16bit. Amplitude values outside the range [1,+1] are clipped prior to writing.
wavwrite(y,Fs,N,filename) writes the data stored in the variable y to a WAVE file called filename. The data has a sample rate of Fs Hz and is Nbit, where N is 8, 16, 24, or 32. For N < 32, amplitude values outside the range [1,+1] are clipped.
Note 8, 16, and 24bit files are type 1 integer pulse code modulation (PCM). 32bit files are written as type 3 normalized floating point.
MP3READ
Syntax:
[Y,FS,NBITS,OPTS] = MP3READ(FILE,N,MONO,DOWNSAMP)
Comments:
Reads MP3 audio file.
FILE is an MP3 audio file name.
Optional scalar N limits the number of frames to read, or, if a twoelement vector, specifies the first and last frames to read.
Optional flag MONO forces result to be mono if nonzero.
Optional factor DOWNSAMP performs downsampling by 2 or 4.
Y returns the audio data, one channel per column; FS is the sampling rate. NBITS is the bit depth (always 16). OPTS.fmt is a format info string.
File history:
20030720 dpwe@ee.columbia.edu  This version calls mpg123.
20040831 Fixed to read whole files correctly
20040908 Uses mp3info to get info about mp3 files too
20040918 Reports all mp3info fields in OPTS.fmt; handles MPG2LSF sizes + added MONO, DOWNSAMP flags, changed default behavior.
20050411 added r m for report of median VBR in mp3info (thx Merrie Morris)
2006086 Modified by Shlomo Dubnov to allow empty second argument N
$Header: /homes/dpwe/matlab/columbiafns/RCS/mp3read.m,v 1.7 2005/04/11 20:55:16 dpwe Exp $
MP3WRITE
Syntax:
MP3WRITE(Y,FS,NBITS, FILE)
Comments:
mp3write(D,SR,FILE,OPTIONS)
Waveform data from D at sample rate SR is written to a file named FILE as MP3 compressed audio. OPTIONS specifies flags to lame encoder (default 'quiet h');
File History:
20051110 Original version
READMIDI
Syntax:
NMAT = READMIDI(MIDIFILENAME)
Comments:
Conversion of MIDI file to note matrix.
Input argument:
FILENAME = name of the MIDI file (string)
Output:
NMAT = notematrix
WRITEMIDI
Syntax:
WRITEMIDI(NMAT, MIDIFILENAME, TPQ, TEMPO, TSIG1, TSIG2)
Comments:
Creates a MIDI file from a NMAT
Input arguments: NMAT = notematrix
OFNAME = Output filename (*.mid)
TPQ (Optional) = Ticks per quarter note (default 120)
TEMPO (Optional) = bpm, beats per minute (default 100)
TSIG1&2 (Optional) = Timesignature, e.g. 6/8 > TSIG1 = 6, TSIG2 = 8 (default 4)
Output: MIDI file
Remarks: TEXT2MIDI converter needs to be handled differently in PC and Mac.
Example: writemidi(a,'demo.mid'); creates a file name DEMO.MID from notematrix A with default settings.
2. Representation and some applications
Using representations of audio/music with digital formats such as WAV and MIDI enables us to derive many “higher level” representations. This section outlines classical Fourierbased techniques such as the shorttime Fourier transform and newer methods such as the chromagram.
ShortTime Fourier Transform and Spectrum Estimation:
STFT
Syntax:
[B] = STFT(winlen,overlap,nfft)
Comments:
This function calculates the STFT for a samples
a  sample vector
win  window samples or window size (for hanning window)
overlap  overlap in samples
nfft  fft size (could be >= winlen)
(c) Shlomo Dubnov sdubnov@ucsd.edu
ISTFT
Syntax:
xr = ISTFT(X,hopfac,winlen)
Comments:
This function calculates the inverse STFT for a STFT matrix
X  STFT matrix (bins 1:nfft/2+1)
hopfac  hop factor. This is an integer specifying the number of analysis hops occurring within a signal frame that is one window in length. In other words, winlen/hopfac is the hop size in saamples and winlen*(11/hopfac) is the overlap in samples.
winlen  window length in samples
type  'perfect' or 'smooth'
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
[a] = ISTFTW2 (B,hopfactor,window)
Comments:
>>?????????????????????????<<
Syntax:
x = OLA(xmat,hop,win);
Comments:
Overlap add with extra smoothing. This window is used for extra smoothing.
xmat  input matrix containing signal frames
hop  analysis hop size in samples
win  posterior windowing function (window samples). Default hanning.
xi = TIMESTRECH(x,r)
Comments:
Time stretch a sound vector x by factor r.
Syntax:
xr = PITCHSHIFT(x,r)
Comments:
Pitch shift a sound vector x by factor r.
Syntax:
[X2,instfreq] = PVINTERP(X, t, hop)
Comments:
interpolate STFT matrix X according to new analysis frame indices
t starts at 1 and ends at the last column of X
[rec , D] = LSEE(y1,window,hopfactor) (“hear Pollock!”)
Instantaneous frequency:
Syntax:
out = GETINSTF(x,fs,Nfft)
Comments:
Interactive applications that allows to read the instantaneous frequency by clicking a spectrogram.
x  input sound file
fs  sampling frequency
Nfft  fft analysis
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
[F,D] = IFGRAM(X, N, W, H, SR)
Comments:
Instantaneous frequency by phase deriv.
X is a 1D signal. Process with Npoint FFTs applying a Wpoint window, stepping by H points; return (N/2)+1 channels with the instantaneous frequency (as a proportion of the sampling rate) obtained as the timederivative of the phase of the complex spectrum as described by Toshihiro Abe et al in ICASSP'95, Eurospeech'97
Same arguments and some common code as dpwebox/stft.m.
Calculates regular STFT as side effect  returned in D.
after 1998may02 dpwe@icsi.berkeley.edu
20010305 dpwe@ee.columbia.edu revised version
20011213 dpwe@ee.columbia.edu Fixed to work when N != W
Syntax:
[a,f,t,p] = instf(x,fs,Nfft,hopfac)
Comments:
Instantaneous Frequency using hanning window and hop 1 method
x  input sound
fs  sampling frequency
Nfft  fft size, equal to window size (default 512)
hopfac  hop factor (2 means 50% overlap, 4 means 75% overlap, default 4)
a  amplitude
f  instanteneous frequency
t  time
p  phasor (fft with unit magnitude)
(c) Shlomo Dubnov sdubnov@ucsd.edu
CONSTQ
Syntax:
cq= CONSTQ(x, sparKernel)
Comments:
>>?????????????????????????<<
CQGRAM
Syntax:
[CQX, F, T] = CQGRAM(x, winlen, overlap, minFreq, maxFreq, bins, fs, thresh)
Comments:
Makes a running constant Q analysis of a sound
winlen  size of signal frames submited to analysis in samples
overlap  size of overlap between frames in samples
minFreq, maxFreq, bins  constant Q analysis parameters: the range of frequencies
[minFreq maxFreq] is divided into "bins" number of bins per octave
fs  sampling frequency
thresh  threshold for Kernel calculation.
slowQ, constQ, sparseKernel by Benjamin Blankertz (slight modifications by Shlomo Dubnov).
See also Benjamin Blankertz paper.
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
cq = SLOWQ(x, minFreq, maxFreq, bins, fs)
Comments:
x  input sound
minFreq, maxFreq, bins  the range of frequencies [minFreq maxFreq] is
divided into "bins" number of bins per octave
Slow implementation of constant Q transfrom from Benjamin Blankertz
Syntax:
sparKernel= sparseKernel(minFreq, maxFreq, bins, fs, thresh)
Comments:
>>?????????????????????????<<
Sinusoidal Models (YASA):
>>?????????????????????????<<
SourceFilter/AutoRegressive Models:
LPC
Parametric methods
3. Features
The following are common features extracted from music/audio used in machine learning and signal processing applications.
MFCC
Syntax:
[ceps,freqresp,fb,fbrecon,freqrecon] = MFCC(input, samplingRate, [frameRate])
Comments:
Find the cepstral coefficients (ceps) corresponding to the input. Four other quantities are optionally returned that represent: the detailed fft magnitude (freqresp) used in MFCC calculation, the melscale filter bank output (fb), the filter bank output by inverting the cepstrals with a cosine transform (fbrecon), the smooth frequency response by interpolating the fb reconstruction (freqrecon)
 Malcolm Slaney, August 1993
Modified a bit to make testing an algorithm easier... 4/15/94
Fixed Cosine Transform (indices of cos() were swapped)  5/26/95
Added optional frameRate argument  6/8/95
Added proper filterbank reconstruction using inverse DCT  10/27/95
Added filterbank inversion to reconstruct spectrum  11/1/95
(c) 1998 Interval Research Corporation
ERBFILTERBANK
Syntax:
function output = ERBFilterBank(x, fcoefs)
Comments:
Process an input waveform with a gammatone filter bank. This function takes a single sound vector, and returns an array of filter outputs, one channel per row.
The fcoefs parameter, which completely specifies the Gammatone filterbank, should be designed with the MakeERBFilters function. If it is omitted, the filter coefficients are computed for you assuming a 22050Hz sampling rate and 64 filters regularly spaced on an ERB scale from fs/2 down to 100Hz.
Malcolm Slaney @ Interval, June 11, 1998.
(c) 1998 Interval Research Corporation
Thanks to Alain de Cheveigne' for his suggestions and improvements.
MAKEERBFILTERS
Syntax:
[fcoefs]=MAKEERBFILTERS(fs,numChannels,lowFreq)
Comments:
This function computes the filter coefficients for a bank of Gammatone filters. These filters were defined by Patterson and Holdworth for simulating the cochlea.
The result is returned as an array of filter coefficients. Each row of the filter arrays contains the coefficients for four second order filters. The transfer function for these four filters share the same denominator (poles) but have different numerators (zeros). All of these coefficients are assembled into one vector that the ERBFilterBank can take apart to implement the filter.
The filter bank contains "numChannels" channels that extend from half the sampling rate (fs) to "lowFreq". Alternatively, if the numChannels input argument is a vector, then the values of this vector are taken to be the center frequency of each desired filter. (The lowFreq argument is ignored in this case.)
Note this implementation fixes a problem in the original code by computing four separate second order filters. This avoids a big problem with round off errors in cases of very small cfs (100Hz) and large sample rates (44kHz). The problem is caused by roundoff error when a number of poles are combined, all very close to the unit circle. Small errors in the eigth order coefficient, are multiplied when the eigth root is taken to give the pole location. These small errors lead to poles outside the unit circle and instability. Thanks to Julius Smith for leading me to the proper explanation.
Execute the following code to evaluate the frequency response of a 10 channel filterbank.
fcoefs = MakeERBFilters(16000,10,100);
y = ERBFilterBank([1 zeros(1,511)], fcoefs);
resp = 20*log10(abs(fft(y')));
freqScale = (0:511)/512*16000;
semilogx(freqScale(1:255),resp(1:255,:));
axis([100 16000 60 0])
xlabel('Frequency (Hz)'); ylabel('Filter Response (dB)');
Rewritten by Malcolm Slaney@Interval. June 11, 1998.
(c) 1998 Interval Research Corporation
WPHASEDEV
Syntax:
[wpd, nwpd, dif, T, F] = WPHASEDEV(x,fs,Nfft,hopfac)
Comments:
Weighted phase difference of signal x.
Inputs:
x  input sound file
fs  sampling frequency
Nfft  size of fft analysis
hopfac  fft hop factor
Output:
wpd  weighted phase deviation
nwpd  normalized weighted phase deviation
dif  second derivative of the phase (difference in instantaneous frequency)
T,F  time and frequency axes. Good for plotting.
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
sf = SPECTRALFLUX(x,fs,Nfft,overlap)
Comments:
Spectral flux of a sound.
x  input sound
Nfft  size of fft analysis
overlap  overlap of fft frames
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
[pd, dif, T, F] = PHASEDEV(x,fs,Nfft,hopfac)
Comments:
Phase deviation of signal x.
Inputs:
x  input sound file
fs  sampling frequency
Nfft  size of fft analysis
hopfac  fft hop factor
Outputs:
pd  total phase deviation
dif  second derivative of the phase (difference in instantaneous frequency)
T,F  time and frequency axes. Good for plotting.
(c) Shlomo Dubnov sdubnov@ucsd.edu
Singular Value Clustering (SVD)
Homophonic
Polyphonic
4. Sequence Matching
Sequence matching deals with finding relationships and similarity structure between a reference sequence and a query signal.
Syntax:
[p,q,D] = DTW(S,in,del)
Comments:
S = similarity matrix between two seqeunces
ins,del = delition and insertion penalties
similarity is the cost of moving on diagonal (advancing together)
delition and insertions operations are defined for sequence 2, i.e.
delition is the cost of advancing on the columns index (seq. 2 speeding) and insertion is the const of advancing on rows index (seq. 2 waiting)
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
[p,q,D] = ALIGN(S,type,ins,del,thresh)
Comments:
S = similarity matrix between two seqeunces
ins,del = deletion and insertion penalties
thresh = threshold that applies to LCS and TWLCS methods only.
type = alignment method. Supported methods are:
GSA global sequence alignment
LSA local sequence alignment
LCS longest common subsequence
ASM approximate sequence matching
OLM overlap match
DTW dynamic time warping
TWLCS time warped longest common subsequence
similarity is the cost of moving on a diagonal (advancing together)
delition and insertions operations are defined in terms of operations
done on the second (horizontal) sequence, i.e. deletion is the horizontal
cost (accellerando / speed up) and insertion is the vertical cost (ritenuto / wait).
(c) Shlomo Dubnov sdubnov@ucsd.edu
5. Distance metrics
Common distance metrics on data used in algorithms.
Euclidean
KullbackLeibler
6.Selfsimilarity and clustering
A selfsimilar object is exactly or approximately similar to a part of itself, i.e., the whole has the same shape as one or more of the parts.Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait  often proximity according to some defined distance measure. [wikipedia]
Recurrence/similarity matrix
Kmeans
Statistical classification is a statistical procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc) and based on a training set of previously labeled items.
GMM
HMM
SVM