Computer Audition Toolbox

documentation

            

The CATbox is maintained by Shlomo Dubnov and Mehrdad Yazdani.

 

1. Data i/o

WAVREAD/WRITE

MP3READ/WRITE

MIDIREAD/WRITE

2. Representation and some applications

STFT

Const Q

Inst freq

AR models (LPC)

Sinusoidal models (YASA)

3. Features

MFCC

ERB filters

Onset Detection

Spectral Envelop

 

4. Sequence Matching

Time-warping

Factor Oracle

Suffix Trees

5. Distance Metrics

Euclidean

Kullback-Leibler

6.Self-similarity and clustering

Recurrence/similarity matrix

K-means

7. Classifiers

GMM

HMM

SVM


1. Data i/o

This section deals with reading/writing digital audio in MATLAB. The file types that we consider are WAV, MP3, and MIDI.

 

WAVREAD

Syntax:

y = WAVREAD(filename)
[y,Fs,bits] = WAVREAD (filename)
[...] = WAVREAD (filename,N)
[...] = WAVREAD (filename,[N1 N2])
siz = WAVREAD (filename,'size') 

Comments:  

wavread supports multichannel data, with up to 32 bits per sample, and supports reading 24- and 32-bit .wav files.

y = wavread(filename) loads a WAVE file specified by filename, returning the sampled data in y. The filename input is a string enclosed in single quotes. The .wav extension is appended if no extension is given. Amplitude values are in the range [-1,+1].

[y,Fs,bits] = wavread(filename) returns the sample rate (Fs) in Hertz and the number of bits per sample (bits) used to encode the data in the file.

[...] = wavread(filename,N) returns only the first N samples from each channel in the file.

[...] = wavread(filename,[N1 N2]) returns only samples N1 through N2 from each channel in the file.

siz = wavread(filename,'size') returns the size of the audio data contained in the file in place of the actual audio data, returning the vector siz = [samples channels].

 

WAVWRITE

Syntax:

WAVWRITE(y,filename)
WAVWRITE(y,Fs,filename)
WAVWRITE (y,Fs,N,filename)

Comments:

wavwrite writes data to 8-, 16-, 24-, and 32-bit .wav files.

wavwrite(y,filename) writes the data stored in the variable y to a WAVE file called filename. The filename input is a string enclosed in single quotes. The data has a sample rate of 8000 Hz and is assumed to be 16-bit. Each column of the data represents a separate channel. Therefore, stereo data should be specified as a matrix with two columns. Amplitude values outside the range [-1,+1] are clipped prior to writing.

wavwrite(y,Fs,filename) writes the data stored in the variable y to a WAVE file called filename. The data has a sample rate of Fs Hz and is assumed to be 16-bit. Amplitude values outside the range [-1,+1] are clipped prior to writing.

wavwrite(y,Fs,N,filename) writes the data stored in the variable y to a WAVE file called filename. The data has a sample rate of Fs Hz and is N-bit, where N is 8, 16, 24, or 32. For N < 32, amplitude values outside the range [-1,+1] are clipped.

Note   8-, 16-, and 24-bit files are type 1 integer pulse code modulation (PCM). 32-bit files are written as type 3 normalized floating point.

 

MP3READ

Syntax:

[Y,FS,NBITS,OPTS] = MP3READ(FILE,N,MONO,DOWNSAMP)

 Comments:

 Reads MP3 audio file.

 FILE is an MP3 audio file name. 

 Optional scalar N limits the number of frames to read, or, if a two-element vector, specifies the first and last frames to read.

 Optional flag MONO forces result to be mono if nonzero.      

 Optional factor DOWNSAMP performs downsampling by 2 or 4.

 Y returns the audio data, one channel per column; FS is the sampling rate.  NBITS is the bit depth (always 16). OPTS.fmt is a format info string.

 File history:

 2003-07-20 dpwe@ee.columbia.edu - This version calls mpg123.

 2004-08-31 Fixed to read whole files correctly

 2004-09-08 Uses mp3info to get info about mp3 files too

 2004-09-18 Reports all mp3info fields in OPTS.fmt; handles MPG2LSF sizes + added  MONO, DOWNSAMP flags, changed default behavior.

 2005-04-11 added -r m for report of median VBR in mp3info (thx Merrie Morris)

 2006-08-6 Modified by Shlomo Dubnov to allow empty second argument N

 $Header: /homes/dpwe/matlab/columbiafns/RCS/mp3read.m,v 1.7 2005/04/11 20:55:16 dpwe Exp $

  

MP3WRITE

Syntax:

MP3WRITE(Y,FS,NBITS, FILE)

Comments:

mp3write(D,SR,FILE,OPTIONS)

Waveform data from D at sample rate SR is written to a file named FILE as MP3 compressed audio.  OPTIONS  specifies flags to lame encoder (default '--quiet -h');

File History:

2005-11-10 Original version

 

READMIDI

Syntax:

NMAT = READMIDI(MIDIFILENAME)

Comments:

Conversion of MIDI file to note matrix.

Input argument:

            FILENAME = name of the MIDI file (string)

Output:

            NMAT = notematrix

 

WRITEMIDI

Syntax:

WRITEMIDI(NMAT, MIDIFILENAME, TPQ, TEMPO, TSIG1, TSIG2)

Comments:

Creates a MIDI file from a NMAT

Input arguments: NMAT = notematrix

OFNAME = Output filename (*.mid)

TPQ (Optional) = Ticks per quarter note (default 120)

TEMPO (Optional) = bpm, beats per minute (default 100)

TSIG1&2 (Optional) = Time-signature, e.g. 6/8 -> TSIG1 = 6, TSIG2 = 8 (default 4)

Output: MIDI file

Remarks: TEXT2MIDI converter needs to be handled differently in PC and Mac.

Example: writemidi(a,'demo.mid'); creates a file name DEMO.MID from notematrix A with default settings.

 

2. Representation and some applications

Using representations of audio/music with digital formats such as WAV and MIDI enables us to derive many “higher level” representations. This section outlines classical Fourier-based techniques such as the short-time Fourier transform and newer methods such as the chromagram.

 

Short-Time Fourier Transform and Spectrum Estimation:

STFT

Syntax:

[B] =  STFT(winlen,overlap,nfft)

Comments:

This function calculates the STFT for a samples

a - sample vector

win - window samples or window size (for hanning window)

overlap - overlap in samples

nfft - fft size (could be >= winlen)

(c) Shlomo Dubnov sdubnov@ucsd.edu

 

ISTFT

Syntax:

xr =  ISTFT(X,hopfac,winlen)

Comments:

This function calculates the inverse STFT for a STFT matrix

X - STFT matrix (bins 1:nfft/2+1)

hopfac - hop factor. This is an integer specifying the number of analysis hops occurring within a signal frame that is one window in length. In other words, winlen/hopfac is the hop size in saamples and winlen*(1-1/hopfac) is the overlap in samples.

winlen - window length in samples

type - 'perfect' or 'smooth'

(c) Shlomo Dubnov sdubnov@ucsd.edu

 

ISTFTW2

Syntax:

            [a] =  ISTFTW2 (B,hopfactor,window)

Comments:

>>?????????????????????????<<

 

OLA

Syntax:

x = OLA(xmat,hop,win);

Comments:

Overlap add with extra smoothing. This window is used for extra smoothing.

xmat - input matrix containing signal frames

hop - analysis hop size in samples

win - posterior windowing function (window samples). Default hanning.

 

TIMESTRECH

 

Syntax:

xi = TIMESTRECH(x,r)

Comments:

Time stretch a sound vector x by factor r.

 

PITCHSHIFT

Syntax:

xr = PITCHSHIFT(x,r)

Comments:

Pitch shift a sound vector x by factor r.

 

PVINTERP

Syntax:

[X2,instfreq] = PVINTERP(X, t, hop)

Comments:

interpolate STFT matrix X according to new analysis frame indices

t starts at 1 and ends at the last column of X

[rec , D] = LSEE(y1,window,hopfactor) (“hear Pollock!”)

 

Instantaneous frequency:

GETINSTF

Syntax:

out = GETINSTF(x,fs,Nfft)

Comments:

Interactive applications that allows to read the instantaneous frequency by clicking a spectrogram.

x - input sound file

fs - sampling frequency

Nfft - fft analysis

(c) Shlomo Dubnov sdubnov@ucsd.edu

 

IFGRAM

Syntax:

[F,D] = IFGRAM(X, N, W, H, SR)

Comments:

Instantaneous frequency by phase deriv.

X is a 1-D signal.  Process with N-point FFTs applying a W-point window, stepping by H points; return (N/2)+1 channels with the instantaneous frequency (as a proportion of the sampling rate) obtained as the time-derivative of the phase of the complex spectrum as described by Toshihiro Abe et al in ICASSP'95, Eurospeech'97

Same arguments and some common code as dpwebox/stft.m.

Calculates regular STFT as side effect - returned in D.

after 1998may02 dpwe@icsi.berkeley.edu

2001-03-05 dpwe@ee.columbia.edu  revised version

2001-12-13 dpwe@ee.columbia.edu  Fixed to work when N != W

 

INSTF

Syntax:

[a,f,t,p] = instf(x,fs,Nfft,hopfac)

Comments:

Instantaneous Frequency using hanning window and hop 1 method

x - input sound

fs - sampling frequency

Nfft - fft size, equal to window size (default 512)

hopfac - hop factor (2 means 50% overlap, 4 means 75% overlap, default 4)

a - amplitude

f - instanteneous frequency

t - time

p - phasor (fft with unit magnitude)

(c) Shlomo Dubnov sdubnov@ucsd.edu

 

Constant Q Transform:

 

CONSTQ

Syntax:

           cq= CONSTQ(x, sparKernel)

Comments:

>>?????????????????????????<<

 

CQGRAM

Syntax:

[CQX, F, T] = CQGRAM(x, winlen, overlap, minFreq, maxFreq, bins, fs, thresh)

 Comments:

Makes a running constant Q analysis of a sound

winlen - size of signal frames submited to analysis in samples

overlap - size of overlap between frames in samples

minFreq, maxFreq, bins - constant Q analysis parameters: the range of frequencies

[minFreq maxFreq] is divided into "bins" number of bins per octave

fs - sampling frequency

thresh - threshold for Kernel calculation.

slowQ, constQ, sparseKernel by Benjamin Blankertz (slight modifications by Shlomo Dubnov).

See also Benjamin Blankertz paper.

(c) Shlomo Dubnov sdubnov@ucsd.edu

 

SLOWQ

Syntax:

cq = SLOWQ(x, minFreq, maxFreq, bins, fs)

 Comments:

 x - input sound

minFreq, maxFreq, bins - the range of frequencies [minFreq maxFreq] is

divided into "bins" number of bins per octave

Slow implementation of constant Q transfrom from Benjamin Blankertz

 

SPARSEKERNEL

Syntax:

            sparKernel= sparseKernel(minFreq, maxFreq, bins, fs, thresh)

Comments:

>>?????????????????????????<<

 

Sinusoidal Models (YASA):

 

>>?????????????????????????<<

 

Source-Filter/Auto-Regressive Models:

 

LPC

Parametric methods

 

 

3. Features

 

The following are common features extracted from music/audio used in machine learning and signal processing applications.

 

MFCC

Syntax:

[ceps,freqresp,fb,fbrecon,freqrecon] = MFCC(input, samplingRate, [frameRate])

Comments:

Find the cepstral coefficients (ceps) corresponding to the input.  Four other quantities are optionally returned that represent: the detailed fft magnitude (freqresp) used in MFCC calculation, the mel-scale filter bank output (fb), the filter bank output by inverting the cepstrals with a cosine transform (fbrecon), the smooth frequency response by interpolating the fb reconstruction (freqrecon)

 

-- Malcolm Slaney, August 1993

Modified a bit to make testing an algorithm easier... 4/15/94

Fixed Cosine Transform (indices of cos() were swapped) - 5/26/95

Added optional frameRate argument - 6/8/95

Added proper filterbank reconstruction using inverse DCT - 10/27/95

Added filterbank inversion to reconstruct spectrum - 11/1/95

(c) 1998 Interval Research Corporation 

 

ERBFILTERBANK

Syntax:

function output = ERBFilterBank(x, fcoefs)

Comments:

Process an input waveform with a gammatone filter bank. This function takes a single sound vector, and returns an array of filter outputs, one channel per row.

The fcoefs parameter, which completely specifies the Gammatone filterbank, should be designed with the MakeERBFilters function.  If it is omitted, the filter coefficients are computed for you assuming a 22050Hz sampling rate and 64 filters regularly spaced on an ERB scale from fs/2 down to 100Hz.

Malcolm Slaney @ Interval, June 11, 1998.

(c) 1998 Interval Research Corporation 

Thanks to Alain de Cheveigne' for his suggestions and improvements.

 

MAKEERBFILTERS

Syntax:

[fcoefs]=MAKEERBFILTERS(fs,numChannels,lowFreq)

Comments:

This function computes the filter coefficients for a bank of Gammatone filters. These filters were defined by Patterson and Holdworth for simulating the cochlea. 

The result is returned as an array of filter coefficients.  Each row of the filter arrays contains the coefficients for four second order filters.  The transfer function for these four filters share the same denominator (poles) but have different numerators (zeros).  All of these coefficients are assembled into one vector that the ERBFilterBank can take apart to implement the filter.

The filter bank contains "numChannels" channels that extend from half the sampling rate (fs) to "lowFreq".  Alternatively, if the numChannels input argument is a vector, then the values of this vector are taken to be the center frequency of each desired filter.  (The lowFreq argument is ignored in this case.)

Note this implementation fixes a problem in the original code by computing four separate second order filters.  This avoids a big problem with round off errors in cases of very small cfs (100Hz) and large sample rates (44kHz).  The problem is caused by roundoff error when a number of poles are combined, all very close to the unit circle.  Small errors in the eigth order coefficient, are multiplied when the eigth root is taken to give the pole location.  These small errors lead to poles outside the unit circle and instability.  Thanks to Julius Smith for leading me to the proper explanation.

 Execute the following code to evaluate the frequency response of a 10 channel filterbank.

            fcoefs = MakeERBFilters(16000,10,100);

            y = ERBFilterBank([1 zeros(1,511)], fcoefs);

            resp = 20*log10(abs(fft(y')));

            freqScale = (0:511)/512*16000;

            semilogx(freqScale(1:255),resp(1:255,:));

            axis([100 16000 -60 0])

            xlabel('Frequency (Hz)'); ylabel('Filter Response (dB)');

 

Rewritten by Malcolm Slaney@Interval.  June 11, 1998.

(c) 1998 Interval Research Corporation 

  

Temporal Extraction; Onset Detection:

 

WPHASEDEV

Syntax:

[wpd, nwpd, dif, T, F] = WPHASEDEV(x,fs,Nfft,hopfac)

Comments:

Weighted phase difference of signal x.

Inputs:

x - input sound file

fs - sampling frequency

Nfft - size of fft analysis

hopfac - fft hop factor

Output:

wpd - weighted phase deviation

nwpd - normalized weighted phase deviation

dif - second derivative of the phase (difference in instantaneous frequency)

T,F - time and frequency axes. Good for plotting.

(c) Shlomo Dubnov sdubnov@ucsd.edu

 

SPECTRALFLUX

Syntax:

sf = SPECTRALFLUX(x,fs,Nfft,overlap)

 Comments:

 Spectral flux of a sound.

 x - input sound

Nfft - size of fft analysis

overlap - overlap of fft frames

(c) Shlomo Dubnov sdubnov@ucsd.edu

 

PHASEDEV

Syntax:

[pd, dif, T, F] = PHASEDEV(x,fs,Nfft,hopfac)

Comments:

Phase deviation of signal x.

Inputs:

x - input sound file

fs - sampling frequency

Nfft - size of fft analysis

hopfac - fft hop factor

Outputs:

pd - total phase deviation

dif - second derivative of the phase (difference in instantaneous frequency)

T,F - time and frequency axes. Good for plotting.

(c) Shlomo Dubnov sdubnov@ucsd.edu

 

Segmentation

 

Singular Value Clustering (SVD)

 

Pitch Extraction

 

Homophonic

Polyphonic

 

4. Sequence Matching

Sequence matching deals with finding relationships and similarity structure between a reference sequence and a query signal.

 

DTW

Syntax:

            [p,q,D] = DTW(S,in,del)

Comments:

S = similarity matrix between two seqeunces

ins,del = delition and insertion penalties

similarity is the cost of moving on diagonal (advancing together)

delition and insertions operations are defined for sequence 2, i.e.

delition is the cost of advancing on the columns index (seq. 2 speeding) and insertion is the const of advancing on rows index (seq. 2 waiting)

(c) Shlomo Dubnov sdubnov@ucsd.edu

 

ALIGN

Syntax:

[p,q,D] = ALIGN(S,type,ins,del,thresh)

Comments:

S = similarity matrix between two seqeunces

ins,del = deletion and insertion penalties

thresh = threshold that applies to LCS and TWLCS methods only.

type = alignment method. Supported methods are:

GSA global sequence alignment

LSA local sequence alignment

LCS longest common subsequence

ASM approximate sequence matching

OLM overlap match

DTW dynamic time warping

TWLCS time warped longest common subsequence

similarity is the cost of moving on a diagonal (advancing together)

delition and insertions operations are defined in terms of operations

done on the second (horizontal) sequence, i.e. deletion is the horizontal

cost (accellerando / speed up) and insertion is the vertical cost (ritenuto / wait).

(c) Shlomo Dubnov sdubnov@ucsd.edu

 

 

5. Distance metrics

Common distance metrics on data used in algorithms.

 

Euclidean

Kullback-Leibler

 

6.Self-similarity and clustering

 

A self-similar object is exactly or approximately similar to a part of itself, i.e., the whole has the same shape as one or more of the parts.Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. [wikipedia]

 

Recurrence/similarity matrix

K-means

 

7.Classifiers

 

Statistical classification is a statistical procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc) and based on a training set of previously labeled items.

 

GMM

HMM

SVM