Computer Audition Toolbox documentation |
|
The CATbox is maintained by Shlomo Dubnov and Mehrdad Yazdani.
WAVREAD/WRITE MP3READ/WRITE MIDIREAD/WRITE 2. Representation and some applications STFT Const Q Inst freq AR models (LPC) Sinusoidal models (YASA) MFCC ERB filters Onset Detection Spectral Envelop
|
Time-warping Factor Oracle Suffix Trees Euclidean Kullback-Leibler 6.Self-similarity and clustering Recurrence/similarity matrix K-means GMM HMM SVM |
1. Data i/o
This section deals with reading/writing digital audio in MATLAB. The file types that we consider are WAV, MP3, and MIDI.
WAVREAD
Syntax:
y =
WAVREAD(filename)
[y,Fs,bits] = WAVREAD (filename)
[...] = WAVREAD
(filename,N)
[...] = WAVREAD (filename,[N1 N2])
siz = WAVREAD
(filename,'size')
Comments:
wavread supports multichannel data, with up to 32 bits per sample, and supports reading 24- and 32-bit .wav files.
y = wavread(filename) loads a WAVE file specified by filename, returning the sampled data in y. The filename input is a string enclosed in single quotes. The .wav extension is appended if no extension is given. Amplitude values are in the range [-1,+1].
[y,Fs,bits] = wavread(filename) returns the sample rate (Fs) in Hertz and the number of bits per sample (bits) used to encode the data in the file.
[...] = wavread(filename,N) returns only the first N samples from each channel in the file.
[...] = wavread(filename,[N1 N2]) returns only samples N1 through N2 from each channel in the file.
siz = wavread(filename,'size') returns the size of the audio data contained in the file in place of the actual audio data, returning the vector siz = [samples channels].
WAVWRITE
Syntax:
WAVWRITE(y,filename)
WAVWRITE(y,Fs,filename)
WAVWRITE
(y,Fs,N,filename)
Comments:
wavwrite writes data to 8-, 16-, 24-, and 32-bit .wav files.
wavwrite(y,filename) writes the data stored in the variable y to a WAVE file called filename. The filename input is a string enclosed in single quotes. The data has a sample rate of 8000 Hz and is assumed to be 16-bit. Each column of the data represents a separate channel. Therefore, stereo data should be specified as a matrix with two columns. Amplitude values outside the range [-1,+1] are clipped prior to writing.
wavwrite(y,Fs,filename) writes the data stored in the variable y to a WAVE file called filename. The data has a sample rate of Fs Hz and is assumed to be 16-bit. Amplitude values outside the range [-1,+1] are clipped prior to writing.
wavwrite(y,Fs,N,filename) writes the data stored in the variable y to a WAVE file called filename. The data has a sample rate of Fs Hz and is N-bit, where N is 8, 16, 24, or 32. For N < 32, amplitude values outside the range [-1,+1] are clipped.
Note 8-, 16-, and 24-bit files are type 1 integer pulse code modulation (PCM). 32-bit files are written as type 3 normalized floating point.
MP3READ
Syntax:
[Y,FS,NBITS,OPTS] = MP3READ(FILE,N,MONO,DOWNSAMP)
Comments:
Reads MP3 audio file.
FILE is an MP3 audio file name.
Optional scalar N limits the number of frames to read, or, if a two-element vector, specifies the first and last frames to read.
Optional flag MONO forces result to be mono if nonzero.
Optional factor DOWNSAMP performs downsampling by 2 or 4.
Y returns the audio data, one channel per column; FS is the sampling rate. NBITS is the bit depth (always 16). OPTS.fmt is a format info string.
File history:
2003-07-20 dpwe@ee.columbia.edu - This version calls mpg123.
2004-08-31 Fixed to read whole files correctly
2004-09-08 Uses mp3info to get info about mp3 files too
2004-09-18 Reports all mp3info fields in OPTS.fmt; handles MPG2LSF sizes + added MONO, DOWNSAMP flags, changed default behavior.
2005-04-11 added -r m for report of median VBR in mp3info (thx Merrie Morris)
2006-08-6 Modified by Shlomo Dubnov to allow empty second argument N
$Header: /homes/dpwe/matlab/columbiafns/RCS/mp3read.m,v 1.7 2005/04/11 20:55:16 dpwe Exp $
MP3WRITE
Syntax:
MP3WRITE(Y,FS,NBITS, FILE)
Comments:
mp3write(D,SR,FILE,OPTIONS)
Waveform data from D at sample rate SR is written to a file named FILE as MP3 compressed audio. OPTIONS specifies flags to lame encoder (default '--quiet -h');
File History:
2005-11-10 Original version
READMIDI
Syntax:
NMAT = READMIDI(MIDIFILENAME)
Comments:
Conversion of MIDI file to note matrix.
Input argument:
FILENAME = name of the MIDI file (string)
Output:
NMAT = notematrix
WRITEMIDI
Syntax:
WRITEMIDI(NMAT, MIDIFILENAME, TPQ, TEMPO, TSIG1, TSIG2)
Comments:
Creates a MIDI file from a NMAT
Input arguments: NMAT = notematrix
OFNAME = Output filename (*.mid)
TPQ (Optional) = Ticks per quarter note (default 120)
TEMPO (Optional) = bpm, beats per minute (default 100)
TSIG1&2 (Optional) = Time-signature, e.g. 6/8 -> TSIG1 = 6, TSIG2 = 8 (default 4)
Output: MIDI file
Remarks: TEXT2MIDI converter needs to be handled differently in PC and Mac.
Example: writemidi(a,'demo.mid'); creates a file name DEMO.MID from notematrix A with default settings.
2. Representation and some applications
Using representations of audio/music with digital formats such as WAV and MIDI enables us to derive many “higher level” representations. This section outlines classical Fourier-based techniques such as the short-time Fourier transform and newer methods such as the chromagram.
Short-Time Fourier Transform and Spectrum Estimation:
STFT
Syntax:
[B] = STFT(winlen,overlap,nfft)
Comments:
This function calculates the STFT for a samples
a - sample vector
win - window samples or window size (for hanning window)
overlap - overlap in samples
nfft - fft size (could be >= winlen)
(c) Shlomo Dubnov sdubnov@ucsd.edu
ISTFT
Syntax:
xr = ISTFT(X,hopfac,winlen)
Comments:
This function calculates the inverse STFT for a STFT matrix
X - STFT matrix (bins 1:nfft/2+1)
hopfac - hop factor. This is an integer specifying the number of analysis hops occurring within a signal frame that is one window in length. In other words, winlen/hopfac is the hop size in saamples and winlen*(1-1/hopfac) is the overlap in samples.
winlen - window length in samples
type - 'perfect' or 'smooth'
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
[a] = ISTFTW2 (B,hopfactor,window)
Comments:
>>?????????????????????????<<
Syntax:
x = OLA(xmat,hop,win);
Comments:
Overlap add with extra smoothing. This window is used for extra smoothing.
xmat - input matrix containing signal frames
hop - analysis hop size in samples
win - posterior windowing function (window samples). Default hanning.
xi = TIMESTRECH(x,r)
Comments:
Time stretch a sound vector x by factor r.
Syntax:
xr = PITCHSHIFT(x,r)
Comments:
Pitch shift a sound vector x by factor r.
Syntax:
[X2,instfreq] = PVINTERP(X, t, hop)
Comments:
interpolate STFT matrix X according to new analysis frame indices
t starts at 1 and ends at the last column of X
[rec , D] = LSEE(y1,window,hopfactor) (“hear Pollock!”)
Instantaneous frequency:
Syntax:
out = GETINSTF(x,fs,Nfft)
Comments:
Interactive applications that allows to read the instantaneous frequency by clicking a spectrogram.
x - input sound file
fs - sampling frequency
Nfft - fft analysis
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
[F,D] = IFGRAM(X, N, W, H, SR)
Comments:
Instantaneous frequency by phase deriv.
X is a 1-D signal. Process with N-point FFTs applying a W-point window, stepping by H points; return (N/2)+1 channels with the instantaneous frequency (as a proportion of the sampling rate) obtained as the time-derivative of the phase of the complex spectrum as described by Toshihiro Abe et al in ICASSP'95, Eurospeech'97
Same arguments and some common code as dpwebox/stft.m.
Calculates regular STFT as side effect - returned in D.
after 1998may02 dpwe@icsi.berkeley.edu
2001-03-05 dpwe@ee.columbia.edu revised version
2001-12-13 dpwe@ee.columbia.edu Fixed to work when N != W
Syntax:
[a,f,t,p] = instf(x,fs,Nfft,hopfac)
Comments:
Instantaneous Frequency using hanning window and hop 1 method
x - input sound
fs - sampling frequency
Nfft - fft size, equal to window size (default 512)
hopfac - hop factor (2 means 50% overlap, 4 means 75% overlap, default 4)
a - amplitude
f - instanteneous frequency
t - time
p - phasor (fft with unit magnitude)
(c) Shlomo Dubnov sdubnov@ucsd.edu
CONSTQ
Syntax:
cq= CONSTQ(x, sparKernel)
Comments:
>>?????????????????????????<<
CQGRAM
Syntax:
[CQX, F, T] = CQGRAM(x, winlen, overlap, minFreq, maxFreq, bins, fs, thresh)
Comments:
Makes a running constant Q analysis of a sound
winlen - size of signal frames submited to analysis in samples
overlap - size of overlap between frames in samples
minFreq, maxFreq, bins - constant Q analysis parameters: the range of frequencies
[minFreq maxFreq] is divided into "bins" number of bins per octave
fs - sampling frequency
thresh - threshold for Kernel calculation.
slowQ, constQ, sparseKernel by Benjamin Blankertz (slight modifications by Shlomo Dubnov).
See also Benjamin Blankertz paper.
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
cq = SLOWQ(x, minFreq, maxFreq, bins, fs)
Comments:
x - input sound
minFreq, maxFreq, bins - the range of frequencies [minFreq maxFreq] is
divided into "bins" number of bins per octave
Slow implementation of constant Q transfrom from Benjamin Blankertz
Syntax:
sparKernel= sparseKernel(minFreq, maxFreq, bins, fs, thresh)
Comments:
>>?????????????????????????<<
Sinusoidal Models (YASA):
>>?????????????????????????<<
Source-Filter/Auto-Regressive Models:
LPC
Parametric methods
3. Features
The following are common features extracted from music/audio used in machine learning and signal processing applications.
MFCC
Syntax:
[ceps,freqresp,fb,fbrecon,freqrecon] = MFCC(input, samplingRate, [frameRate])
Comments:
Find the cepstral coefficients (ceps) corresponding to the input. Four other quantities are optionally returned that represent: the detailed fft magnitude (freqresp) used in MFCC calculation, the mel-scale filter bank output (fb), the filter bank output by inverting the cepstrals with a cosine transform (fbrecon), the smooth frequency response by interpolating the fb reconstruction (freqrecon)
-- Malcolm Slaney, August 1993
Modified a bit to make testing an algorithm easier... 4/15/94
Fixed Cosine Transform (indices of cos() were swapped) - 5/26/95
Added optional frameRate argument - 6/8/95
Added proper filterbank reconstruction using inverse DCT - 10/27/95
Added filterbank inversion to reconstruct spectrum - 11/1/95
(c) 1998 Interval Research Corporation
ERBFILTERBANK
Syntax:
function output = ERBFilterBank(x, fcoefs)
Comments:
Process an input waveform with a gammatone filter bank. This function takes a single sound vector, and returns an array of filter outputs, one channel per row.
The fcoefs parameter, which completely specifies the Gammatone filterbank, should be designed with the MakeERBFilters function. If it is omitted, the filter coefficients are computed for you assuming a 22050Hz sampling rate and 64 filters regularly spaced on an ERB scale from fs/2 down to 100Hz.
Malcolm Slaney @ Interval, June 11, 1998.
(c) 1998 Interval Research Corporation
Thanks to Alain de Cheveigne' for his suggestions and improvements.
MAKEERBFILTERS
Syntax:
[fcoefs]=MAKEERBFILTERS(fs,numChannels,lowFreq)
Comments:
This function computes the filter coefficients for a bank of Gammatone filters. These filters were defined by Patterson and Holdworth for simulating the cochlea.
The result is returned as an array of filter coefficients. Each row of the filter arrays contains the coefficients for four second order filters. The transfer function for these four filters share the same denominator (poles) but have different numerators (zeros). All of these coefficients are assembled into one vector that the ERBFilterBank can take apart to implement the filter.
The filter bank contains "numChannels" channels that extend from half the sampling rate (fs) to "lowFreq". Alternatively, if the numChannels input argument is a vector, then the values of this vector are taken to be the center frequency of each desired filter. (The lowFreq argument is ignored in this case.)
Note this implementation fixes a problem in the original code by computing four separate second order filters. This avoids a big problem with round off errors in cases of very small cfs (100Hz) and large sample rates (44kHz). The problem is caused by roundoff error when a number of poles are combined, all very close to the unit circle. Small errors in the eigth order coefficient, are multiplied when the eigth root is taken to give the pole location. These small errors lead to poles outside the unit circle and instability. Thanks to Julius Smith for leading me to the proper explanation.
Execute the following code to evaluate the frequency response of a 10 channel filterbank.
fcoefs = MakeERBFilters(16000,10,100);
y = ERBFilterBank([1 zeros(1,511)], fcoefs);
resp = 20*log10(abs(fft(y')));
freqScale = (0:511)/512*16000;
semilogx(freqScale(1:255),resp(1:255,:));
axis([100 16000 -60 0])
xlabel('Frequency (Hz)'); ylabel('Filter Response (dB)');
Rewritten by Malcolm Slaney@Interval. June 11, 1998.
(c) 1998 Interval Research Corporation
WPHASEDEV
Syntax:
[wpd, nwpd, dif, T, F] = WPHASEDEV(x,fs,Nfft,hopfac)
Comments:
Weighted phase difference of signal x.
Inputs:
x - input sound file
fs - sampling frequency
Nfft - size of fft analysis
hopfac - fft hop factor
Output:
wpd - weighted phase deviation
nwpd - normalized weighted phase deviation
dif - second derivative of the phase (difference in instantaneous frequency)
T,F - time and frequency axes. Good for plotting.
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
sf = SPECTRALFLUX(x,fs,Nfft,overlap)
Comments:
Spectral flux of a sound.
x - input sound
Nfft - size of fft analysis
overlap - overlap of fft frames
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
[pd, dif, T, F] = PHASEDEV(x,fs,Nfft,hopfac)
Comments:
Phase deviation of signal x.
Inputs:
x - input sound file
fs - sampling frequency
Nfft - size of fft analysis
hopfac - fft hop factor
Outputs:
pd - total phase deviation
dif - second derivative of the phase (difference in instantaneous frequency)
T,F - time and frequency axes. Good for plotting.
(c) Shlomo Dubnov sdubnov@ucsd.edu
Singular Value Clustering (SVD)
Homophonic
Polyphonic
4. Sequence Matching
Sequence matching deals with finding relationships and similarity structure between a reference sequence and a query signal.
Syntax:
[p,q,D] = DTW(S,in,del)
Comments:
S = similarity matrix between two seqeunces
ins,del = delition and insertion penalties
similarity is the cost of moving on diagonal (advancing together)
delition and insertions operations are defined for sequence 2, i.e.
delition is the cost of advancing on the columns index (seq. 2 speeding) and insertion is the const of advancing on rows index (seq. 2 waiting)
(c) Shlomo Dubnov sdubnov@ucsd.edu
Syntax:
[p,q,D] = ALIGN(S,type,ins,del,thresh)
Comments:
S = similarity matrix between two seqeunces
ins,del = deletion and insertion penalties
thresh = threshold that applies to LCS and TWLCS methods only.
type = alignment method. Supported methods are:
GSA global sequence alignment
LSA local sequence alignment
LCS longest common subsequence
ASM approximate sequence matching
OLM overlap match
DTW dynamic time warping
TWLCS time warped longest common subsequence
similarity is the cost of moving on a diagonal (advancing together)
delition and insertions operations are defined in terms of operations
done on the second (horizontal) sequence, i.e. deletion is the horizontal
cost (accellerando / speed up) and insertion is the vertical cost (ritenuto / wait).
(c) Shlomo Dubnov sdubnov@ucsd.edu
5. Distance metrics
Common distance metrics on data used in algorithms.
Euclidean
Kullback-Leibler
6.Self-similarity and clustering
A self-similar object is exactly or approximately similar to a part of itself, i.e., the whole has the same shape as one or more of the parts.Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. [wikipedia]
Recurrence/similarity matrix
K-means
Statistical classification is a statistical procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc) and based on a training set of previously labeled items.
GMM
HMM
SVM