Computer Audition Lab 500 (CAL500) data set

- 500 songs performed by 500 unique artists
- each song has been annotated by at least 3 people using a standard survey

Douglas Turnbull, Luke Barrington, David Torres, and Gert Lanckriet
Computer Audition Lab
UC San Diego
May 2007

You can find the original text of the relevant paper at
http://eceweb.ucsd.edu/~gert/papers/TASLP-08.pdf.

Here are some more details about the data set.

SampleForm.htm
  - example of the web-based musical survey

songNames.txt
  - list of 502 songs

vocab.txt
  - list of 174 words (unigrams and bi-grams)

hardAnnotations.txt
  - 502 x 174 binary matrix (as a comma-separated file)
  - 1 if at least 80% of the test subjects label the song with the word
    AND a minimum of 2 test subjects label the song with the word
  - 0 otherwise

softAnnotations.txt
  - 502 x 174 real-valued matrix
  - proportion of users who considered a song indicative of a word

annotations/
  - directory of 502 raw annotation files

deltas/
  - directory of 502 feature files
  - each song is represented as a bag of feature vectors
  - each feature vector is a 39-dimensional delta cepstrum vector
    (see "Audio Representation" below)

Music Corpus:

The 500 songs were picked from the authors' personal collection of western
popular music recorded within the last 50 years. We picked one song at random
to represent each of a diverse set of musicians. A subset of the songs was
taken from the Magnatune data set that was used in the 2005 MIREX data mining
contest. To our knowledge, these songs are copyright-cleared for the purpose
of music information retrieval research and may be downloaded free of charge
from the 2005 MIREX wiki: http://www.music-ir.org/mirex2005/

Audio Representation:

We represent the audio with a time series of delta cepstrum feature vectors.
We extract a time series of the first 13 Mel-frequency cepstral coefficients
(MFCCs) by sliding a 12 msec half-overlapping short-time window over the
waveform data file for each song. We compute a delta cepstrum vector by
appending the instantaneous first and second derivatives of each MFCC to the
vector of MFCCs. The result is approximately 10,000 39-dimensional feature
vectors per minute of audio content. We randomly sub-sample this set of delta
cepstrum feature vectors to represent each song with exactly 10,000 feature
vectors. It is not possible to reconstruct the original audio waveform from
this audio feature representation. (A sketch of a comparable extraction is
given below, after the Semantic Representation section.) MATLAB-readable
ASCII files containing the delta cepstrum features for each song are in the
"/delta/" directory and are named:

  artist_name-song_name.delta

Semantic Representation:

We consider 135 musically-relevant concepts spanning six semantic categories:

- 29 instruments were annotated as present in the song or not;
- 22 vocal characteristics were annotated as relevant to the singer or not;
- 36 genres, a subset of the Codaich genre list, were annotated as relevant
  to the song or not;
- 18 emotions, found by Skowronek et al. (2006) to be both important and easy
  to identify, were rated on a scale from one to three (e.g., "not happy",
  "neutral", "happy");
- 15 song concepts describing the acoustic qualities of the song, artist, and
  recording (e.g., tempo, energy, sound quality);
- 15 usage terms (e.g., "I would listen to this song while driving, sleeping,
  etc.").

An example of the HTML form presented to annotators is in the file
sample_form.html. A list of the semantic concepts used to annotate each song
is in the file vocab.txt.
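As promised above, here is a minimal sketch of how comparable delta cepstrum
features could be computed in Python. It is not the authors' extraction code:
librosa itself, the 22,050 Hz sample rate, the 256-sample (roughly 12 msec)
window, the 128-sample hop, and the 40 mel bands are all assumptions.

    import numpy as np
    import librosa

    def delta_cepstrum(path, n_vectors=10000, sr=22050, seed=0):
        """Bag of 39-dimensional delta cepstrum vectors for one song."""
        y, _ = librosa.load(path, sr=sr)
        # First 13 MFCCs from half-overlapping short-time windows:
        # 256 samples at 22,050 Hz is roughly a 12 msec window.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=256, hop_length=128, n_mels=40)
        # Append the instantaneous first and second derivatives.
        d1 = librosa.feature.delta(mfcc, order=1)
        d2 = librosa.feature.delta(mfcc, order=2)
        feats = np.vstack([mfcc, d1, d2]).T        # shape (n_frames, 39)
        # Randomly sub-sample to a fixed-size bag of feature vectors.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(feats), size=n_vectors,
                         replace=len(feats) < n_vectors)
        return feats[idx]

With this hop size the sketch yields roughly 10,300 frames per minute of
audio, consistent with the "approximately 10,000 per minute" figure quoted
above.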
We paid 66 undergraduate students to annotate the CAL500 corpus with semantic
concepts from our vocabulary. Participants were rewarded $10 for a one-hour
annotation block spent listening to MP3-encoded music through headphones in a
university computer laboratory. The annotation interface was an HTML form
loaded in a web browser; participants simply clicked on check boxes and radio
buttons. The form was not presented during the first 30 seconds of playback,
to encourage undistracted listening. Listeners could advance and rewind the
music, and the song repeated until all semantic categories were annotated.
Each annotation took about 5 minutes, and most participants reported that the
listening and annotation experience was enjoyable. We collected at least 3
semantic annotations for each of the 500 songs in our music corpus, 1708
annotations in total. Text files with the raw annotations are in the
"/annotations/" directory and are named:

  artist_name-song_name-annotation_number.txt

We expand the set of 135 survey concepts to a set of 237 'words' by mapping
each bipolar concept to two individual words. For example, the five degrees
of the concept 'Energy Level' were mapped to the words 'Low Energy' and
'High Energy'. The resulting collection of human annotations uses a vector of
numbers to express the response of a human listener to each semantic word.
For each word, the annotation vector takes the value +1 if the human
annotator considers the song indicative of the word, -1 if not, and 0 if
unsure. We combine all the human annotations for each song into a single
annotation vector for that song by observing the level of agreement over all
annotators. The final semantic weight for a song/word pair is:

  weight(song, word) = max(0, (#positive votes - #negative votes) / #annotations)

For example, if four listeners labeled a given song/word pair with +1, +1, 0,
and -1, then the weight is (2 - 1) / 4 = 1/4. This data is stored as a
comma-separated, MATLAB-readable (function 'dlmread') ASCII file:
softAnnotations.txt

For evaluation purposes, we also create 'ground truth' binary annotation
vectors. We label a song with a word if a minimum of two people express an
opinion and there is at least 80% agreement between all listeners. We then
prune all words that are associated with fewer than five songs, reducing our
vocabulary from 237 to 174 words. This data is stored as a comma-separated,
MATLAB-readable (function 'dlmread') ASCII file: hardAnnotations.txt
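To make the two aggregation rules concrete, here is a minimal Python sketch.
It reflects our reading of the text rather than the authors' code; in
particular, excluding the 'unsure' (0) votes from the 80% agreement
computation is an assumption.

    def soft_weight(votes):
        # votes: one song/word pair's responses, each +1 (indicative),
        # -1 (not indicative), or 0 (unsure).
        pos = sum(v == +1 for v in votes)
        neg = sum(v == -1 for v in votes)
        return max(0.0, (pos - neg) / len(votes))

    def hard_label(votes):
        # Assumption: "80% agreement" is measured over the annotators
        # who expressed an opinion, i.e. excluding the 0 (unsure) votes.
        pos = sum(v == +1 for v in votes)
        neg = sum(v == -1 for v in votes)
        return int(pos >= 2 and pos / (pos + neg) >= 0.8)

    print(soft_weight([+1, +1, 0, -1]))   # 0.25, the 1/4 example above
    print(hard_label([+1, +1, 0, -1]))    # 0: only 2 of 3 opinions agree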
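Outside of MATLAB, the same files load easily; below is a sketch using numpy,
where numpy.loadtxt stands in for dlmread. The relative paths assume the
archive has been unpacked into the current working directory.

    import numpy as np

    hard = np.loadtxt('hardAnnotations.txt', delimiter=',')  # 502 x 174, binary
    soft = np.loadtxt('softAnnotations.txt', delimiter=',')  # 502 x 174, in [0, 1]

    with open('songNames.txt') as f:
        songs = [line.strip() for line in f]
    with open('vocab.txt') as f:
        vocab = [line.strip() for line in f]

    # e.g., every word attached to the first song under the hard annotations:
    print(songs[0], [w for w, v in zip(vocab, hard[0]) if v == 1])

    # The per-song feature files are plain ASCII as well; whitespace
    # delimiting is assumed here, and the file name is a placeholder:
    # feats = np.loadtxt('deltas/artist_name-song_name.delta')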