Computer Audition Lab 10K (CAL10K) data set

- over 10,000 songs performed by 4,597 different artists, weakly labeled from a vocabulary of over 500 tags
- song-tag associations are mined from Pandora's website

Douglas Turnbull
Computer Audition Lab
UC San Diego
March 2010

You can find the original text of the relevant paper at
http://web.cs.swarthmore.edu/~turnbull/Papers/Tingle_Autotag_MIR10.pdf

Music Corpus:

The Swat10k data set contains 10,870 songs that are weakly-labeled using a tag vocabulary of 475 acoustic tags and 153 genre tags. These tags have all been harvested from Pandora's website and result from song annotations performed by expert musicologists involved with the Music Genome Project. We have attempted to collect at least 60 songs from 135 sub-genre radio stations that are produced by Pandora. All of the genres and sub-genres associated with a given song are considered genre tags. For each song in the data set, between 2 and 25 acoustic tags were downloaded from the Pandora music search engine. Because Pandora claims that their musicologists maintain a high level of agreement, we consider the song-tag annotations to be objective.

Semantic Representation:

We consider 628 musically-relevant concepts spanning two semantic categories:
- 475 acoustic characteristics
- 153 genres (18 genres, and 135 sub-genres)

We provide 4 tar-balls with annotations for 4 different experiments:
1) only acoustic characteristic tags (5 fold cross validation)
2) only genre tags (5 fold cross validation)
3) both acoustic and genre tags
4) training on CAL10K and testing on CAL500 (5 test-folds). Annotations are restricted to the 55 tags in common between CAL500 and CAL10K. For details on CAL500, see http://cosmal.ucsd.edu/cal/projects/CAL500/

Acoustic Data:

EchoNestTrackIDs.tab contains the EchoNest Track IDs for each song. (http://developer.echonest.com/)
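
Example (sketch only):

The track ID file named above could be loaded along the following lines. This README does not document the column layout of EchoNestTrackIDs.tab, so the sketch assumes two tab-separated columns per row (a song identifier, then the track ID). The function name read_track_ids and the sample rows are made up for illustration; check the real file before relying on this.

```python
import csv
import io


def read_track_ids(handle):
    """Map song identifier -> Echo Nest track ID from a tab-separated stream.

    ASSUMPTION: each row holds a song identifier followed by a track ID.
    The real layout of EchoNestTrackIDs.tab is not described in the README.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed rows
        mapping[row[0].strip()] = row[1].strip()
    return mapping


# Made-up rows standing in for the real file contents.
sample = io.StringIO("song_a\tTRACK_ID_1\nsong_b\tTRACK_ID_2\n\n")
track_ids = read_track_ids(sample)
```

With the real file, the same function would be called on an open file object, for example: with open("EchoNestTrackIDs.tab", newline="") as f: track_ids = read_track_ids(f)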