With annotators slogging away at the sample of music drawn from Codaich, we're in need of new corpora to annotate. The next big database that we want to mine is the Internet Archive (IA). Unlike Codaich, our private collection, everything at IA is freely available and (non-commercially) re-distributable. However, it is also unlike Codaich in that the song metadata is not always correct, consistent, or even present.
That means that when sampling from IA, we can't hope to enforce the even coverage of subgenres that we achieved with Codaich. In fact, because we don't have access to a complete list of IA's contents, it will be difficult even to pick songs at random from it. And even if we could, IA does not carefully distinguish "music" from "audio", so a list of all audio tracks would include radio shows, podcasts, and other recordings we aren't interested in (for now!).
The solution we're working on is to pick a few important subcollections of IA and sample randomly from each. One of the biggest jewels, for our purposes, is their impressive Live Music Archive, with nearly 80,000 concert recordings from 4,200 artists, most of which appear to be separated by track. IA provides a list of all artists with concert recordings, which greatly simplifies the screen-scraping process: to collect a track, we'll just need to pick a random artist, then a random concert by that artist, and then a random MP3 from that concert to download. For each concert, IA has two metadata files that give all the available information about the song we have downloaded (and the concert it's drawn from).
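The artist, concert, track procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CATALOG` dictionary, its artist and concert names, and the helper functions `sample_track` and `sample_tracks` are all hypothetical stand-ins for whatever the screen scraper would actually collect, and no IA request is made here.

```python
import random

# Hypothetical in-memory catalog: artist -> concert identifier -> track files.
# In practice this would be filled in by the scraper; all names are made up.
CATALOG = {
    "artist_a": {
        "artist_a_2001-05-01": ["t01.mp3", "t02.mp3", "t03.mp3"],
        "artist_a_2002-07-14": ["t01.mp3", "t02.mp3"],
    },
    "artist_b": {
        "artist_b_1999-11-20": ["t01.mp3"],
    },
}


def sample_track(catalog, rng):
    """Pick a random artist, then a random concert, then a random track.

    Assumes every artist has at least one concert and every concert
    has at least one track.
    """
    artist = rng.choice(sorted(catalog))
    concert = rng.choice(sorted(catalog[artist]))
    track = rng.choice(catalog[artist][concert])
    return artist, concert, track


def sample_tracks(catalog, n, seed=0):
    """Draw up to n distinct tracks by repeating the three-stage pick."""
    rng = random.Random(seed)  # fixed seed so a sample can be reproduced
    total = sum(len(tracks)
                for concerts in catalog.values()
                for tracks in concerts.values())
    picked = set()
    while len(picked) < min(n, total):
        picked.add(sample_track(catalog, rng))  # duplicates are ignored
    return sorted(picked)


if __name__ == "__main__":
    for artist, concert, track in sample_tracks(CATALOG, 3):
        print(artist, concert, track)
```

One consequence worth keeping in mind: because the artist is chosen first, every artist is equally likely to be picked no matter how many shows they have, so a track by an artist with a single concert is more likely to be drawn than any one track by a heavily recorded artist. That is a different distribution from picking uniformly over all tracks.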
We're likely to grab the bulk of our IA music in this manner, but we can use the same procedure to grab songs from other subcollections, like the collection of over 10,000 78 RPMs and cylinder recordings. The screen scraping would be a little more tedious, but we can do the same with the huge (over 440,000 songs!) Community Audio collection.
Whatever we pick to annotate, we'll be trying to automatically analyze as much of the above collections as we can in the later stages of the SALAMI project. When that happens, I'll be curious to see what we can learn about the evolution of individual Grateful Dead songs. Amazingly, IA hosts recordings of over 7,000 Dead shows; no doubt that's a lot of recordings of Truckin'!