Data Description

The SALAMI audio data collections are both large and diverse. The range of styles, regions, and time periods included within the totality of these collections is breathtaking: A Cappella to Zydeco, Appalachia to Zambia, and Medieval to Post-Modern.

As Table 1 shows, the SALAMI audio collections are drawn from a range of sources. Each source collection has its own attributes with regard to content, access, and intellectual property issues. The Internet Archive [1] (IA) collection is by far the largest, with ~18,000 hours of audio. IA’s substantial collection of live concert recordings (~66,000 pieces) is particularly exciting and represents a novel source for structural analysis. The DRAM [2] collection has a special focus on the folk, jazz, orchestral, and 20th-century avant-garde musics of the United States. The IMIRSEL and McGill collections are those built up by Downie and Fujinaga, respectively, for their MIR, CM and MIREX evaluation experiments. The IMIRSEL collection has two subcollections of note. First, it has the collected works of 11 major “classical” composers: Bach, Beethoven, Brahms, Chopin, Dvorak, Handel, Haydn, Mendelssohn, Mozart, Schubert, and Vivaldi. Second, it has a subcollection of ~3,000 Latin dance pieces representing 10 different genres from Axé to Tango. The McGill collection has particular strengths in the variety of “world” musics it comprises.

The raw audio files are encoded in a wide variety of audio formats, ranging from low-quality MP3s to CD-quality 44.1 kHz, 16-bit WAV files. Decoding these diverse formats is not an issue, as the SALAMI team already has the necessary tools in hand. The underlying recording quality is another variable to be considered, however, as many recordings have been made outside of professional recording studios using less-than-ideal equipment (for example, the IA live concert recordings). Thus, it will be part of the SALAMI project’s research work to study the effect of recording quality on the accuracy of the structural analyses performed.
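As an illustration of how a file might be screened against the CD-quality end of that range, the following minimal sketch reads a WAV header with Python's standard wave module. It is not a description of the SALAMI team's tools; the function names are hypothetical, and the helper that builds an in-memory file exists only to provide demo input.

```python
import io
import wave


def describe_wav(source):
    """Return (sample_rate_hz, bits_per_sample, channels) for a PCM WAV file.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def is_cd_quality(source):
    """True when the file is 44.1 kHz, 16-bit PCM (channel count is not checked)."""
    rate, bits, _channels = describe_wav(source)
    return rate == 44100 and bits == 16


def make_test_wav(rate, sample_width_bytes, seconds=1):
    """Build an in-memory mono WAV of zero-valued samples (demo input only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (rate * sample_width_bytes * seconds))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    print(describe_wav(make_test_wav(44100, 2)))
    print(is_cd_quality(make_test_wav(22050, 2)))
```

The wave module handles only uncompressed PCM WAV data, so MP3 and other compressed formats would need a separate decoder before a check of this kind could be applied.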

Table 1. The SALAMI Audio Data Collections (Values rounded for readability)

SOURCE COLLECTION	SOURCE ACCESS TYPE	TRACKS	HOURS	SIZE (TB) UNCOMPRESSED
Internet Archive	Open via Internet	276,000 [3]	18,333	11.00
DRAM	Subscription via Internet	8,000	2,300	1.38
IMIRSEL	Closed (Stored locally)	34,000	2,267	1.36
McGill	Closed (Stored locally)	32,000	2,133	1.28
Totals		350,000	25,033	15.02
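The per-collection figures in Table 1 can be cross-checked with a short script. The sketch below recomputes the column sums from the four collection rows and derives the storage rate implied by each row. The rate is an inference from the table rather than a figure stated in this proposal, and the use of decimal units (1 TB = 1000 GB) is an assumption.

```python
# Rows of Table 1: (collection, tracks, hours, uncompressed size in TB).
TABLE_1 = [
    ("Internet Archive", 276_000, 18_333, 11.00),
    ("DRAM", 8_000, 2_300, 1.38),
    ("IMIRSEL", 34_000, 2_267, 1.36),
    ("McGill", 32_000, 2_133, 1.28),
]


def column_totals(rows):
    """Sum the tracks, hours, and size columns across all collections."""
    tracks = sum(row[1] for row in rows)
    hours = sum(row[2] for row in rows)
    size_tb = round(sum(row[3] for row in rows), 2)
    return tracks, hours, size_tb


def gb_per_hour(hours, size_tb):
    """Implied storage per hour of audio, assuming 1 TB = 1000 GB."""
    return size_tb * 1000 / hours


if __name__ == "__main__":
    print("totals (tracks, hours, TB):", column_totals(TABLE_1))
    for name, _tracks, hours, size_tb in TABLE_1:
        print(f"{name}: {gb_per_hour(hours, size_tb):.2f} GB per hour")
```

Every row works out to roughly 0.6 GB per hour, which is close to the familiar approximation of about 10 MB per minute for CD-quality stereo audio.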

Music data is famously problematic with regard to intellectual property issues. Because of the constant fear of litigation that could shut down any future music-related research, the SALAMI team members have been, and will continue to be, absolutely scrupulous in their handling of the raw music audio data. We have obtained specific access and use permissions for analytic research from IA and DRAM (see Letters section). In fact, to better ensure communication with IA and DRAM, each has agreed to place one of its leaders on the SALAMI Advisory Board. Because we need to be ever mindful of source bandwidth resources, we are arranging with DRAM to make a direct hard-drive copy of their data collection (funding for which is already budgeted within the NEMA grant). With regard to the IA collection, the current arrangement that we have worked out with our IA representative is an HTTP-based “trickle” feed across their collections that will be conducted over Fall 2009. The IMIRSEL and McGill collections are owned outright by UIUC and McGill respectively, so no special permission is required.

It is important to stress here that neither this specific SALAMI project nor the more general NEMA project has any desire or intention to deliver raw source audio files to the world. The model both projects have adopted is one that exploits the power of RDF and linked data, whereby original source audio files will be uniquely identified using Uniform Resource Identifiers (URIs). When a scholar makes use of one of the structural analysis files, the file will contain the URI that points to the location where the scholar can legally access the related source audio file. This model is also important for “goodwill.” In working with our source contributors, we want them to see some benefit from their efforts. Thus, when a scholar, student, or general user makes use of a structural file based upon a DRAM (or IA) recording, we acknowledge DRAM’s (or IA’s) contribution by making it clear where the source audio originated and directing the user back to the contributor’s home so that he or she can be made aware of other potentially useful items in the contributing collections.
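To make the linking model concrete, here is a minimal sketch that emits N-Triples statements tying a structural analysis file to its source audio and to the contributor's home page. It is an illustration under stated assumptions, not the projects' actual schema: the analysis and audio URIs are placeholders under example.org, the contributorHome predicate is hypothetical, and the Dublin Core "source" term is simply one plausible choice of predicate.

```python
# Dublin Core term for "a related resource from which this one is derived".
DCTERMS_SOURCE = "http://purl.org/dc/terms/source"
# Hypothetical predicate, invented for this sketch only.
CONTRIBUTOR_HOME = "http://example.org/salami/vocab#contributorHome"


def n_triple(subject, predicate, obj):
    """Format one N-Triples statement whose object is also a URI."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def link_analysis_to_source(analysis_uri, audio_uri, contributor_home):
    """Describe where the source audio for a structural analysis lives."""
    return [
        n_triple(analysis_uri, DCTERMS_SOURCE, audio_uri),
        n_triple(analysis_uri, CONTRIBUTOR_HOME, contributor_home),
    ]


# Placeholder URIs: neither the analysis nor the audio address is real.
triples = link_analysis_to_source(
    "http://example.org/salami/analysis/0001",
    "http://example.org/audio/0001",
    "http://www.archive.org/",
)

if __name__ == "__main__":
    for line in triples:
        print(line)
```

Because the analysis file carries only identifiers, no audio data travels with it; a reader follows the source URI to the contributing collection under that collection's own access terms.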

[1] See http://www.archive.org/.
[2] See http://www.dramonline.org/.
[3] For feasibility reasons, the SALAMI project will be using only a subset of the enormous IA audio collection. IA informs us that the live music collection alone constitutes an astonishing 100,000 hours of music (see Letters section), which is obviously well beyond the scope of this current proposal.