NOTICE: As of November 10th, 2017, we have been obliged to take down access to the MLHD.

This request leaves us with no option but to remove the dataset.

If you need further information about this circumstance, please contact Gabriel Vigliensoni.

The Music Listening Histories Dataset (MLHD) is a large-scale collection of music listening events assembled from more than 27 billion time-stamped logs extracted from Last.fm.

Attractive features of the MLHD are:

Dataset structure

The logs in the dataset are organized in the form of sanitized listening histories per user, where each user has one file, with one log per line.

Each log is a quadruple <timestamp, artist-MBID, release-MBID, recording-MBID>.
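As a rough illustration of the quadruple described above, the following sketch parses one log line into a typed record. It assumes that the four fields are tab-separated, that the timestamp is a Unix timestamp in seconds, and that a missing MBID appears as an empty field; none of these details are stated here, so check them against the actual files. The names ListeningLog and parse_log_line are hypothetical, and the MBIDs in the usage example are made-up placeholders.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class ListeningLog(NamedTuple):
    """One listening event: <timestamp, artist-MBID, release-MBID, recording-MBID>."""
    timestamp: datetime
    artist_mbid: Optional[str]
    release_mbid: Optional[str]
    recording_mbid: Optional[str]


def parse_log_line(line: str, delimiter: str = "\t") -> ListeningLog:
    """Parse one line of a user file (assumed tab-separated, Unix seconds)."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    ts, artist, release, recording = fields
    return ListeningLog(
        timestamp=datetime.fromtimestamp(int(ts), tz=timezone.utc),
        # Assumption: an empty field means the MBID is missing.
        artist_mbid=artist or None,
        release_mbid=release or None,
        recording_mbid=recording or None,
    )


# Usage with placeholder MBIDs (not real data).
example = parse_log_line(
    "1138630578\t00000000-0000-0000-0000-000000000001\t\t"
    "00000000-0000-0000-0000-000000000003\n"
)
print(example.timestamp.isoformat(), example.release_mbid)
```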

To allow easy computation on parallel HPC systems, the dataset is distributed as TAR files with about 1,000 user files each. In total, the full dataset of listening histories is distributed as 576 files of about 1 GB each. (Note: the file MLHD_386.tar does not contain any actual listening history. It is included only to bring the total to 576 files, a number with many divisors, which makes it easier to split the files evenly across parallel workers.)
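To make the point about factors concrete, the sketch below lists the worker counts that divide 576 evenly and assigns a contiguous block of TAR files to each worker. The file-name pattern is inferred from MLHD_386.tar; the assumption that the files are numbered 000 to 575 with three-digit zero padding is ours, and the function names are hypothetical.

```python
def divisors(n: int) -> list[int]:
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def files_for_worker(worker: int, n_workers: int, n_files: int = 576) -> list[str]:
    """Return the TAR files assigned to one worker (0-based index).

    Assumption: files are named MLHD_000.tar ... MLHD_575.tar.
    """
    if n_files % n_workers != 0:
        raise ValueError(f"{n_workers} workers do not divide {n_files} files evenly")
    if not 0 <= worker < n_workers:
        raise ValueError(f"worker index {worker} out of range")
    per_worker = n_files // n_workers
    start = worker * per_worker
    return [f"MLHD_{i:03d}.tar" for i in range(start, start + per_worker)]


# 576 = 2**6 * 3**2, so many worker counts give an even split.
print(divisors(576))
# With 96 workers, each worker processes 6 files.
print(files_for_worker(0, 96))
```

Any of the listed worker counts gives every worker the same number of files, which is the property the note about MLHD_386.tar refers to.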

Additionally, we also provide a set of text files with demographic information, as well as listening habits and behavioural data.


The dataset can be downloaded here:

Features for profiling and describing listeners in the Music Listening Histories Dataset

Accompanying the full data of the MLHD, we also provide a set of text files with additional self-declared demographic information about the listeners, a set of features for describing their listening activity, and a set of features for describing their listening behaviour. The features in these files aim to characterize specific aspects of the listeners.

Each text file comes with a header indicating the column names. The delimiter character is a tab. In order to protect the listeners' identities, all references to their usernames and IDs have been anonymized with UUIDs.

For example, the first three lines of the MLHD_demographics.csv file are:

uuid\t age\t country\t gender\t playcount\t age_scrobbles\t user_type\t registered\t firstscrobble\t lastscrobble
dfb7ea9d-6e4f-48e4-96f6-59abcc207d55\t 30\t AT\t n\t 42622\t 3783\t user\t 1035849600\t 1138630578\t 1362652343
a89cb9c5-ba84-424e-8950-16657bb6f7af\t 35\t US\t m\t 182118\t 3862\t subscrib\t 1035849600\t 1130274207\t 1369498564
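A minimal sketch of reading a file with this layout using Python's standard csv module follows. The sample rows are the ones shown above, embedded as a string; whether the real file has a space after each tab is not clear from the listing, so the reader is configured to tolerate both. The function name read_demographics and the choice of which columns to convert to integers are our assumptions.

```python
import csv
import io

# The header and two rows shown above, with real tab characters.
SAMPLE = (
    "uuid\tage\tcountry\tgender\tplaycount\tage_scrobbles\tuser_type\t"
    "registered\tfirstscrobble\tlastscrobble\n"
    "dfb7ea9d-6e4f-48e4-96f6-59abcc207d55\t30\tAT\tn\t42622\t3783\tuser\t"
    "1035849600\t1138630578\t1362652343\n"
    "a89cb9c5-ba84-424e-8950-16657bb6f7af\t35\tUS\tm\t182118\t3862\tsubscrib\t"
    "1035849600\t1130274207\t1369498564\n"
)

# Assumption: these columns hold integers whenever they are not empty.
INT_COLUMNS = ("age", "playcount", "age_scrobbles",
               "registered", "firstscrobble", "lastscrobble")


def read_demographics(handle):
    """Read tab-delimited demographic rows into a list of dictionaries."""
    # skipinitialspace drops a blank that may follow each tab.
    reader = csv.DictReader(handle, delimiter="\t", skipinitialspace=True)
    rows = []
    for row in reader:
        for name in INT_COLUMNS:
            row[name] = int(row[name]) if row[name] else None
        rows.append(row)
    return rows


listeners = read_demographics(io.StringIO(SAMPLE))
for listener in listeners:
    print(listener["uuid"], listener["country"], listener["playcount"])
```

For a file on disk, the same function can be called with open(path, newline="", encoding="utf-8") in place of the StringIO object.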

Demographic features

The MLHD provides a set of features describing some of the listeners' demographic characteristics.

At the moment of registration, the service asks listeners to declare their year of birth, gender, and country. Each listener's age is updated automatically by the service, and listeners can update their gender and country at any time.

The age reported in the dataset is the age returned by the system at the moment of data collection (i.e., circa 2013 and 2014). The gender and country columns hold the listeners' self-declared gender and country. Since listeners have the option not to declare them, in the dataset we assigned the value NA to a non-declared country (and changed the country code for Namibia to NB), and n to a non-declared gender.
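Because NA normally stands for Namibia in ISO 3166-1 alpha-2 country codes, the recoding described above is easy to get wrong when loading the data. The following sketch maps the dataset's codes back to conventional values; the function names are hypothetical, and representing "not declared" as None is our choice.

```python
from typing import Optional


def decode_country(code: str) -> Optional[str]:
    """Map an MLHD country value to an ISO 3166-1 alpha-2 code, or None."""
    if code == "NA":   # in the MLHD, NA means "country not declared"
        return None
    if code == "NB":   # in the MLHD, NB stands in for Namibia (ISO code NA)
        return "NA"
    return code


def decode_gender(code: str) -> Optional[str]:
    """Map an MLHD gender value to itself, or None when not declared."""
    return None if code == "n" else code


print(decode_country("NA"), decode_country("NB"), decode_country("AT"))
print(decode_gender("n"), decode_gender("m"))
```

If the file is loaded with pandas instead, note that read_csv treats the string NA as a missing value by default; passing keep_default_na=False keeps it as text so that the mapping above can be applied explicitly.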

The playcount column gives the total number of logs within each listener's music listening history.

The age_scrobbles field is the number of days that passed between the first and the last logs recorded in the dataset.

The user_type column gives the "user type" assigned by the service to each user according to their involvement with it: "subscrib" are those who paid a monthly fee to the service for unlimited track streaming and no ads; "user" are people without any special privileges in the "freemium" pricing strategy; "staff", "moderator", and "alumni" are statuses for people who currently work for, or previously worked for, the service.

The registered, firstscrobble, and lastscrobble columns give Unix (UTC) timestamps for each listener's registration, the first log submitted to the service, and the last log stored in the MLHD.
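The timestamp columns can be converted to UTC dates and compared with a few lines of Python. The sketch below uses the values of the first sample row shown earlier; the helper names are hypothetical, and rounding spans up to whole days is our assumption, not a documented rule.

```python
import math
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def to_utc(ts: int) -> datetime:
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def span_in_days(start_ts: int, end_ts: int) -> int:
    """Days between two Unix timestamps, rounded up (assumed rounding)."""
    return math.ceil((end_ts - start_ts) / SECONDS_PER_DAY)


# Values from the first sample row of MLHD_demographics.csv.
row = {"registered": 1035849600,
       "firstscrobble": 1138630578,
       "lastscrobble": 1362652343}

print(to_utc(row["registered"]).date())                          # 2002-10-29
print(span_in_days(row["firstscrobble"], row["lastscrobble"]))   # 2593
print(span_in_days(row["registered"], row["lastscrobble"]))      # 3783
```

For this row, the published age_scrobbles value (3783) coincides with the span from registered to lastscrobble rather than from firstscrobble to lastscrobble, so it may be worth checking against the data which span the column stores before relying on it.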

The CSV file can be downloaded from the following link:

Listening behavioural features

In order to characterize listening behaviours, we provide in the MLHD a set of four computational features tailored to represent some characteristics of music listening behaviour. The features are exploratoryness, mainstreamness, genderedness, and fringeness. Values for these features were computed for the three types of music items in the dataset: artists, albums, and tracks. Therefore, each listener's listening profile is described by a vector of 12 continuous values. For details about the formulation of each of these features, please refer to Vigliensoni and Fujinaga (2016).

User activity features

From the UTC timestamps of the music listening histories, we computed a series of features that aggregate the number of logs in each listening history into several time spans. These low-dimensional representations of user activity are: hourly activity per day, hourly activity by week hour, weekly activity, monthly activity, and yearly activity. The values in each column represent the percentage of a user's total listening logs that fall within each span of time.
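The aggregation described above can be sketched as a histogram of timestamps normalized to percentages. The example covers two of the representations, hour of day (24 bins) and hour of week (168 bins); the exact bin definitions used for the MLHD files are not given here, so starting the week on Monday is our assumption, and the function names and timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def activity_percentages(timestamps, key, n_bins):
    """Percentage of logs falling into each of n_bins bins chosen by key."""
    timestamps = list(timestamps)
    if not timestamps:
        return [0.0] * n_bins
    counts = Counter(
        key(datetime.fromtimestamp(ts, tz=timezone.utc)) for ts in timestamps
    )
    return [100.0 * counts[b] / len(timestamps) for b in range(n_bins)]


def hour_of_day(dt):
    """Bin index 0..23."""
    return dt.hour


def hour_of_week(dt):
    """Bin index 0..167, assuming the week starts on Monday at 00:00 UTC."""
    return dt.weekday() * 24 + dt.hour


# Made-up Unix timestamps: Thursday 00h and 01h, Friday 00h and 01h (UTC).
logs = [0, 3600, 86400, 90000]
hourly = activity_percentages(logs, hour_of_day, 24)
by_week_hour = activity_percentages(logs, hour_of_week, 168)
print(hourly[:2])
print([i for i, value in enumerate(by_week_hour) if value > 0])
```

Each returned vector sums to 100 for a non-empty history, which matches the description of the values as percentages of a user's total logs.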

Scientific references

For details about how the dataset was assembled and how the features were computed, please see the following scientific publications:

If you use the dataset, please cite the following publication:

@inproceedings{vigliensoni2017mlhd,
    Author = {Vigliensoni, Gabriel and Fujinaga, Ichiro},
    Title = {The music listening histories dataset},
    Booktitle = {Proceedings of the 18th International Society for Music Information Retrieval Conference},
    Address = {Suzhou, People's Republic of China},
    Pages = {96--102},
    Year = {2017},
    Keywords = {Dataset, Listening behaviour, Music Preference}
}


We are extremely grateful to all the users of the service who agreed to make their data available for non-commercial use, and to the service itself, which has collected and offered this data uninterruptedly since 2002, helping the field of music informatics research to move forward.

This research has been supported by BecasChile Bicentenario, Comision Nacional de Investigacion Cientifica y Tecnologica, Gobierno de Chile, and the Social Sciences and Humanities Research Council of Canada. Important parts of this work used ComputeCanada’s High Performance Computing resources.

For additional information, questions, problems, or feedback, please contact Gabriel Vigliensoni. In case of issues, please include as much detail as possible.