Playing in PHP with the World Music Charts Dataset

This page describes some of the quick-and-dirty scripting that I did in preparation for our first meeting on the Lyrics Project (part of the fledgling CiteLab that includes Andrew Piper, Mark Algee-Hewitt and me). The focus in what follows is to narrativize the technical process and code; it’s not (yet) to embark on interpretive meanderings of what we encounter.

One of the more comprehensive and easily accessible databases that we found for the Lyrics Project is the World Music Charts, compiled by Steve Hawtin and contributors. There’s even a handy FAQ page about whether or not spidering of the site is permitted, with a tentative offer to download the full CSV file. As with any exercise like this, it’s worth spending time trying to hunt down the best sources of data available, and this seems like a good choice.

The file contains chart information for songs and albums from various top hits charts across the world (the international scope is great), though it’s weaker in terms of additional metadata, such as musical genre, song length, and record label. And, despite the name of our Lyrics Project, there are no lyrics included.

Still, we thought it would be interesting to begin experimenting with the data, seeing what preliminary things we might discover about trends over time, including for the length of song titles, the repetition of words in titles, and vocabulary trends across decades.

It would have been perfectly legitimate to begin by downloading the data file and starting to do coding from there, but I decided to make downloading of the file part of the scripting task (grabbing multiple data files or dynamic files is often a crucial part of such work, even if it’s not as relevant here).

I’ve written the code here in PHP (though it’s not my favourite language, it does have the merit of being very versatile, especially for the development of web applications). There’s an almost infinite number of ways of writing and running PHP code, but – FYI – for small, experimental projects like this I tend to work in TextMate and run the file directly from the editor (⌘r), not through a server.

Let’s begin by setting the stage:


// set the mime type in case we're running through a web page
header("Content-type: text/plain");

// define the data directory where things will be stored
$datadir = dirname(__FILE__) . '/../data';

// ensure that we can write data
if (!file_exists($datadir)) {
    if (!mkdir($datadir, 0777, true)) {
        die("Unable to create directory: $datadir");
    }
}
Now we want to grab the data file from the World Music Charts site. We first check to see if the file is available locally already. If not, we’ll go fetch the most recent version and save it to our data directory.

// ensure that songs data is available
if (!glob("$datadir/tsort-chart*csv")) {
    // get new filename (which changes as new versions are posted)
    $spiderFaqUrl = ''; // URL of the site's spidering FAQ page (omitted here)
    $spiderFaq = file_get_contents($spiderFaqUrl);
    if (!$spiderFaq) {die("Unable to load spider faq URL: $spiderFaqUrl");}
    if (!preg_match("/tsort-chart[\d-]+\.csv/", $spiderFaq, $match)) {
        die("Unable to find songs CSV file in URL $spiderFaqUrl");
    }

    // now fetch contents
    $songsCsvUrl = '' . $match[0]; // prepend the site's download URL (omitted here)
    $songsCsvFilename = basename($songsCsvUrl);
    if (!file_exists("$datadir/$songsCsvFilename")) {
        // do a simple (if slightly less efficient) read/write of full file
        $songsCsv = file_get_contents($songsCsvUrl);
        if (!$songsCsv) {
            die('Unable to fetch contents of ' . $songsCsvUrl);
        }
        if (!file_put_contents("$datadir/$songsCsvFilename", $songsCsv)) {
            die('Unable to write contents of ' . $songsCsvUrl);
        }
    }
}

Once we’re sure we have the data locally, we can look at every line of the CSV file and build a multi-dimensional array with years, type (artists, song titles, album names), and labels. By creating a hash (key, value pairs) with the labels, we ensure that only one instance of every label is considered per year (unique artists, song titles, album names).

// read the data file
$years = array();
$songsCsvFile = array_pop(glob("$datadir/tsort-chart*csv"));
if (($handle = fopen($songsCsvFile, "r")) !== FALSE) {
    while (($data = fgetcsv($handle)) !== FALSE) {
        if ($data[3] && $data[3] != 'unknown') { // skip unknown years
            if ($data[2]=='song') { // entry type is a song (columns assumed: 0=artist, 1=title, 2=type, 3=year)
                $years[$data[3]]['songs'][$data[1]] = 1; // song title as hash key
                $years[$data[3]]['artists'][$data[0]] = 1; // artist as hash key
            }
            else if ($data[2]=='album') { // entry type is an album
                $years[$data[3]]['albums'][$data[1]] = 1; // album name as hash key
                $years[$data[3]]['artists'][$data[0]] = 1; // artist as hash key
            }
        }
    }
    fclose($handle);
}
else {die("Unable to read songs CSV file from data directory: $datadir");}
ksort($years); // sort the years array in ascending order

Now we have all of this juicy data in memory, we can start transforming it into more directly usable forms. We’ll begin by dumping out the labels for artists, song titles and albums into separate text files, organized by year and also by decade.

// output text data by year
foreach ($years as $year => $data) { // look at each year
    foreach ($data as $type => $labels) { // type: songs, albums, artists
        if ($labels) {
            $contents = array_keys($labels); // grab all labels
            $contents = implode("\n", $contents); // put one on each line
            $basedir = "$datadir/tinfo-$type/years";
            if (!file_exists($basedir)) {mkdir($basedir, 0777, true);}
            file_put_contents("$basedir/$year.txt", $contents);
        }
    }
}

// create decade files
foreach (array('songs', 'albums', 'artists') as $type) {
    $yearsdir = "$datadir/tinfo-$type/years";
    $basedir = "$datadir/tinfo-$type/decades";
    if (!file_exists($basedir)) {mkdir($basedir, 0777, true);}
    for ($i=190;$i<220;$i++) { // decade prefix (190 matches the 1900s, etc.)
        $files = glob("$yearsdir/$i*");
        if ($files) {
            $contents = '';
            foreach ($files as $file) {
                $contents .= file_get_contents($file) . "\n"; // newline keeps labels from running together across files
            }
            file_put_contents("$basedir/$i" . "0s.txt", $contents);
        }
    }
}

We can also go through our accumulated labels and measure average lengths and lexical density for each year. To calculate the average label length, we divide the total number of words by the total number of labels. To calculate the lexical density, or vocabulary richness, we divide the total number of words by the total number of unique words (token/type ratio). We write those columns into a file and then read the file back to the console as output.

// output tabular data
$types = array('songs', 'albums', 'artists');
$file = "$datadir/tsort-by-year.txt";
if (($handle = fopen($file, "w")) !== FALSE) {
    fwrite($handle, 'year');
    foreach ($types as $type) {
        fwrite($handle, "\t" . implode("\t", array($type, "$type density", "$type length")));
    }
    fwrite($handle, "\n");
    foreach ($years as $year => $data) {
        fwrite($handle, $year);
        foreach ($types as $type) {
            $labels = isset($data[$type]) ? $data[$type] : null;
            if ($labels) {
                $words = array();
                foreach (array_keys($labels) as $label) {
                    foreach (str_word_count(strtolower($label), 1) as $word) {
                        $words[$word] = isset($words[$word]) ? $words[$word] + 1 : 1; // tally each word
                    }
                }
                $totalWords = array_sum($words); // total tokens
                $density = $totalWords / count($words); // token/type ratio
                $mean_length = $totalWords / count($labels); // average words per label
                fwrite($handle, "\t" . implode("\t", array(count($labels), $density, $mean_length)));
            }
            else {
                fwrite($handle, "\t\t\t"); // keep columns aligned when a type is missing
            }
        }
        fwrite($handle, "\n");
    }
    fclose($handle);
    echo $file, "\n\n", file_get_contents($file);
}
else {die("Unable to write data file: $file");}

And presto! We’ve now created some initial data files that we can start playing with.

Let’s first generate some trend graphs. We could do that in our favourite spreadsheet program, but those tend to be less interactive and harder to share (with the possible exception of Google Spreadsheets). So we’ll load our World Music Charts data into ManyEyes (you need an account to upload data, but you can visualize existing datasets without an account).

The chart below shows the average length of individual song titles as well as the lexical density (vocabulary richness) of the combined song titles by year.

[Chart: Title Length and Lexical Density (a static image; the original post links to an interactive version)]

We can see a gradual trend of shortening titles over time, with some erratic behaviour early in the dataset (1900 to about 1920). Just as individual song titles seem to get shorter over time, the vocabulary richness seems to decrease as well (the upward slant of the bottom line indicates that the total number of words divided by the number of unique words is increasing – note that this is a token/type ratio instead of the more conventional type/token ratio, so higher values indicate less vocabulary richness).

  • the cat in the hat = 5/4 = 1.25 (lower vocabulary richness, higher score)
  • the cat in a hat = 5/5 = 1 (higher vocabulary richness, lower score)
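The token/type calculation in those bullets is simple enough to sketch in a few lines of PHP (the `token_type_ratio()` function name here is my own, not part of the scripts above):

```php
<?php
// token/type ratio for a title: total words (tokens) divided by
// unique words (types); higher values mean more repetition, i.e.
// less vocabulary richness
function token_type_ratio($title) {
    $tokens = str_word_count(strtolower($title), 1); // all words, as an array
    $types = array_unique($tokens); // unique words only
    return count($tokens) / count($types);
}

echo token_type_ratio("the cat in the hat"), "\n"; // 5/4 = 1.25
echo token_type_ratio("the cat in a hat"), "\n";   // 5/5 = 1
```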

These trends seem promising, but we probably need to dig further into the actual texts in order to better understand what’s happening. Let’s return to our data directory and create a zip archive of the files in the tinfo-songs/decades folder.

Create an archive of files

In English the interface says something like “Archive” on a Mac.

This will create a zip file; we’re going to head over to Voyant Tools and upload it to create a new Voyant Tools corpus.
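For those who prefer to stay in code, the same archive can be produced with PHP’s ZipArchive class. This `zip_directory()` helper is a sketch of my own (not part of the scripts above), and assumes the $datadir layout used earlier:

```php
<?php
// a scripted alternative to the Finder's "Archive" command: zip all
// the .txt files in a folder (e.g. the decades files) into one archive
function zip_directory($srcdir, $zipfile) {
    $zip = new ZipArchive();
    if ($zip->open($zipfile, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== TRUE) {
        die("Unable to create zip file: $zipfile");
    }
    foreach (glob("$srcdir/*.txt") as $file) {
        $zip->addFile($file, basename($file)); // store files at the archive root
    }
    $zip->close();
}

// for the decades files that would be something like:
// zip_directory("$datadir/tinfo-songs/decades", "$datadir/tinfo-songs-decades.zip");
```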

Voyant Tools

By default the result will contain all of the words, including function words like the, a, and, etc. By clicking on the options (gear) icon of the Cirrus word cloud, we can select the English stopword list and produce a more interesting picture:

Among the most notable terms are love and blues. But surely those aren’t evenly distributed across time. One way to find out is to open up the “Words in the Corpus” panel in the bottom left-hand corner and select the two words. What emerges is that the popularity of blues in the title peaked around the 1920s but that the fortunes of love have steadily risen. It’s useful to take a step back and remind ourselves about the variable nature of the data – this is a compilation of chart listings. Because genre information is not included in our data, what we might be witnessing is the peak of the blues as a genre, but not necessarily that the word blues appears less often over time in blues music.

In the Summary panel (left column in the middle) we can see an overview of the distinctive words for each decade:

  • 1900s: old (32), home (24), uncle (16), girl (24), good-bye (12)
  • 1910s: home (39), old (33), little (39), gems (21), girl (33)
  • 1920s: blues (478), man (72), mama (38), old (49), i’m (75)
  • 1930s: blues (294), old (80), moon (51), little (84), blue (61)
  • 1940s: blues (169), boogie (33), nao (30), old (52), que (30)
  • 1950s: baby (124), heart (90), love (330), boogie (35), little (87)
  • 1960s: baby (148), la (101), girl (100), lonely (47), come (76)
  • 1970s: love (408), woman (72), rock (65), man (95), roll (45)
  • 1980s: love (452), night (90), heart (92), like (72), time (84)
  • 1990s: love (428), world (74), life (61), heaven (43), live (39)
  • 2000s: life (49), sorry (24), world (47), like (54), ya (21)
  • 2010s: like (6), tonight (4), blah (3), just (4), night (4)

A few words that stand out for me:

  • boogie (1950s): a useful reminder that the term predates disco
  • nao and que (1940s): evidence of the international scope of the charts
  • heaven (1990s): the material world getting celestial?

Actually, this is a perfect kind of corpus to view using correspondence analysis, where terms are clustered around documents. This is a bit awkward to do at the moment in Voyant, but you can export the corpus (click on the save/diskette icon in the top, blue bar; click on the first link; and add &skin=scatter&stopList=stop.en.taporware.txt to the end of the URL), or load the corpus in the stand-alone tool:

We can see some interesting clusters of words, but we can also see how documents arrange themselves in somewhat logical positions, based solely on the frequency of terms in each decade of titles.

This is obviously just scratching the surface, but the point was more to suggest how we can relatively quickly start working on a new corpus and explore some promising phenomena.
