Genre-fication: Assigning a Genre to an Artist and a new challenge

As per this post, I now have the unenviable task of assigning a GENRE to more than 83,000 albums.

As I have previously discussed in several posts (Genres: have I missed anything? and Guitars / No Guitars), deciding on GENRE is a tough call – it’s a very personal decision and, as I have found, there is NO definitive list.

I also had to decide how to efficiently attribute the correct genre to thousands of lines of spreadsheet.

Approach 1 – manual

The Theory

In theory, I could speed up the process of manually adding genres by using one of the built-in features of OpenOffice Calc (and of most home word processors) – Replace All.

Method

  • Create a duplicate of the Artist column and label it GENRE.
  • Using Replace All, replace all occurrences of “The” with nothing, effectively removing them to ease the next stage.
  • Using Replace All again, work through the spreadsheet replacing the band names with the respective genre, e.g. Replace All “Rolling Stones” with “rock”. Repeat seemingly endlessly.
  • We would eventually end up with a completed genre column, next to the Artist column

Problems

  • Time – this is an incredibly lengthy process.
  • Musical Knowledge – whilst I have a good working knowledge of music, there were many artists I was not aware of, and I had to research them to discover their genre.
  • Decision – I set the basic genres (pop, rock, dance, urban, plus a genre covering the higher-brow music – classical/jazz/opera/theatre as well as spoken word), but was open to adding more if particular artists dictated it. However, I still had to make a definitive decision as to the genre. With my own rock bias I was concerned that this data would be flawed.
  • Part-name duplication – or as I like to call it, the “James Problem”. If I automatically replaced ALL the appearances of 90s indie band James with ROCK, this would place the word ROCK in the middle of James Last, James Galway and any other appearances of the word James.
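
One way round the “James Problem” is to match the whole cell rather than part of it. The following is only a rough sketch in Python (not part of the spreadsheet method above); the lookup table and the function names `normalise` and `genre_for` are made up for illustration, and the table holds just two sample entries.

```python
# Hypothetical lookup table: normalised artist name -> genre (sample entries only).
GENRE_BY_ARTIST = {
    "james": "rock",
    "rolling stones": "rock",
}


def normalise(artist):
    """Lower-case the name and drop a leading 'The ' so lookups are consistent."""
    name = artist.strip().lower()
    if name.startswith("the "):
        name = name[4:]
    return name


def genre_for(artist):
    """Return the genre for an exact artist match, or None if the artist is unknown."""
    return GENRE_BY_ARTIST.get(normalise(artist))


# Substring replacement reproduces the "James Problem":
naive = "James Last".replace("James", "rock")  # the cell becomes "rock Last"

# Whole-cell lookup leaves James Last alone until someone decides his genre:
exact = genre_for("James Last")  # None, because he is not in the table
```

An artist that comes back as `None` still needs researching by hand, so this only removes the accidental matches, not the research.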

I realized I was resorting to Wikipedia to find a definitive genre for many artists. Was there an easier, perhaps automatic, way to do this?

Approach 2 – Yahoo Pipes / ScraperWiki

Undeterred by my previous fallings-out with these 2 programs, I wanted to have another bash at scraping the data straight from Wikipedia.

I managed to isolate the areas I needed within the page – the most useful part of the page for me was the INFOBOX (highlighted in red on the screengrab below).

Every artist page has a defined GENRE; all I needed was to extract this data into a nice 2-column table.

However, there was the old problem of scraping MANY pages. There was no way of pulling just the MUSIC ARTIST pages from Wikipedia, and asking for GENRE from every page would have pulled in “genre” from films, singles, plays, books etc.

Then, during a conversation with Andy Mabbett (@pigsonthewing), I discovered more about DBPedia – a database project making use of the massive amounts of structured data on Wikipedia (particular page of interest here).

Approach 3 – DBPedia / SPARQL

After my previously failed attempts at ScraperWiki/Python I was sceptical about my ability to tackle the SPARQL query language, but in fact the toughest part was understanding WHAT I should be asking for, as opposed to how.

Method

I began by studying the examples on offer:

SELECT ?subject ?label ?released ?abstract WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Film>.
?subject dbpedia2:starring <http://dbpedia.org/resource/Tom_Cruise>.
?subject rdfs:comment ?abstract.
?subject rdfs:label ?label.
FILTER(lang(?abstract) = "en" && lang(?label) = "en").
?subject <http://dbpedia.org/ontology/releaseDate> ?released.
FILTER(xsd:date(?released) < "2000-01-01"^^xsd:date).
} ORDER BY ?released
LIMIT 20

This would return the abstracts of up to 20 films starring Tom Cruise that were released before 2000, ordered by release date.

In theory I would simply change the relevant parts to get what I wanted: a much simpler list giving the genre of every band listed on Wikipedia.

There was much of this query that I could remove, and so by trial and error I began to understand how it worked. I started with this cut-down version:

SELECT ?subject WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Film>.
?subject dbpedia2:starring <http://dbpedia.org/resource/Tom_Cruise>.
}

This returned a simple list of all of Tom Cruise’s films in no particular order, so I knew that line 1 (the SELECT line) determined the detail provided. In short, we were asking for the SUBJECT from the class FILM that starred TOM CRUISE.

I wanted to change the CLASS from Film to Music.

DBPedia categorises THINGS into CLASSES.

THINGS

  • Queen Victoria
  • Birmingham
  • Dog
  • Appetite for Destruction

CLASS of those THINGS

  • Monarch
  • City
  • Mammal
  • Album

So in the example above, the individual things are in the class of FILM.

I needed to find the CLASS for musical artists.

I found 2 relevant CLASSES – Band and MusicalArtist – and decided to do 2 queries, to cover both solo performers and groups. I would later merge these 2 sets of results.

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/MusicalArtist>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}
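
The two queries differ only in the class name, and the later merge is just a case of pooling the rows. As a rough sketch in Python (an assumption; nothing here runs against DBPedia), the helper names `genre_query` and `merge_results` are invented for illustration, the `rdf:` prefix is spelled out in case the endpoint does not predefine it, and the solo artist row is a made-up placeholder.

```python
from collections import defaultdict

# Same query shape as above, with the class name left as a blank to fill in.
QUERY_TEMPLATE = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?subject ?genre WHERE {{
  ?subject rdf:type <http://dbpedia.org/ontology/{cls}> .
  ?subject <http://dbpedia.org/ontology/genre> ?genre .
}}"""


def genre_query(cls):
    """Build the genre query text for one DBPedia class, e.g. 'Band'."""
    return QUERY_TEMPLATE.format(cls=cls)


def merge_results(*result_sets):
    """Pool several lists of (artist, genre) rows into artist -> sorted genre list."""
    merged = defaultdict(set)
    for rows in result_sets:
        for artist, genre in rows:
            merged[artist].add(genre)  # a set drops rows that appear twice
    return {artist: sorted(genres) for artist, genres in merged.items()}


queries = [genre_query(cls) for cls in ("MusicalArtist", "Band")]

# Stand-in rows: the band rows echo the sample output in this post,
# the solo row is a hypothetical placeholder with a deliberate duplicate.
band_rows = [(":Assailant", ":Heavy_metal_music"), (":Assailant", ":Progressive_metal")]
solo_rows = [(":Example_Singer", ":Example_genre"), (":Assailant", ":Progressive_metal")]

merged = merge_results(band_rows, solo_rows)
```

Keeping the genres per artist in a set means an artist who turns up in both result sets is only listed once per genre.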

This left me with a very extensive list of very dirty data.

E.g.

:Abhorrence 	:Death_metal
:Assailant 	:Heavy_metal_music
:Assailant 	:Progressive_metal
:Consolation 	:Grindcore
:Consolation 	:Death_metal
:Disen_Gage 	:Fusion_%28music%29

In order to clean this data up I needed to remove the : _ % and those random numbers too! (The numbers are not actually random: %28 and %29 are the URL-encoded forms of the opening and closing brackets.)

Again, if we are simply replacing characters, then Notepad is as sturdy as anything else. If we need to replace formatting (e.g. if I wanted to replace the TAB with a :), then I would use a word processing package in “show all characters” mode.
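
For anyone who would rather script this clean-up than do it by find and replace, here is a minimal sketch in Python (an assumption, not what was used for this post). It relies on the standard library's `urllib.parse.unquote` to decode the %28-style sequences; the function names `clean_token` and `clean_line` are made up, and it assumes each line holds exactly one tab between artist and genre.

```python
from urllib.parse import unquote


def clean_token(token):
    """Turn a raw value such as ':Fusion_%28music%29' into 'Fusion (music)'."""
    text = token.strip().lstrip(":")  # drop stray spaces and the leading colon
    text = unquote(text)              # %28 becomes "(", %29 becomes ")"
    return text.replace("_", " ")     # underscores back to spaces


def clean_line(line):
    """Split one tab-separated result line into a clean (artist, genre) pair."""
    artist, genre = line.split("\t")
    return clean_token(artist), clean_token(genre)


pair = clean_line(":Disen_Gage \t:Fusion_%28music%29")
```

Decoding rather than deleting the % sequences keeps the brackets, which makes qualifiers such as “(music)” easy to spot and deal with afterwards.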

I also removed mentions of the word “music” (although this did affect the entries for Musical Theatre)
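
The Musical Theatre side effect comes from removing the letters “music” wherever they appear. A possible alternative, sketched here in Python as an assumption, is to remove “music” only when it is a separate word at the end of the genre; the name `drop_music` is invented for illustration.

```python
import re

# Matches a trailing word "music", with optional brackets, e.g. " music" or " (music)".
_TRAILING_MUSIC = re.compile(r"\s*\(?\bmusic\b\)?\s*$", flags=re.IGNORECASE)


def drop_music(genre):
    """Remove a trailing 'music' qualifier, leaving words like 'Musical' untouched."""
    cleaned = _TRAILING_MUSIC.sub("", genre).strip()
    return cleaned or genre  # keep the original if nothing would be left
```

With this, “Heavy metal music” becomes “Heavy metal” and “Fusion (music)” becomes “Fusion”, while “Musical theatre” is left as it is.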

We can also start to group those sub-genres into the ones we need for our charts. It’s fair to group Death Metal, Heavy Metal and Progressive Metal under the broad umbrella of ROCK, so we can go through the list replacing the various sub-genres with our header genres (a list which is still fluid at this stage and will be guided by the artists we come across).
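
The grouping step amounts to a lookup table from sub-genre to header genre. A minimal sketch in Python (an assumption; the names `HEADER_GENRE`, `header_genre` and `unmapped` are made up) using only the three groupings mentioned above:

```python
# Sub-genre -> header genre. Only the groupings named in this post are filled in;
# the table is meant to grow as new sub-genres turn up.
HEADER_GENRE = {
    "death metal": "rock",
    "heavy metal": "rock",
    "progressive metal": "rock",
}


def header_genre(sub_genre, default=None):
    """Look up the header genre for a cleaned sub-genre, ignoring case."""
    return HEADER_GENRE.get(sub_genre.strip().lower(), default)


def unmapped(sub_genres):
    """List the sub-genres that still need a decision, without repeats."""
    return sorted({g for g in sub_genres if header_genre(g) is None})


todo = unmapped(["Death metal", "Grindcore", "Fusion", "Grindcore"])
```

Listing the unmapped sub-genres keeps the header list fluid: each new one is a prompt to decide where it belongs rather than a silent gap.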

Problems

  • Multiple Genres – many of the artists are given several genre types. As I am dealing with very wide genres, I hope this will not be too great a problem.
  • MAJOR PROBLEM SPOTTED: Incomplete list – the combined list of both MusicalArtists and Bands is missing some key performers, e.g. The Beatles and Led Zeppelin. Is this going to be of any use to me?
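
One way to judge how much use the list is would be to measure how many of the spreadsheet's artists it actually covers. The following is only a sketch in Python under that assumption; `coverage` and `normalise` are invented names, and the artist lists are tiny stand-ins built from names mentioned in this post.

```python
def normalise(name):
    """Lower-case the name and drop a leading 'The ' before comparing."""
    name = name.strip().lower()
    return name[4:] if name.startswith("the ") else name


def coverage(spreadsheet_artists, dbpedia_artists):
    """Return (share of spreadsheet artists found, sorted list of missing artists)."""
    wanted = {normalise(a) for a in spreadsheet_artists}
    found = {normalise(a) for a in dbpedia_artists}
    missing = sorted(wanted - found)
    share = (len(wanted) - len(missing)) / len(wanted) if wanted else 0.0
    return share, missing


# Stand-in data only: four spreadsheet artists, three artists from the query results.
share, missing = coverage(
    ["The Beatles", "Led Zeppelin", "Assailant", "Abhorrence"],
    ["Assailant", "Abhorrence", "Consolation"],
)
```

The list of missing artists is then the part that would need another source or manual research.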

I have spent the last few hours attempting to scrape Wikipedia, Yahoo Music and MusicBrainz in order to get a fuller artist list, to no avail.

Any ideas?

  1. June 2nd, 2011
