Adventures in … Google Refine Pt 1

Google Refine is a program that helps you clean up and sort messy, unsorted and incomplete data. (check out the 3 posted tutorials below for more information)

My aim, with Google Refine, was to automatically link my table of Artist and/or album with the first specified GENRE for that artist from one reliable and comprehensive source.

Google Refine helps you LINK your data to data elsewhere, and technically import any extra data linked to that.

In this case, I have Album and Artist, but need Genre

Freebase is an online directory of everything – useful linked information.

We will ask Google Refine to:

  1. Match OUR mention of an artist with an ARTIST PAGE on Freebase
  2. Look for the musical genre of that artist, pick the first one and place it in my third column.
Starting with Google Refine
  • Download and install Google Refine here
  • Prepare your data (I found CSV format more reliable than .xls or Open Office format)
  • Once you have downloaded and installed Google Refine, create a shortcut to the Refine icon.
  • Click the Refine icon (a system window will open – leave this open, otherwise Refine will not work)
  • A tab will open in your browser
  • select your file (and the Advanced Options if necessary), name your project and select Create Project
  • Your data will appear as a table
  • You can edit/move and filter each column by selecting the Arrow at the top of the column
Stage 1 – matching the band
First we need to make sure that Google Refine KNOWS what bands we are talking about. We need to  RECONCILE the column.
  • Click on the arrow at the top of the column you want to reconcile
  • Select Reconcile Now
  • Select the Freebase Reconciliation Service
  • Choose the subject that best matches the column.
  • Click Start Reconciling
Once this has finished, the matched items will have been turned into hyperlinks
However some may not have converted, as Google Refine was unable to find a perfect match.
If it has offered some suggestions, pick the best (you can also ask it to make that selection for any other occurrences in your data)
However, if too much of the data has not been successfully reconciled, I would suggest making another pass at reconciliation – tweaking the category choice, options etc.
In my case, I asked Refine to take the Album title into account when trying to match up the artist.
I made several attempts at this – using different combinations of category, added data etc.
  • Musical Artist – 40% success rate
  • Musical Artist + album consideration – 50% success rate
  • Musical_Group  (applied to remaining 50%) – no additional rows reconciled
I have since discovered there are many possible categories, including:
  • Band
  • Musician
  • Performer
I will now use THESE terms to continue reconciling the dataset
To be continued …

Enhanced by Zemanta

Tackling DBPedia: learning to SPARQL

Following on from my last post (here) I have been having a bit of a play with DBPedia.

ADD: (and I actually forgot to say where I was writing this – thanks to @paulbradshaw for flagging that up)

I have been using Snorql

My Challenge

To create a definitive list of all musicians and bands, along with their genre

Stage 1

I thought I had it with these 2 Queries:

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/MusicalArtist>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}

However, they only produced a limited number of artists. I initially thought it was some error on DBPedia’s part – I know there is a LOT of work going on behind the scenes in getting the linked data sorted out.

However, thanks to a Twitter conversation with PkLeef, I realised it was  – inevitably  – my lack of knowledge of the area.

A query of DBPedia only returns 2000 entries, or solutions, so I would need to filter the entries and carry out more queries to get the complete list.
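That "carry out more queries" idea can be sketched in Python. This is a rough sketch, assuming a `run_query` helper (purely illustrative – not a real library call) that sends a SPARQL string to the endpoint and returns a list of result rows; the base query should carry an ORDER BY so the pages line up consistently:

```python
# Sketch: collect all solutions from an endpoint that caps results per query.
# run_query is an assumed helper: takes a SPARQL string, returns a list of rows.

PAGE_SIZE = 2000  # DBPedia's per-query cap

def fetch_all(run_query, base_query):
    """Re-issue the query with a moving OFFSET until a page comes back short."""
    rows, offset = [], 0
    while True:
        page = run_query(f"{base_query} LIMIT {PAGE_SIZE} OFFSET {offset}")
        rows.extend(page)
        if len(page) < PAGE_SIZE:  # a short page means no more solutions
            return rows
        offset += PAGE_SIZE
```

The same trick would also cover gathering the A's and B's in batches: one paged run per slice of the alphabet, with the results concatenated afterwards.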

I considered several options, such as searching by genre, but there are so many small sub-genres catered for on Wikipedia that it would take far too long.

Instead I decided to query bands by letter – gathering all the A’s, B’s and C’s, then placing them into one list.

I also  suspect there is a way to gather A’s and B’s, C’s and D’s etc

First problem, however – is how to FILTER a query. 

I took my original query above and – in a highly illogical way, stealing bits of code from various SPARQL tutorials – came up with this

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
FILTER(regex(?subject, "^A"))
}    ORDER BY ?subject  LIMIT 1000

Which didn’t return any solutions.

Do I need to apply the FILTER to something other than the subject? Any advice much appreciated
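For the record, my later reading suggests two separate snags with that FILTER (treat this as a best guess): ?subject is a resource URI rather than a plain string, so it should be wrapped in STR() before regex() sees it; and even then, "^A" is anchored at the start of the full URI – which begins "http://…", not "A". A quick Python illustration of the anchoring problem:

```python
import re

# The full URI is what the regex actually sees once the value is a string:
uri = "http://dbpedia.org/resource/Abhorrence"

# Anchoring on "^A" asks whether the WHOLE URI starts with "A" - it doesn't:
starts_with_a = re.search(r"^A", uri)       # None: the URI starts with "h"

# Anchoring on the resource path picks out names beginning with "A".
# (The SPARQL equivalent would be FILTER(regex(STR(?subject), "/resource/A")).)
name_is_a = re.search(r"/resource/A", uri)  # matches
```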

Round 2 commences tomorrow.


Genre-fication: Assigning a Genre to an Artist and a new challenge

As per this post – I now have the unenviable task of assigning a GENRE to more than 83,000 albums.

As I have previously discussed in several posts (Genres: have I missed anything? and Guitars / No Guitars), deciding on GENRE is a tough call – it’s a very personal  decision, and as I have found there is NO definitive list.

I also had to decide how to efficiently attribute the correct genre to  thousands of lines of spreadsheet

Approach 1 – manual

The Theory

In theory, I could speed up the process of manually adding genres by using one of the built-in features of OpenOffice Calc (and most desktop word processors) – Replace All

Method

  • create a duplicate of the Artist column – label it GENRE
  • Using Replace All, replace all occurrences of “The” with nothing – effectively removing them to ease the next stage
  • Using the Replace ALL function, work through the spreadsheet replacing the  Band names with the respective genre. i.e. Replace All “Rolling Stones” with “rock”. Repeat seemingly endlessly.
  • We would eventually end up with a completed genre column, next to the Artist column

Problems

  • Time – this is an incredibly lengthy process.
  • Musical Knowledge – whilst I have a good working knowledge of music, there are many artists I was not aware of and had to research them to discover their genre.
  • Decision – I set the basic genres (pop, rock, dance, urban, plus a genre covering the higher-brow music – classical/jazz/opera/theatre as well as spoken word), but was open to adding more if particular artists dictated it.  However, I still had to make a definitive decision as to the genre. With my own rock bias I was concerned that this data would be flawed
  • Part-name duplication – or as I like to call it, the “James Problem”. If I automatically replaced ALL the appearances of 90s indie band James with ROCK, this would place the word ROCK in the middle of James Last, James Galway and any other appearance of the word James.
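The “James Problem” boils down to substring matching versus whole-cell matching: Replace All matches anywhere inside the text, whereas what’s really wanted is an exact match on the complete artist name. A small Python sketch (the artist list and genre labels here are made up for illustration):

```python
artists = ["James", "James Last", "James Galway"]

# Hypothetical whole-name lookup table:
genres = {"James": "rock", "James Last": "pop", "James Galway": "classical"}

# Naive substring replacement - the spreadsheet Replace All behaviour:
naive = [a.replace("James", "rock") for a in artists]
# "James Last" becomes "rock Last" - the James Problem in action

# Whole-cell lookup - only an exact, full-name match is replaced:
safe = [genres.get(a, a) for a in artists]
```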
I realized I was resorting to Wikipedia to find a definitive genre for many artists. Was there an easier, perhaps automatic way to do this?

Approach 2 – Yahoo Pipes / ScraperWiki

Unperturbed by my previous fallings out with these 2 programs, I wanted to have another bash at scraping the data from the basic site.

I managed to isolate the areas I needed within the page – the most useful part of the page for me was the INFOBOX (highlighted in red on the screengrab below)

Every artist page has a defined GENRE, all I needed was to extract this data into a nice 2 column table.

However, there was the old problem of scraping MANY pages: there was no way of pulling just the MUSIC ARTIST pages from Wikipedia, and asking for GENRE from every page would have pulled in “genre” from films, singles, plays, books etc.

However, during a conversation with Andy Mabbett (@pigsonthewing) I discovered more about DBPedia – a database project making use of the massive amounts of structured data on Wikipedia. (particular page of interest here)

Approach 3 – DBPedia / SPARQL

After my previously failed attempts at ScraperWiki/Python I was sceptical about my abilities to tackle the SPARQL query language but in fact the toughest part was understanding WHAT I should be asking for, as opposed to how.

Method

I began by studying the examples on offer:

SELECT ?subject ?label ?released ?abstract WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Film>.
?subject dbpedia2:starring <http://dbpedia.org/resource/Tom_Cruise>.
?subject rdfs:comment ?abstract.
?subject rdfs:label ?label.
FILTER(lang(?abstract) = "en" && lang(?label) = "en").
?subject <http://dbpedia.org/ontology/releaseDate> ?released.
FILTER(xsd:date(?released) < "2000-01-01"^^xsd:date).
} ORDER BY ?released
LIMIT 20

Which would return a list of abstracts of movies starring Tom Cruise, released before 2000

In theory I would simply change the relevant parts to get what I wanted, in fact a much simpler list of the genre of every band listed on Wikipedia.

There was much of this query that I could remove, and so by trial and error I began to understand how it worked

This

SELECT ?subject WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Film>.
?subject dbpedia2:starring <http://dbpedia.org/resource/Tom_Cruise>.
}

This returned a simple list of all of Tom Cruise’s films, in no particular order, so I knew that the first line determined the detail provided – in short, we were asking for the SUBJECT from the class FILM that starred TOM CRUISE

I wanted to change the CLASS from Film to Music.

DBPedia categorises the THINGS into CLASSES

THINGS

  • Queen Victoria
  • Birmingham
  • Dog
  • Appetite for Destruction
CLASS of those THINGS
  • Monarch
  • City
  • Mammal
  • Album
So in the example above, the individual things are in the class of FILM
I needed to find the CLASS for musical artists.
I found 2 relevant CLASSES – Band and MusicalArtist – I decided to do 2 queries, to cover both solo performers and groups. I would later merge these 2 sets of results.
SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/MusicalArtist>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}

This left me with a very extensive list of very dirty data.

Eg.

:Abhorrence 	:Death_metal
:Assailant 	:Heavy_metal_music
:Assailant 	:Progressive_metal
:Consolation 	:Grindcore
:Consolation 	:Death_metal
:Disen_Gage 	:Fusion_%28music%29

In order to clean this data up I need to remove the colons and underscores – and those “random numbers” are URL-escaped characters (%28 and %29 are the opening and closing brackets).

Again, if we are simply replacing characters, then Notepad is as sturdy as anything else. If we need to replace formatting (eg If I wanted to replace the TAB with a :, then I would use a word processing package in “show all characters” mode)

I also removed mentions of the word “music” (although this did affect the entries for Musical Theatre)

We can also start to group those sub-genres into the ones we need for our charts. It’s fair to group Death Metal, Heavy Metal and Progressive Metal under the broad umbrella of ROCK – so we can go through the list replacing the various sub genres with our header genres. (a list which is still fluid at this stage and to be guided by the artists we come across)
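Both clean-up steps can be scripted rather than done by hand in Notepad. A rough Python sketch – the percent-decoding uses the standard library, while the umbrella mapping below is just my illustrative starting point, to be extended as artists demand:

```python
from urllib.parse import unquote

# Sub-genre -> broad chart genre; an illustrative, deliberately incomplete map.
UMBRELLA = {
    "Death metal": "rock",
    "Heavy metal": "rock",
    "Progressive metal": "rock",
    "Grindcore": "rock",
}

def clean(term):
    """Turn ':Fusion_%28music%29' into 'Fusion'."""
    term = unquote(term).lstrip(":").replace("_", " ")
    # Drop the '(music)' disambiguator and stray lowercase 'music'
    # (lowercase only, so 'Musical Theatre' survives intact):
    return term.replace("(music)", "").replace("music", "").strip()

def umbrella(term):
    cleaned = clean(term)
    return UMBRELLA.get(cleaned, cleaned)
```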

Problems

  • Multiple Genres – many of the artists are given several genre types. As I am dealing with very wide genres, I hope this will not be too great a problem
  • MAJOR PROBLEM SPOTTED: Incomplete list – the combined list of both MusicalArtists and Bands is missing some key performers, e.g. The Beatles and Led Zeppelin. Is this going to be of any use to me?
I have spent the last few hours attempting to scrape Wikipedia, Yahoo Music and MusicBrainz, to no avail, in order to get a fuller artist list.
Any ideas?


Cleaning the Data: What I have and what I need

So following on from my last post (here) I am left with a hell of a big spreadsheet – 83,000+ lines in Excel format – the top 40 album charts, from Jan 1970 until Dec 2010

Spreadsheet Program

As a devotee of Google Docs (which makes flipping between work computers, iPhone and laptop a breeze) my first port of call was to import the file here.

Unfortunately Google Docs simply does not support this amount of data and I imagined smoke billowing from their servers.

Excel was a no-no as I point-blank refuse to purchase a full version of Microsoft Office and the limited version of Excel is, well, limited.

I decided to try OpenOffice

ADD: It has since been suggested to me (by @pigsonthewing and @pezholio) that I switch to LibreOffice  – I’ll report back on this once I’ve had time for a full play.

The Columns

So here I am with an enormous spreadsheet with the following columns

  • URL (containing date of chart)
  • Artist
  • Album
  • Position (1-20 for the first few weeks, and 1-40 following that)
However, I may potentially need the following:
  • Full date
  • Year (I intend to create charts that show patterns in genre buying over years, but also comparing times of the year, months, quarters etc)
  • Month
  • Date
  • Week Number
  • Month Number
  • Quarter
  • Artist
  • Album (won’t be used in the final chart and I am using Artist to decide on genre, but I’m keeping this in to help ID the genre when the Artist reads “Various” or “Cast”)
  • Genre
  • Chart Position

Full Date from URL

Full Date: I needed to clean the URL, removing the extraneous data, leaving the Year, month and Date intact.

Deleting the unwanted data with the Replace All function within a huge spreadsheet was potentially risky, as there was a chance I could remove elements of an artist’s name by mistake.
  • I simply copied the entire column into Notepad (my go-to application for quick and easy text editing)
  • I selected common elements of the URL and simply “replaced all” with nothing.
  • I was eventually left with the Date, in a Year/Month/Date format, which I pasted back into the OpenOffice spreadsheet.
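The Notepad approach works, but a regular expression does the same job in one pass. A sketch, with a made-up URL shape – the real chart URLs will differ, so the pattern would need adjusting:

```python
import re

# Hypothetical URL shape, for illustration only:
url = "http://example.com/chart-archive/1970/01/03/albums"

# Pull out the first yyyy/mm/dd run in the URL:
match = re.search(r"(\d{4})/(\d{2})/(\d{2})", url)
if match:
    year, month, day = match.groups()
    full_date = f"{year}/{month}/{day}"  # "1970/01/03"
```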

Other Date Formats

I decided to leave the other date formats until I had dealt with the GENRE issue – the spreadsheet had the potential to be wider than the page and I wanted to save on scroll time during the GENRE insertion process.
I will return to this post and update this section soon

Getting the Data: Outwit Hub

So, after weeks of wrestling with Yahoo Pipes and ScraperWiki, I finally succumbed  to the pay-model of OutWit Hub (more on this below).

I know, I know – I sold out. However, the £25 (or so) for the full version of this web scraper tool has saved me literally hours of work and means I can get stuck into the data itself.

I was also reluctant to ask for assistance from the several Python and Yahoo Pipes experts who offered assistance as I would have to rely on them TOO MUCH at this stage.

NOTE: I fully intend to return to this problem at a later date and finally crack this, as it has frustrated me no end … Yahoo Pipes and ScraperWiki, it’s not over!

Outwit Hub

Outwit Hub: the left-hand column showing the search options available

Outwit Hub is a web data tool (limited version available free as a Firefox extension) that offers a feast of insights into a page including images, lists, tables and links (see image left)

There is also the option to set up your own scraper by setting search terms at the beginning and end of the data that you want.

The information drops into a window, which can be automatically filtered and moved with a “catch” function into another window.

I wanted to scrape a large number of pages, and whilst Outwit Hub has a function which allows a scraper to run over an entire site by using the “next” page link, the site I was scraping did not have this “next” structure.

This was my solution:

  • I asked OutWit Hub to search the entire site for ALL the URLs of the pages I needed – specifying them by a common term. (This required me to purchase the full version of Outwit Hub, as the free version limits you to 100 items within a search – far fewer than I required)
  • I then set up a scraper and asked it to search these URLs for Artist, Album title and position. (However, I knew that at a later stage I would also need the DATE of the chart, something I would have to extract from the URL)
  • Once the scraper had found the correct information I saved it, and exported it into Excel.

Data: can we have it all?

An interesting moral and legal dilemma hit me today, as I was deliberating how to scrape a large amount of data from a website.

Should I be doing this?

There has been plenty of debate, and campaigning, for free public data, and there are a host of websites that make excellent use of it for the public good (e.g. TheyWorkForYou.com, FixMyStreet). (notes from Talk About Local ’09 on Open Data). There is no moral or legal dilemma here – the data is public and has been made available for public use/manipulation.

However, what happens when the data is NOT public, and has not been released, but is located on a website in a “scrapeable” form?

There have been several examples of attempted lawsuits – where a host company claimed another firm was breaking the law by scraping their data, and potentially infringing on their business model. (USA Cvent v. Eventbrite)

This debate has also reared its head in the issue of newspaper paywalls – is it OK for content to be scraped over the wall for free? Search engines claim their actions actually drive business to the pay-for site; the bosses aren’t so sure.

Unfortunately it’s not as simple as companies keeping their valuable material under lock and key – often the data has a certain value in being visible online (even if that is to attract payment at a later time), whilst another value comes in being able to sell the downloaded data to a 3rd party.

If we then muscle in, with our scrapers and pipes at the ready, are we doing wrong?

Morally, many would say this comes down to a variety of issues, the so-called “victim”, the extent of the loss and what the intentions are with the data – e.g. financial, malicious, educational.

There is another issue. Often a website will deem the “reproduction, transfer, transmission or dissemination” of the data as an infringement of their copyright.

Is this claim worth anything or does it have as much power as a “breakages must be paid for” sign in a local shop? (thanks @pigsonthewing)

Yahoo Pipes and the gnashing of teeth

So now I’ve decided to use the Official Chart Company archive of weekly Album charts (example page here) I now need to work out how to get all of this onto one spreadsheet.

I’ve tried several methods … Including Yahoo Pipes (my attempt here) however I’ve hit a few snags along the way!

Any assistance or advice much appreciated!


Have I been looking at charts too long – or does this kinda make sense?

Ok, it’s kinda crazy, but give it a moment – it kinda works … right? (shout up if not … I may have been staring at this thing for too long)

If I wanted to know the general trends of a music genre over time, with an idea of what had happened along the way – the POLYGON option on Tableau seems to do it.

It also works without explanation, I think.

However, if you DO want a few more details – here goes:

I took the Top 10 albums of the year from 1980 – 2009, then categorized each one and gave it a rating, depending on its chart position.

I added up the total values for each genre for each year, and  – ta-daaaa. The chart above.

Does it take anything away?

I don’t believe so – in this case it shows the ebb and flow and, interestingly, the battle between Pop and Rock for the top spot.

Unfortunately it doesn’t bode well for Rock does it?

NOTE: I love the fact the blue peak shows the popularity of the Now and other compilation tapes in the mid-late 80’s.

Modifications?

Perhaps switch the position of Urban and Compilation – so both polygons are visible.

I’m also going to revisit my genre categorisation for Dance in the 80s – that seems unusually high.

Any thoughts? Please feel free to comment below …

1’s and 0’s, a monster spreadsheet and a pretty/pointless chart

So last night I finally took the bull by the horns and got stuck into some Chart analysis.

I have spent the last few days looking for that PERFECT set of Album chart data, and I realize that search could take me forever.

I want data that:

  • covers a substantial timescale  – at least back to 1980 (beyond if possible)
  • is not too detailed (i.e. “of the year” as opposed to monthly or weekly – which would look cluttered as I’m covering such a long time period)
  • Gives a decent number of entries (ideally Top 40)

So, ideally I am looking for the end of year Album Charts from 1980 – 2010

Oh, and whilst the Official Charts Company has an amazing archive of retro album charts (here), it’s a NIGHTMARE to extract (although if I was any good at ScraperWiki I may be able to nail it) and is not available to send out (according to a recent email conversation with the OCC)

And I found it, here: Chart Archive – everyHit.com

The site has helpfully listed the Top selling albums for each year from the 60’s to the present day. The only downside is that it only provides the Top 10 albums, not the Top 40

(I have contacted the site to see if they have larger charts, I am awaiting a response)

Is this enough? Are the Top 10 selling albums of a year, a good enough representation of musical tastes?

I may make up for this, and use the time I am saving in not having to process a Top 40, by taking my research back to 1970.

——–

So last night I started  a MONSTER spreadsheet (I promise to make this available online as soon as possible)

I have still not decided fully on the DATA I wish to use, so I’m doing as much processing as possible now, so I can decide which works best.

The left hand columns are the Top Ten albums from 1980 – 2010

I then marked each one, either No GUITAR, ACOUSTIC GUITAR or ELECTRIC GUITAR with a 1

I repeated this process for GENRE: (rock, dance, urban, pop, compilation, classical/choral/world)

I then created a similar table (in red) which generated the RANKING of each album (10 points for the number 1 album, 9 points for number 2, etc)

E.g.

ranking = 1 × (11 − chart position)

If the square had no “1” entry, the calculation would equal 0
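The scoring rule can be written as a one-line function (the names here are mine):

```python
def ranking(chart_position, present=1):
    """10 points for the No. 1 album, 9 for No. 2 ... 1 for No. 10.

    present is the cell's 1/0 flag: a 0 (no entry) zeroes the score.
    """
    return present * (11 - chart_position)
```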

I then created a series of other tables which collated each year’s results

(lesson learnt: Google Docs is nowhere near as useful as full-blown Excel when it comes to linking data within a document and over different sheets.)


I also created a similar chart simply for Top 10 presence – removing the RATING system.

QUESTION: When you are analyzing a chart, should you take position into account? Or does that distort the facts?

This is the final data I hope to use for my master visualisation.

Here’s a fun little chart I created (thanks Google Docs for this one) – not sure I’m convinced by the layout, but it sure looks pretty!


Question: Albums or Singles?