Adventures in … Google Refine Pt 1

Google Refine is a program that helps you clean up and sort messy, unsorted and incomplete data (check out the 3 posted tutorials below for more information).

My aim with Google Refine was to automatically fill in, for each Artist and/or Album in my table, the first GENRE listed for that artist in one reliable and comprehensive source.

Google Refine helps you LINK your data to data held elsewhere, and then import any extra data linked to it.

In this case, I have Album and Artist, but need Genre.

Freebase is an online directory of everything: a large collection of useful linked information.

We will ask Google Refine to:

  1. Match OUR mention of an artist with an ARTIST PAGE on Freebase
  2. Look for the musical genre of that artist, pick the first one and place it in my third column.
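To make the two steps concrete, here is a minimal Python sketch. It is not how Google Refine or Freebase work internally: the tiny in-memory directory, the page ids and the function names are all made up for illustration, and the matching is a plain case-insensitive name comparison.

```python
# Hypothetical stand-in for Freebase: page id -> artist page.
# None of these ids or entries are real Freebase data.
TOY_DIRECTORY = {
    "/toy/artist/example_band": {"name": "Example Band", "genres": ["Rock", "Pop"]},
    "/toy/artist/sample_singer": {"name": "Sample Singer", "genres": ["Jazz"]},
}


def match_artist(name, directory):
    """Step 1: find the page whose name equals ours (ignoring case and outer spaces)."""
    wanted = name.strip().lower()
    for page_id, page in directory.items():
        if page["name"].lower() == wanted:
            return page_id
    return None  # no page found for this artist


def first_genre(page_id, directory):
    """Step 2: return the first genre listed on the page, or "" if there is none."""
    if page_id is None:
        return ""
    genres = directory[page_id]["genres"]
    return genres[0] if genres else ""


def fill_genres(rows, directory):
    """Add a third column (genre) to each (artist, album) row."""
    return [
        (artist, album, first_genre(match_artist(artist, directory), directory))
        for artist, album in rows
    ]
```

Rows whose artist cannot be matched keep an empty genre, which mirrors the unreconciled rows you will see later in Refine.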
Starting with Google Refine
  • Download and install Google Refine here
  • Prepare your data (I found CSV format more reliable than .xls or Open Office format)
  • Once you have downloaded and installed Google Refine, create a shortcut to the Refine icon.
  • Click the shortcut (a system window will open; leave this open, otherwise Refine will not work)
  • A tab will open in your browser
  • Select your file (and the Advanced Options if necessary), name your project and select Create Project
  • Your data will appear as a table
  • You can edit/move and filter each column by selecting the Arrow at the top of the column
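If your data does not start life as a CSV file, the "prepare your data" step could be scripted. The sketch below uses Python's standard csv module; the column names Artist, Album and Genre and the helper name are my own assumptions, not something Refine requires.

```python
import csv
import io

# Assumed column layout: two filled columns plus an empty Genre column to fill later.
HEADER = ["Artist", "Album", "Genre"]


def to_csv_text(rows):
    """Turn (artist, album) pairs into CSV text with an empty Genre column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    for artist, album in rows:
        # Trim stray spaces so they do not get in the way of matching later.
        writer.writerow([artist.strip(), album.strip(), ""])
    return buffer.getvalue()
```

To save the result, write the returned text to a file opened with open(path, "w", newline="", encoding="utf-8"). The csv module quotes any value that contains a comma, so names like "Artist, The" stay in one column.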
Stage 1 – matching the band
First we need to make sure that Google Refine KNOWS what bands we are talking about. We need to RECONCILE the column.
  • Click on the arrow at the top of the column you want to reconcile
  • Select Reconcile Now
  • Select Freebase Reconciliation Service
  • Choose the subject that best matches the column.
  • Click Start Reconciling
Once this has finished, it will have turned the items into hyperlinks.
However, some may not have converted, as Google Refine was unable to find a perfect match.
If it has offered some suggestions, pick the best (you can also ask it to apply that selection to any other occurrences in your data).
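The idea of "no perfect match, but some suggestions" can be illustrated with a short Python sketch. This is only an analogy: it uses difflib from the standard library for fuzzy string matching, which is not the method Refine or Freebase use, and the function name and cutoff value are assumptions.

```python
import difflib


def suggest(name, known_names, limit=3, cutoff=0.6):
    """Return up to `limit` known names that look similar to `name`, best first."""
    # Compare in lower case, but hand back the names as originally spelled.
    by_key = {known.lower(): known for known in known_names}
    close = difflib.get_close_matches(
        name.strip().lower(), list(by_key), n=limit, cutoff=cutoff
    )
    return [by_key[key] for key in close]
```

An empty list means there is nothing similar enough to offer, which is the situation where another reconciliation pass with different settings is worth trying.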
If too many rows remain unreconciled, I would suggest making another pass at reconciliation, tweaking the category choice, options etc.
In my case, I asked Refine to take the Album title into account when trying to match up the artist.
I made several attempts at this, using different combinations of category, added data etc.
  • Musical Artist – 40% success rate
  • Musical Artist + album consideration – 50% success rate
  • Musical_Group  (applied to remaining 50%) – no additional rows reconciled
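The bookkeeping behind figures like these can be sketched in Python. This assumes each pass only looks at rows that are still unmatched and that the percentage is cumulative over the whole table, which is my reading of the numbers above rather than something Refine reports itself. The matcher functions are hypothetical placeholders for "reconcile with these settings".

```python
def success_rate(matched, total):
    """Percentage of rows matched so far (0.0 for an empty table)."""
    if total == 0:
        return 0.0
    return 100.0 * matched / total


def run_passes(names, passes):
    """Apply each (label, matcher) pass to the rows still unmatched.

    `matcher(name)` returns True when that pass manages to match the name.
    Returns a list of (label, cumulative success rate in percent).
    """
    total = len(names)
    remaining = list(names)
    report = []
    for label, matcher in passes:
        # Keep only the rows this pass failed to match.
        remaining = [name for name in remaining if not matcher(name)]
        report.append((label, success_rate(total - len(remaining), total)))
    return report
```

A pass that adds nothing simply repeats the previous percentage, as the third attempt did for me.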
I have since discovered there are many possible categories including
I will now use THESE terms to continue reconciling the dataset
To be continued …
