Further Adventures in Google Refine: trials and tribulations

I’ve had a bit of a game with Google Refine recently.
Following on from the success I talked about in my last post (Adventures in … Google Refine Pt 1), things went decidedly downhill.
Refine kept freezing and glitching, and I was losing all my data and wasting hours of work.
However, after some very helpful messages on the Google Refine Group I was able to install an updated version of Refine, and deal with the data in a more manageable way.

It appears that, despite my being told that Refine would be able to handle 85,000 lines of data, breaking it up into sections of 5,000 rows is wiser.
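
To make the arithmetic concrete, here is a rough Python sketch of the same idea. It is not something Refine runs; the function name chunk_number is made up for illustration, and it assumes rows are numbered from 0, as row.index is in Refine.

```python
CHUNK_SIZE = 5000  # rows per section, as described above


def chunk_number(row_index, chunk_size=CHUNK_SIZE):
    """Return the 0-based section that a 0-based row index falls into."""
    # Integer division groups rows 0-4999 into section 0, 5000-9999 into 1, ...
    return row_index // chunk_size


total_rows = 85000
# Round up so that a partly filled final section is still counted.
num_chunks = (total_rows + CHUNK_SIZE - 1) // CHUNK_SIZE
print(num_chunks)  # 17 sections for 85,000 rows
```

So the full data set ends up as 17 sections, each of which can be worked through on its own.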

The process is now as follows (I am currently at stages 6/7/8, so bear with me):

  1. start with 2 columns – Artist and Album
  2. Using FACET with the expression row.index / 5000, I split the rows into 5000-line sections, which are easier to manage and less likely to crash/freeze
  3. For each 5000 line facet, reconcile the artist column using FREEBASE (do not attempt to fix the failed matches by using DBPedia)
  4. Manually work through the unmatched options with the suggestions from Freebase
  5. For items still unmatched, decide (researching if necessary) WHICH musical genre that album/artist belongs in, e.g. for Now That’s What I Call Music, I flagged it as POP VARIOUS (this may eventually drop into a POP category, but I can decide that at a later time)
  6. Working within the 5000 line facets, CREATE A COLUMN FROM FREEBASE based on the Artist.
  7. Within the Musical Genre tag, constrain to 1: I don’t want a list of MANY genres, I want just the first one (is this risky??)
  8. Each time I do this, it creates a NEW column, so I then TRANSPOSE it into the current GENRE column using the TRANSFORM expression cells["Musical Genres2"].value
  9. We are still left with blanks for all the artists that we set to a GENRE in Stage 5.
  10. Using FACET BLANK, we can copy all the data from ARTIST into GENRE by using a modification of the TRANSFORM command above: cells["Artist"].value
  11. Repeat through all the facets, and tadaaaaa – 3 columns, Album, Artist and Genre
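
For anyone who finds the logic of stages 8 to 10 easier to follow as code, here is a minimal Python sketch of what those two transforms amount to. It is only an illustration, not how Refine works internally: the function name fill_genre and the sample rows are invented, and it assumes that the artists flagged by hand in stage 5 have the genre label sitting in the Artist column (as with POP VARIOUS).

```python
def fill_genre(rows):
    """Return rows reduced to three columns: Album, Artist and Genre.

    Each row is a dict with "Artist" and "Album", plus optional "Genre"
    and "Musical Genres2" (the new column fetched from Freebase).
    """
    result = []
    for row in rows:
        genre = row.get("Genre")
        fetched = row.get("Musical Genres2")
        if fetched:
            # Stage 8: move the fetched genre into the Genre column.
            genre = fetched
        if not genre:
            # Stage 10: Genre is still blank, so copy the Artist value,
            # which (by assumption) holds the hand-assigned genre flag.
            genre = row["Artist"]
        result.append(
            {"Album": row["Album"], "Artist": row["Artist"], "Genre": genre}
        )
    return result


# Invented sample rows, for illustration only.
rows = [
    {"Artist": "Example Artist", "Album": "Example Album",
     "Genre": "", "Musical Genres2": "Rock"},
    {"Artist": "POP VARIOUS", "Album": "Now That's What I Call Music",
     "Genre": "", "Musical Genres2": None},
]
result = fill_genre(rows)
```

The first row picks up the fetched genre, while the second falls back to the flag stored in its Artist column.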