Further Adventures in Google Refine: trials and tribulations
I’ve had a bit of a game with Google Refine recently.
Following on from the success I talked about in my last post (Adventures in … Google Refine Pt 1), things went decidedly downhill.
Refine kept freezing and glitching, and I was losing all my data and wasting hours of work.
However, after some very helpful messages on the Google Refine Group I was able to install an updated version of Refine, and deal with the data in a more manageable way.
It appears that, despite being told Refine would be able to handle 85,000 lines of data, breaking it up into sections of 5000 rows is wiser.
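To get a feel for what that split looks like, here is a minimal Python sketch (not Refine itself). It assumes each row gets a section number by integer-dividing its zero-based row index by 5000, which is my understanding of what faceting on the row index does. The function name and the chunk size constant are purely illustrative.

```python
from collections import Counter

CHUNK_SIZE = 5000  # section size, as suggested for keeping Refine responsive


def section_for_row(row_index, chunk_size=CHUNK_SIZE):
    """Return the section number for a zero-based row index."""
    return row_index // chunk_size


# Count how many rows land in each section for an 85,000-row data set.
sections = Counter(section_for_row(i) for i in range(85000))
print(len(sections))  # number of sections
print(sections[0])    # rows in the first section
```

For 85,000 rows that works out to 17 sections of 5000 rows each, numbered 0 to 16.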
—
The process is now as follows (I am currently at stages 6/7/8, so bear with me):
- start with 2 columns – Artist and Album
- Using FACET with row.index / 5000, I split the rows into 5000-line sections: easier to manage = less likely to crash/freeze
- For each 5000 line facet, reconcile the artist column using FREEBASE (do not attempt to fix the failed matches by using DBPedia)
- Manually work through the unmatched options with the suggestions from Freebase
- For items still unmatched, decide (researching if necessary) WHICH musical genre that album/artist belongs in. For example, for Now That’s What I Call Music, I flagged it as POP VARIOUS (this may eventually drop into a POP category, but I can decide that at a later time)
- Working within the 5000 line facets, CREATE A COLUMN FROM FREEBASE based on the Artist.
- Within the Musical Genre tag, constrain to 1 – I don’t want a list of MANY genres, I want just the first one (is this risky??)
- Each time I do this, it creates a NEW column, so I then TRANSFORM the current GENRE column to pull the new values in, using cells["Musical Genres2"].value
- We are still left with blanks for all the artists that we set to a GENRE in Stage 5.
- Using FACET BLANK, we can copy all the data from ARTIST into GENRE by using a modification of the TRANSFORM command above: cells["Artist"].value
- Repeat through all the facets, and tadaaaaa – 3 columns, Album, Artist and Genre
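The logic of stages 8 to 10 can be sketched outside Refine in a few lines of Python. This is only an illustration under my own assumptions: rows are plain dictionaries, the manually chosen genre from stage 5 is stored in the Artist column (which is why it can be copied across), and the first sample row ("Some Band", "Rock") is made-up data. The function name fill_genre is hypothetical.

```python
def fill_genre(row):
    """Pick the genre for one row, roughly mimicking stages 8 to 10."""
    genre = row.get("Genre")
    if genre:
        # Genre was already filled in on an earlier pass; keep it.
        return genre
    fetched = row.get("Musical Genres2")
    if fetched:
        # Freebase returned a genre in the newly created column.
        return fetched
    # Still blank: fall back to the Artist column, which holds the
    # manually assigned genre flag for unmatched rows (assumption).
    return row.get("Artist")


rows = [
    # Made-up example of a row that Freebase matched.
    {"Artist": "Some Band", "Album": "Some Album",
     "Musical Genres2": "Rock", "Genre": None},
    # A row that was flagged by hand in stage 5.
    {"Artist": "POP VARIOUS", "Album": "Now That's What I Call Music",
     "Musical Genres2": None, "Genre": None},
]

for row in rows:
    row["Genre"] = fill_genre(row)
    print(row["Album"], "|", row["Artist"], "|", row["Genre"])
```

The order of the checks matters: an existing genre wins, then the Freebase value, and only then the Artist fallback, so rows that are already filled in are never overwritten.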