Cleaning the Data: What I have and what I need
So following on from my last post (here) I am left with a hell of a big spreadsheet – 83,000+ lines in Excel format – the top 40 album charts, from Jan 1970 until Dec 2010
Spreadsheet Program
As a devotee of Google Docs (which makes flipping between work computers, iPhone and laptop a breeze) my first port of call was to import the file here.
Unfortunately Google Docs simply does not support this amount of data and I imagined smoke billowing from their servers.
Excel was a no-no as I point-blank refuse to purchase a full version of Microsoft Office and the limited version of Excel is, well, limited.
I decided to try OpenOffice
ADD: It has since been suggested to me (by @pigsonthewing and @pezholio) that I switch to LibreOffice – I’ll report back on this once I’ve had time for a full play.
The Columns
So here I am with an enormous spreadsheet with the following columns
- URL (containing date of chart)
- Artist
- Album
- Position (1-20 for the first few weeks, and 1-40 following that)
- Full date
- Year (I intend to create charts that show patterns in genre buying over years, but also comparing times of the year, months, quarters etc)
- Month
- Date
- Week Number
- Month Number
- Quarter
- Artist
- Album (won’t be used in the final chart and I am using Artist to decide on genre, but I’m keeping this in to help ID the genre when the Artist reads “Various” or “Cast”
- Genre
- Chart Position
Full Date from URL
Full Date: I needed to clean the URL, removing the extraneous data, leaving the Year, month and Date intact.
- I simply copied the entire column into Notepad (my goto application for quick and easy text editing)
- I selected common elements of the URL and simply “replaced all” with nothing.
- I was eventually left with the Date, in a Year/Month/Date format, which I pasted back into the OpenOffice spreadsheet.