Cleaning the Data: What I have and what I need

So following on from my last post (here) I am left with a hell of a big spreadsheet – 83,000+ lines in Excel format – the top 40 album charts, from Jan 1970 until Dec 2010

Spreadsheet Program

As a devotee of Google Docs (which makes flipping between work computers, iPhone and laptop a breeze) my first port of call was to import the file here.

Unfortunately Google Docs simply does not support this amount of data and I imagined smoke billowing from their servers.

Excel was a no-no as I point-blank refuse to purchase a full version of Microsoft Office and the limited version of Excel is, well, limited.

I decided to try OpenOffice

ADD: It has since been suggested to me (by @pigsonthewing and @pezholio) that I switch to LibreOffice  – I’ll report back on this once I’ve had time for a full play.

The Columns

So here I am with an enormous spreadsheet with the following columns

  • URL (containing date of chart)
  • Artist
  • Album
  • Position (1-20 for the first few weeks, and 1-40 following that)
However, I may potentially need the following:
  • Full date
  • Year (I intend to create charts that show patterns in genre buying over years, but also comparing times of the year, months, quarters etc)
  • Month
  • Date
  • Week Number
  • Month Number
  • Quarter
  • Artist
  • Album (won’t be used in the final chart and I am using Artist to decide on genre, but I’m keeping this in to help ID the genre when the Artist reads “Various” or “Cast”
  • Genre
  • Chart Position

Full Date from URL

Full Date: I needed to clean the URL, removing the extraneous data, leaving the Year, month and Date intact.

Deleting the unwanted data by the Replace All function within a huge spreadsheet was potentially as there was a chance I could remove elements of artist name  by mistake.
  • I simply copied the entire column into Notepad (my goto application for quick and easy text editing)
  • I selected common elements of the URL and simply “replaced all” with nothing.
  • I was eventually left with the Date, in a Year/Month/Date format, which I pasted back into the OpenOffice spreadsheet.

Other Date Formats

I decided to leave the other date formats until I had dealt with the GENRE issue – the spreadsheet had the potential to be wider than the page and I wanted to save on scroll time during the GENRE insertion process.
I will return to this post and update this section soon
Enhanced by Zemanta
  1. August 8th, 2011

Leave a comment