Archive for the ‘research’ Category

History of a Chart 5. “Guess What?”

As I began to look into the subgenres of rock music, I was constantly surprised by which genres were having an impact. I had not expected New Wave to perform as well as it did, and the impact of indie and alternative in more recent years was also significant.
I was looking for a way to recreate the surprise I felt from making these discoveries within a visualization.
I had examined the “reveal” mechanism in “This is the New Flesh” so I was keen to try something else.
I wanted to challenge the users’ preconceptions about the data by creating a simple game. The user is asked to guess the top genres and then simply compare those guesses with the correct answers. Having seen this in action with several users, it is interesting how people are immediately drawn into an internal debate about the genres, and are surprised by the results.

Method

This was the most complicated visualization I developed and made extensive use of ActionScript 3.

Unlike the other charts, the majority of this was built within Flash CS5 (the graphic content was developed in Illustrator). In this case there was no chart to export from Tableau.

  1. Plan out the idea on paper (I use paper or PhatPad on the iPad). I find this to be the easiest and quickest way to imagine how the chart will look.
  2. Build the individual elements in Illustrator – creating circles and outlines and deciding on colour schemes (dictated by the red and black theme running throughout the chart when talking about rock as opposed to the other genres).
  3. Import the elements into Flash CS5
  4. Duplicate the elements to create as many as you need (4 in total)
  5. Convert them all into Button > Movie Clips
  6. Apply the “Drag and Drop” snippet code to each. This allows the element to be moved around the screen.
  7. Convert the “Reveal” circle into a Movie Clip with timeline navigation qualities (move to next keyframe and stop). This let me move the movie on to the next keyframe to reveal the results.
  8. Convert answer circles into rollover buttons to show further information about each genre
  9. Insert mini charts into “OVER” setting of button for each genre

Update, link and thanks

I’ve been incredibly quiet on this blog for the last month or so as I’ve been immersed in the world of visualization. It was always going to be my favourite part of the assignment, but that is not to say it’s been easy going.

I use the Tableau software to create the basic charts, then manipulate them in Illustrator, and then in Flash for the animations.

I find it difficult to talk about HOW I visualize; it’s a very organic process with plenty of chopping and changing of settings to allow the data to “talk”. I will go through this at some point, but for now – here’s the link to the finished pieces.

Thank you to everyone who has helped out so far with advice, or suggestions. It is much appreciated.


Starting the next stage: Visualization

Intro

I’ve been spending the past 200 hours (over a series of months) gathering and cleaning up my data set as part of my final project for an MA in Online Journalism.

This has included:

  • deciding on the best data set to use
  • finding a source
  • scraping that data
  • developing my basic skills in ScraperWiki, Yahoo Pipes and OutWit Hub
  • cleaning up the data with a combination of OutWit Hub, Excel, Word and Notepad (for stripping and replacing)
  • Checking and double checking the data for errors
  • Pivoting the data and reducing it into a usable form (converting a Top 40 list into a single line of Genre counts per chart)
So now I have reduced my 85,000 lines into more like 2,000 – a lot easier to use.
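The pivot described above (one row per album on each chart, collapsed into one line of genre counts per chart) could be sketched in Python roughly as follows. This is a minimal illustration rather than the process actually used: the sample rows, the column order and the function name `genre_counts_per_chart` are all assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows: (chart_date, position, artist, genre).
# The real data set has one such row per album per weekly chart.
rows = [
    ("1970-01-03", 1, "Artist A", "rock"),
    ("1970-01-03", 2, "Artist B", "pop"),
    ("1970-01-03", 3, "Artist C", "rock"),
    ("1970-01-10", 1, "Artist B", "pop"),
    ("1970-01-10", 2, "Artist D", "dance"),
]


def genre_counts_per_chart(rows):
    """Collapse one-row-per-album data into one genre tally per chart date."""
    counts = defaultdict(Counter)
    for chart_date, _position, _artist, genre in rows:
        counts[chart_date][genre] += 1
    # Convert the Counters to plain dicts so they are easy to export.
    return {chart_date: dict(tally) for chart_date, tally in counts.items()}


pivoted = genre_counts_per_chart(rows)
for chart_date in sorted(pivoted):
    print(chart_date, pivoted[chart_date])
```

Each chart date ends up as a single record, which is the shape a charting tool needs to plot genre share over time.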
I am now at the stage where I can begin to think about the visualization.
With previous projects I have used ManyEyes and Tableau (my favourite) and used the ability to change settings quickly to PLAY with the options, and try a huge number of different charts.
As I am working within the bounds of a project called Is Rock Dead?, I must not forget some of the main facets of the project (initially set out in my MA Proposal Document PDF, MA Online Journalism Proposal Rock is Dead-3). (NOTE: My tutor Paul Bradshaw advised that I focus on ONE element of this to avoid “mission creep”.)
So I will focus on the larger GENRE chart first and take it from there!
  • to answer the question – Is Rock Dead? Is it on its last legs?
  • Does music genre go in cycles – radio/music professionals claim rock, pop and dance/electro go in cycles over time – is this true? What are those cycles?
  • What is powering the rock genre in the charts? What is the future looking like for rock?
If I have time I would also like to produce some extra visualizations showing the breakdown of the genre in the charts,  the hot genres, the dead ones and perhaps WHY some of them peak at other times.
In order to show the pattern of genre over time, I need to produce a version of David McCandless’ Mountains Out Of Molehills Interactive – do I simply copy this, or produce a version of it?
I must confess, I’m not a HUGE fan of 3D charts – I think anything that is asking the eye to compare two values should not place them on a third axis …

The chart wins in that NONE of the values are obscured, but it’s hard to compare them.

Whilst I think this chart is very interesting, I am not sure it’s EXACTLY what I am after.

Over the next few days I will be using various visualization tools to see which are the most effective – watch out for a blog post on these at some point.
I am now experimenting with various chart shapes and designs – you can check them out on Flickr – I’d love to get your feedback on what works, what doesn’t, etc.!

Square pegs, round holes and a mighty big hammer – genre defining

When I started this project I was determined to filter the dozens and dozens of sub genres of music onto a few (fewer than 10) master genres. I knew this would be difficult, but I was convinced that – with some hard work, tough decisions and an emotional detachment I would get there in the end.

Why so few master genres?

My eventual aim is to create an all-encompassing visualisation of the entire time scale of my project (40 years), and any more genres than that would make the visualisation cluttered and useless. Plus, many of the smaller genres would simply disappear in the larger image.

However, am I removing vital elements of the visualisation if I put them in a master category?

It’s a tough one – I developed, fairly early on, a definitive list of genres – although I knew these would bend and shift.

  • pop
  • rock
  • classical / orchestral / performance / theatre
  • easy listening
  • entertainment (incl. spoken word, comedy, children’s, fitness albums etc.)
  • soundtrack
  • dance
  • electronic
I know this poses a ton of questions:
  • Should Dance and Electronic be merged? They have a similar sound and use of instruments
  • Do I have a soundtracks category out of pure laziness? Should I not go through each of them and assign a proper genre? And if not, should there be a COMPILATION soundtrack category (a category I have now merged into pop)?
  • Where does Jazz go? and Reggae? What about SOUL?
  • Am I removing a key category by putting RnB into pop?
  • Easy listening – I developed this for the POPULAR music that does not belong in POP – Val Doonican etc. However, am I simply moving it out of POP because it is “old”? Also, am I just putting what I consider to be the “dull stuff” in there?

However, the biggest question is – do I need a pop category at all?

The focus of my project is ROCK – so there has to be a rock category – but does POP = ROCK in terms of a category size?

I wanted my final categories to be as equal as possible – not in size but in classification terms. If I was going to put FOLK under ROCK, then should I put RnB under POP?

Any help, advice or thoughts much appreciated …

Further Adventures in Google Refine: trials and tribulations

I’ve had a bit of a game with Google Refine recently.
Following on from the success I talked about in my last post (Adventures in … Google Refine Pt 1), things went decidedly downhill.
Refine was often freezing, glitching and I was losing all my data and wasting hours of work.
However, after some very helpful messages on the Google Refine Group I was able to install an updated version of Refine, and deal with the data in a more manageable way.

It appears that, despite my being told that Refine would be able to handle 85,000 lines of data, breaking it up into sections of 5,000 rows is wiser.

The process is now as follows (I am currently at stages 6/7/8, so bear with me):

  1. start with 2 columns – Artist and Album
  2. Using FACET with row.index / 5000, I split the data into 5,000-line sections – easier to manage = less likely to crash/freeze
  3. For each 5000 line facet, reconcile the artist column using FREEBASE (do not attempt to fix the failed matches by using DBPedia)
  4. Manually work through the unmatched options with the suggestions from Freebase
  5. For items still unmatched, decide – researching if necessary – WHICH musical genre that album/artist belongs in. e.g. for the Now That’s What I Call Music – I flagged it as POP VARIOUS (this may eventually drop into a POP category, but I can decide that at a later time)
  6. Working within the 5000 line facets, CREATE A COLUMN  FROM FREEBASE based on the Artist.
  7. Within the Musical Genre tag, constrain to 1 – I don’t want a list of MANY genres, I want just the first one (is this risky??)
  8. Each time I do this, it creates a NEW column, so I then TRANSPOSE it into the current GENRE column using cells["Musical Genres2"].value
  9. We are still left with blanks for all the artists that we set to a GENRE in Stage 5.
  10. Using FACET BLANK, we can copy all the data from ARTIST into GENRE by using a modification of the TRANSFORM expression above: cells["Artist"].value
  11. Repeat through all the facets, and tadaaaaa – 3 columns, Album, Artist and Genre
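For readers who think more easily in code, the logic of the steps above can be sketched in plain Python. This is not Refine itself, only an approximation of the same idea: split rows into sections by row index, take the first genre returned for an artist, and fall back to the hand-set value in the Artist column when nothing was returned. The names `chunk_rows`, `fill_genre` and `looked_up`, and the sample data, are hypothetical stand-ins for the Freebase lookup.

```python
def chunk_rows(rows, size=5000):
    """Group rows into sections keyed by row index // size (as in step 2)."""
    sections = {}
    for index, row in enumerate(rows):
        sections.setdefault(index // size, []).append(row)
    return sections


def fill_genre(row, looked_up_genres):
    """Use the first looked-up genre; otherwise copy the hand-set Artist value."""
    genres = looked_up_genres.get(row["Artist"], [])
    genre = genres[0] if genres else row["Artist"]
    return {**row, "Genre": genre}


# Hypothetical data: one artist with looked-up genres, one flagged by hand.
rows = [
    {"Artist": "Example Band", "Album": "Example Album"},
    {"Artist": "POP VARIOUS", "Album": "Now That's What I Call Music"},
]
looked_up = {"Example Band": ["Hard rock", "Blues rock"]}

result = [fill_genre(row, looked_up) for row in rows]
for row in result:
    print(row["Album"], "|", row["Artist"], "|", row["Genre"])
```

Taking only the first genre is the same trade-off questioned in step 7: it keeps the column tidy, but depends entirely on the order in which the source lists an artist's genres.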

Tackling DBPedia: learning to SPARQL

Following on from my last post (here) I have been having a bit of a play with DBPedia.

ADD: (and I actually forgot to say where I was writing this – thanks to @paulbradshaw for flagging that up)

I have been using Snorql.

My Challenge

To create a definitive list of all musicians and bands, along with their genre

Stage 1

I thought I had it with these 2 Queries:

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/MusicalArtist>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}

However, they only produced a limited number of artists. I initially thought it was some error on DBPedia’s part – I know there is a LOT of work going on behind the scenes in getting the linked data sorted out.

However, thanks to a Twitter conversation with PkLeef, I realised it was  – inevitably  – my lack of knowledge of the area.

A query of DBPedia only returns 2000 entries, or solutions, so I would need to filter the entries and carry out more queries to get the complete list.

I considered several options, such as searching by genre, but there are so many small sub-genres catered for on Wikipedia that it would take far too long.

Instead I decided upon querying bands by letter – so gather all the A’s, B’s and C’s then place them into one list.

I also suspect there is a way to gather A’s and B’s together, C’s and D’s together, etc.

The first problem, however, is how to FILTER a query.

I took my original query above and – in a highly illogical way, stealing bits of code from various SPARQL tutorials – came up with this:

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
FILTER(regex(?subject, "^A"))
}    ORDER BY ?subject  LIMIT 1000

Which didn’t return any solutions.

Do I need to apply the FILTER to something other than the subject? Any advice much appreciated
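One likely reason the filter finds nothing: ?subject is a URI rather than a plain string, and SPARQL's regex() expects a string. Wrapping it in str() gives the full address (for DBPedia resources this begins http://dbpedia.org/resource/), so the pattern has to be anchored on that prefix rather than on the bare letter. The Python below is a hedged sketch that only builds the 26 per-letter query strings; it does not contact DBPedia, and `build_letter_queries` and `QUERY_TEMPLATE` are names made up for this illustration.

```python
import string

# Prefix that str(?subject) is assumed to start with for DBPedia resources.
RESOURCE_PREFIX = "http://dbpedia.org/resource/"

# Same shape as the query above, but filtering on str(?subject).
# The unescaped dots in the prefix match any character in a regex,
# which is harmless here.
QUERY_TEMPLATE = """SELECT ?subject ?genre WHERE {{
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
FILTER(regex(str(?subject), "^{prefix}{letter}"))
}} ORDER BY ?subject"""


def build_letter_queries(letters=string.ascii_uppercase):
    """Return one query string per initial letter, keyed by that letter."""
    return {
        letter: QUERY_TEMPLATE.format(prefix=RESOURCE_PREFIX, letter=letter)
        for letter in letters
    }


queries = build_letter_queries()
print(queries["A"])
```

Each generated query can be pasted into Snorql in turn. Bands whose names start with a digit or punctuation would need their own pattern, and a very common letter may still hit the 2000-result ceiling.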

Round 2 commences tomorrow.


Cleaning the Data: What I have and what I need

So following on from my last post (here) I am left with a hell of a big spreadsheet – 83,000+ lines in Excel format – the top 40 album charts, from Jan 1970 until Dec 2010

Spreadsheet Program

As a devotee of Google Docs (which makes flipping between work computers, iPhone and laptop a breeze) my first port of call was to import the file here.

Unfortunately Google Docs simply does not support this amount of data and I imagined smoke billowing from their servers.

Excel was a no-no as I point-blank refuse to purchase a full version of Microsoft Office and the limited version of Excel is, well, limited.

I decided to try OpenOffice

ADD: It has since been suggested to me (by @pigsonthewing and @pezholio) that I switch to LibreOffice  – I’ll report back on this once I’ve had time for a full play.

The Columns

So here I am with an enormous spreadsheet with the following columns

  • URL (containing date of chart)
  • Artist
  • Album
  • Position (1-20 for the first few weeks, and 1-40 following that)
However, I may potentially need the following:
  • Full date
  • Year (I intend to create charts that show patterns in genre buying over years, but also comparing times of the year, months, quarters etc)
  • Month
  • Date
  • Week Number
  • Month Number
  • Quarter
  • Artist
  • Album (won’t be used in the final chart, as I am using Artist to decide on genre, but I’m keeping this in to help ID the genre when the Artist reads “Various” or “Cast”)
  • Genre
  • Chart Position
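Most of the date columns in that list can be derived from the full date alone. The Python below is a rough sketch of how, assuming the date has already been parsed; the field names are illustrative, and the week number uses the ISO calendar, which is only one of several week-numbering conventions.

```python
from datetime import date


def date_fields(chart_date):
    """Derive the calendar columns listed above from a single chart date."""
    _iso_year, iso_week, _weekday = chart_date.isocalendar()
    return {
        "full_date": chart_date.isoformat(),
        "year": chart_date.year,
        "month": chart_date.strftime("%B"),  # month name, locale-dependent
        "date": chart_date.day,
        "week_number": iso_week,  # ISO week: an assumption, not the only option
        "month_number": chart_date.month,
        "quarter": (chart_date.month - 1) // 3 + 1,
    }


fields = date_fields(date(2010, 12, 25))
for name, value in fields.items():
    print(name, value)
```

Deriving these on demand also avoids widening the spreadsheet until the columns are actually needed.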

Full Date from URL

Full Date: I needed to clean the URL, removing the extraneous data, leaving the Year, month and Date intact.

Deleting the unwanted data with the Replace All function within a huge spreadsheet was potentially risky, as there was a chance I could remove elements of an artist name by mistake.
  • I simply copied the entire column into Notepad (my go-to application for quick and easy text editing)
  • I selected common elements of the URL and simply “replaced all” with nothing.
  • I was eventually left with the Date, in a Year/Month/Date format, which I pasted back into the OpenOffice spreadsheet.
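The same clean-up could also be done with a pattern match applied only to the URL column, which sidesteps the risk of touching artist names. The sketch below is an illustration under a loud assumption: the real URL format is not shown in this post, so the example address and the year/month/day pattern are hypothetical.

```python
import re
from datetime import date

# Assumed pattern: a four-digit year, then month and day, separated by
# "/" or "-". The real chart URLs may be laid out differently.
DATE_IN_URL = re.compile(r"(\d{4})[-/](\d{1,2})[-/](\d{1,2})")


def date_from_url(url):
    """Return the date embedded in a chart URL, or None if none is found."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Hypothetical URL, for illustration only.
print(date_from_url("http://example.com/charts/albums/1970/01/03/"))
```

Rows where the function returns None can then be checked by hand instead of being silently mangled.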

Other Date Formats

I decided to leave the other date formats until I had dealt with the GENRE issue – the spreadsheet had the potential to be wider than the page and I wanted to save on scroll time during the GENRE insertion process.
I will return to this post and update this section soon

Data: can we have it all?

An interesting moral and legal dilemma hit me today, as I was deliberating how to scrape a large amount of data from a website.

Should I be doing this?

There has been plenty of debate, and campaigning, for free public data, and there are a host of websites that make excellent use of it for the public good (e.g. TheyWorkForYou.com, FixMyStreet). (Notes from Talk About Local ’09 on Open Data.) There is no moral or legal dilemma here – the data is public and has been made available for public use/manipulation.

However, what happens when the data is NOT public, and has not been released, but is located on a website in a “scrapeable” form?

There have been several examples of attempted lawsuits – where a host company claimed another firm was breaking the law by scraping their data, and potentially infringing on their business model. (USA Cvent v. Eventbrite)

This debate has also reared its head in the issue of newspaper paywalls – is it OK for content to be scraped over the wall for free? Search engines claim their actions actually drive business to the pay-for site; the bosses aren’t so sure.

Unfortunately it’s not as simple as companies keeping their valuable material under lock and key – often the data has a certain value in being visible online (even if that is to attract payment at a later time), whilst another value comes in being able to sell the downloaded data to a 3rd party.

If we then muscle in, with our scrapers and pipes at the ready, are we doing wrong?

Morally, many would say this comes down to a variety of issues, the so-called “victim”, the extent of the loss and what the intentions are with the data – e.g. financial, malicious, educational.

There is another issue. Often a website will deem the “reproduction, transfer, transmission or dissemination” of the data as an infringement of their copyright.

Is this claim worth anything or does it have as much power as a “breakages must be paid for” sign in a local shop? (thanks @pigsonthewing)

Question: Albums or Singles?

Guitars / No Guitars


Is it as simple as that?

Every genre that I have placed in ROCK so far has involved a guitar – that has been my “guide” for classification and has focussed the study.

I have also found a problem with Dance  / Urban due to the extensive crossover.

Perhaps I chart the path of the GUITAR in the charts, instead of a genre …

This would also then allow for modern artists like Pendulum and Prodigy (dance rock crossover) to sit within the charts.

So do I simply ask, does it have a guitar, or not?

Chart A: The Top 40 Album Charts 1999 – 2009  – Guitar or No Guitar?

Chart B: The Top 40 Album Charts 1999 – 2009  – by Genre
