Tackling DBPedia: learning to SPARQL

Following on from my last post (here) I have been having a bit of a play with DBPedia.

ADD: (and I actually forgot to say where I was writing this – thanks to @paulbradshaw for flagging that up)

I have been using Snorql.

My Challenge

To create a definitive list of all musicians and bands, along with their genre

Stage 1

I thought I had it with these two queries:

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/MusicalArtist>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}

However, they only produced a limited number of artists. I initially thought it was some error on DBPedia’s part – I know there is a LOT of work going on behind the scenes in getting the linked data sorted out.

However, thanks to a Twitter conversation with PkLeef, I realised it was – inevitably – my lack of knowledge of the area.

A query of DBPedia only returns 2000 entries, or solutions, so I would need to filter the entries and carry out more queries to get the complete list.

I considered several options, such as searching by genre, but there are so many small sub-genres catered for on Wikipedia that it would take far too long.

Instead I decided upon querying bands by letter – so gather all the A’s, B’s and C’s then place them into one list.

I also suspect there is a way to gather A’s and B’s, C’s and D’s etc. in a single query.

The first problem, however, is how to FILTER a query.

I took my original query above and – in a highly illogical way, stealing bits of code from various SPARQL tutorials – came up with this:

SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
FILTER(regex(?subject, "^A"))
} ORDER BY ?subject LIMIT 1000

Which didn’t return any solutions.

Do I need to apply the FILTER to something other than the subject? Any advice much appreciated.

Round 2 commences tomorrow.

  1. There is an error in your regex:

    try:

    FILTER(regex(?subject, "A.*"))

    reference: http://www.regular-expressions.info/reference.html

    Email me if you need more help – I'm inclined to think that a scraper would be a better solution.

    And look into Google Refine: http://code.google.com/p/google-refine/ It can reconcile data against DBpedia: http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/

    • carolinebeavon
    • June 8th, 2011

    Hey – thanks for the comment.

    I’ll give that alteration to the regex a go!

    I’ve been told about Google Refine … It’s now definitely on my list of resources, the next time I’m in MA mode.

    Thanks again – I’ll be in touch if I need more help!

  2. Your filter isn’t working because ?subject is a URI, so it always starts with “http://”, not an artist name. But you don’t need to filter at all, just use ORDER BY, LIMIT and OFFSET, increasing OFFSET by 1000 with each iteration. So first you’d do:

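    SELECT ?subject ?genre WHERE {
    ?subject rdf:type <http://dbpedia.org/ontology/Band>.
    ?subject <http://dbpedia.org/ontology/genre> ?genre.
    } ORDER BY ?subject LIMIT 1000

    then

    SELECT ?subject ?genre WHERE {
    ?subject rdf:type <http://dbpedia.org/ontology/Band>.
    ?subject <http://dbpedia.org/ontology/genre> ?genre.
    } ORDER BY ?subject LIMIT 1000 OFFSET 1000

    then

    SELECT ?subject ?genre WHERE {
    ?subject rdf:type <http://dbpedia.org/ontology/Band>.
    ?subject <http://dbpedia.org/ontology/genre> ?genre.
    } ORDER BY ?subject LIMIT 1000 OFFSET 2000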

    etc.
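
    If splitting the list by letter is still preferred, the following is a minimal sketch of a filter that should be able to match. It assumes the band URIs follow DBpedia's usual http://dbpedia.org/resource/Name pattern, and it wraps ?subject in str() so that regex() receives a plain string rather than a URI. The rdf: prefix is declared explicitly here; the queries above rely on Snorql supplying it.

    ```sparql
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT ?subject ?genre WHERE {
      ?subject rdf:type <http://dbpedia.org/ontology/Band> .
      ?subject <http://dbpedia.org/ontology/genre> ?genre .
      # str() converts the URI to a string; regex() cannot be applied to a bare URI.
      # Assumption: resource URIs begin with http://dbpedia.org/resource/
      FILTER(regex(str(?subject), "^http://dbpedia.org/resource/A"))
    }
    ORDER BY ?subject
    LIMIT 1000
    ```

    Swapping the final A for a character class such as [AB] would cover two letters in one query. A band with several genres comes back as several rows, so the row limit counts band/genre pairs rather than bands.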

      • carolinebeavon
      • June 9th, 2011

      Hey Glenn …

      Thanks for this … I’ll have a go at that … I knew I was doing something wrong, but as I was cobbling bits together, I wasn’t quite sure WHAT I was doing …

      Caroline
