Tackling DBPedia: learning to SPARQL
Following on from my last post (here) I have been having a bit of a play with DBPedia.
ADD: (and I actually forgot to say where I was writing this – thanks to @paulbradshaw for flagging that up)
I have been using Snorql
My Challenge
To create a definitive list of all musicians and bands, along with their genre
Stage 1
I thought I had it with these 2 Queries:
SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/MusicalArtist>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}
SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
}
However, they only produced a limited number of artists. I initially thought it was some error on DBPedia’s part – I know there is a LOT of work going on behind the scenes in getting the linked data sorted out.
However, thanks to a Twitter conversation with PkLeef, I realised it was – inevitably – my lack of knowledge of the area.
A query of DBPedia only returns 2000 entries, or solutions, so I would need to filter the entries and carry out more queries to get the complete list.
I considered several options, such as searching by genre, but there are so many small sub-genres catered for on Wikipedia that it would take far too long.
Instead I decided upon querying bands by letter – so gather all the A’s, B’s and C’s then place them into one list.
I also suspect there is a way to gather A’s and B’s, C’s and D’s etc
First problem, however – is how to FILTER a query.
I took my original query above and – in a highly illogical way, stealing bits of code from various SPARQL tutorials – I came up with this:
SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
FILTER(regex(?subject, "^A"))
} ORDER BY ?subject LIMIT 1000
Which didn’t return any solutions.
Do I need to apply the FILTER to something other than the subject? Any advice much appreciated
Round 2 commences tomorrow.
There is an error in your regex:
try:
FILTER(regex(?subject, "A.*"))
reference: http://www.regular-expressions.info/reference.html
Email me if you need more help – I’m inclined to think that a scraper would be a better solution.
And look into Google Refine: http://code.google.com/p/google-refine/ – it is able to reconcile data against DBPedia: http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/
Hey – thanks for the comment.
I’ll give that alteration to the regex a go!
I’ve been told about Google Refine … It’s now definitely on my list of resources, the next time I’m in MA mode.
Thanks again – I’ll be in touch if I need more help!
Your filter isn’t working because ?subject is a URI, so it always starts with “http://”, not an artist name. But you don’t need to filter at all, just use ORDER BY, LIMIT and OFFSET, increasing OFFSET by 1000 with each iteration. So first you’d do:
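SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
} ORDER BY ?subject LIMIT 1000
then
SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
} ORDER BY ?subject LIMIT 1000 OFFSET 1000
then
SELECT ?subject ?genre WHERE {
?subject rdf:type <http://dbpedia.org/ontology/Band>.
?subject <http://dbpedia.org/ontology/genre> ?genre.
} ORDER BY ?subject LIMIT 1000 OFFSET 2000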
etc.
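If typing each page out by hand gets tedious, the same sequence of queries could be generated with a few lines of Python. This is only a rough sketch: the names QUERY_TEMPLATE, BAND and paged_queries are made up for illustration, and it assumes the rdf: prefix is already defined by the endpoint (as it is in Snorql). It only builds the query strings and does not send anything to DBPedia.

```python
# Hypothetical helper: builds the paged queries as plain strings.
# Assumes the endpoint predefines the rdf: prefix (Snorql does).
QUERY_TEMPLATE = """SELECT ?subject ?genre WHERE {{
  ?subject rdf:type <{rdf_type}>.
  ?subject <http://dbpedia.org/ontology/genre> ?genre.
}} ORDER BY ?subject LIMIT {limit} OFFSET {offset}"""

BAND = "http://dbpedia.org/ontology/Band"


def paged_queries(rdf_type, page_size=1000, pages=3):
    """Yield one query string per page, stepping OFFSET by page_size."""
    for page in range(pages):
        yield QUERY_TEMPLATE.format(
            rdf_type=rdf_type,
            limit=page_size,
            offset=page * page_size,
        )


if __name__ == "__main__":
    # Print the first three pages; the first uses OFFSET 0,
    # which is equivalent to leaving OFFSET out.
    for query in paged_queries(BAND):
        print(query)
        print()
```

In practice you would keep increasing the offset until a page comes back with fewer rows than the page size, rather than fixing the number of pages in advance. The ORDER BY matters here: without a stable ordering, consecutive pages are not guaranteed to line up.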
Hey Glenn …
Thanks for this … I’ll have a go at that … I knew I was doing something wrong, but as I was cobbling bits together, I wasn’t quite sure WHAT I was doing …
Caroline