In this blog post I talked about the potential of (Ordnance Survey) linked data. Partly motivated by this challenge, I decided to write up how I built the mashup of data.gov.uk data and Ordnance Survey linked data. This post is a slightly different take on a previous post.
For this mashup I used Python 2.7 and rdflib 3.0.0.
First off you need to install rdflib. Full instructions on doing this can be found here. If you use easy_install you can install rdflib by typing:
easy_install -U "rdflib>=3.0.0"
You will also need to install rdfextras (see here). This can also be done using easy_install.
You are now good to go. The next thing I needed was the BIS funding data. This can be downloaded here. The original BIS data gives the location of various organisations via a URI based on the organisation’s postcode. For example:
I edited the data to point to URIs for postcodes in the Ordnance Survey linked data (note these weren’t available when the BIS data was created). Now we have:
This triple basically states the location of the University of Wales in terms of its postcode.
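The edit itself amounts to swapping one URI for another. As a rough sketch, and assuming the Ordnance Survey postcode unit URIs follow the pattern `http://data.ordnancesurvey.co.uk/id/postcodeunit/` followed by the postcode in upper case with the spaces removed, a helper along these lines could build the new URI. The function name `os_postcode_uri` and the sample postcode are mine, for illustration only, and are not taken from the BIS data:

```python
# Assumed base for OS linked data postcode unit URIs.
OS_POSTCODE_BASE = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"

def os_postcode_uri(postcode):
    """Return the (assumed) OS linked data URI for a UK postcode string."""
    # Drop all whitespace and upper-case the result, e.g. "so16 0as" -> "SO160AS".
    compact = "".join(postcode.split()).upper()
    return OS_POSTCODE_BASE + compact

# Illustrative postcode only.
print(os_postcode_uri("so16 0as"))
```

The sketch is written so that it runs on both Python 2.7 and Python 3.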
So the edited RDF data now contains location information for research institutions in terms of a postcode URI, and it also contains information about the research projects worked on by those institutions and how much funding those projects received. Using rdflib it is very straightforward to load this data into Python and use it programmatically. Here’s how:
These first few lines load the necessary libraries and plugins:
import logging
# Configure how we want the rdflib logger to log messages
_logger = logging.getLogger("rdflib")
_logger.setLevel(logging.DEBUG)
_hdlr = logging.StreamHandler()
_hdlr.setFormatter(logging.Formatter('%(name)s %(levelname)s: %(message)s'))
_logger.addHandler(_hdlr)
import rdflib
from rdflib import Graph
from rdflib import URIRef, Literal, BNode, Namespace, ConjunctiveGraph
from rdflib import RDF
from rdflib import RDFS
# Register the rdfextras SPARQL processor and result plugins
rdflib.plugin.register('sparql', rdflib.query.Processor, 'rdfextras.sparql.processor', 'Processor')
rdflib.plugin.register('sparql', rdflib.query.Result, 'rdfextras.sparql.query', 'SPARQLQueryResult')
We now create a Graph in which to store the RDF:
store = Graph()
The data can be easily loaded from the web or hard drive. In this case I have the files stored locally:
Recall from here that I am interested in seeing which parties are funding in which local authority areas. The data as it stands will not let me do this. However, the OS postcode linked data provides information about the local authority areas that a postcode is contained in. All I now have to do is ‘follow my nose’ and load in the postcode data. I can do this by going through the triples containing links between organisations and postcodes via the location property. First I set up a few namespace bindings:
# Bind a few prefix, namespace pairs.
# Create a namespace object for the project and FOAF namespaces.
PROJECT = Namespace("http://research.data.gov.uk/def/project/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
I can now iterate over the triples in the store and find those whose subject is of type foaf:Organization, and which contain the location property. An example of such a triple would be the one we had above:
I can then look up the data behind the postcode URI and load this into the store. This is all done by the following code:
# For each foaf:Organization in the store get the postcode
for organization in store.subjects(RDF.type, FOAF["Organization"]):
    for postcode in store.objects(organization, PROJECT["location"]):
        try:
            # Dereference the postcode URI and load the RDF behind it
            store.parse(postcode)
        except Exception:
            print '404 not found'
Now the data in the store will contain a link from organisation to postcode, and a link from postcode to local authority area. We can therefore traverse the graph to find the link from organisation to local authority area, and use a simple SPARQL query to retrieve a list of projects giving the local authority areas the participating organisations are based in. The SPARQL query to do this is:
select distinct ?label ?districtlabel
where {
?organisation <http://research.data.gov.uk/def/project/project> ?project .
?project <http://www.w3.org/2000/01/rdf-schema#label> ?label .
?organisation <http://research.data.gov.uk/def/project/location> ?x .
?x <http://data.ordnancesurvey.co.uk/ontology/postcode/district> ?district .
?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtlabel . }
We can now add that into our Python code as follows and print out the query answers:
query = """select distinct ?label ?districtlabel \
where { \
?organisation <http://research.data.gov.uk/def/project/project> ?project . \
?project <http://www.w3.org/2000/01/rdf-schema#label> ?label . \
?organisation <http://research.data.gov.uk/def/project/location> ?x . \
?x <http://data.ordnancesurvey.co.uk/ontology/postcode/district> ?district . \
?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtlabel . }"""
answers = store.query(query).serialize('python')
for (label, districtlabel) in answers:
    print "%s was funded in %s" % (label, districtlabel)
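If you are less familiar with SPARQL, it may help to see the join that the query expresses written out by hand. The following is only a sketch of the idea, using plain Python tuples in place of an rdflib Graph. The `ex:` identifiers, the labels, and the helper names `objects` and `project_districts` are all made up for illustration and do not come from the real data:

```python
# Predicate URIs taken from the SPARQL query.
PROJECT_NS = "http://research.data.gov.uk/def/project/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DISTRICT = "http://data.ordnancesurvey.co.uk/ontology/postcode/district"

# Hypothetical triples standing in for the contents of the store.
triples = [
    ("ex:org1", PROJECT_NS + "project", "ex:proj1"),
    ("ex:proj1", RDFS_LABEL, "Example Project"),
    ("ex:org1", PROJECT_NS + "location", "ex:postcode1"),
    ("ex:postcode1", DISTRICT, "ex:district1"),
    ("ex:district1", RDFS_LABEL, "Example District"),
]

def objects(subject, predicate):
    """Return every object of the triples matching (subject, predicate, ?)."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]

def project_districts():
    """Pair project labels with district labels, as the query does."""
    results = set()  # a set gives the same effect as 'select distinct'
    for (org, pred, project) in triples:
        if pred != PROJECT_NS + "project":
            continue
        for label in objects(project, RDFS_LABEL):
            # Follow organisation -> postcode -> district -> district label.
            for postcode in objects(org, PROJECT_NS + "location"):
                for district in objects(postcode, DISTRICT):
                    for districtlabel in objects(district, RDFS_LABEL):
                        results.add((label, districtlabel))
    return sorted(results)

for (label, districtlabel) in project_districts():
    print("%s was funded in %s" % (label, districtlabel))
```

Each nested loop corresponds to one triple pattern in the query. In practice the SPARQL processor does this work for you.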
To summarise, this post shows how you just need rdflib and Python to build a simple linked data mashup – no separate triplestore is required! RDF is loaded into a Graph. Triples in this Graph reference postcode URIs. These URIs are de-referenced and the RDF behind them is loaded into the Graph. We have now enhanced the data in the Graph with local authority area information. So as well as knowing the postcode of the organisations taking part in certain projects we now also know which local authority area they are in. Job done! We can now analyse funding data at the level of postcode, local authority area and (as an exercise for the reader) European region.
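As a first step towards that kind of analysis, here is a small sketch of grouping the query answers by local authority area. It assumes the answers arrive as (project label, district label) pairs, as in the loop that prints them. The function name `projects_by_district` and the sample pairs are hypothetical. Note that the query above does not return funding amounts, so this only counts projects:

```python
from collections import defaultdict

def projects_by_district(answers):
    """Group (project label, district label) pairs by district label."""
    grouped = defaultdict(set)
    for (label, districtlabel) in answers:
        grouped[districtlabel].add(label)
    # Convert to a plain dict of sorted lists for stable output.
    return dict((district, sorted(labels)) for district, labels in grouped.items())

# Made-up answers, purely for illustration.
sample_answers = [
    ("Project A", "District X"),
    ("Project B", "District X"),
    ("Project A", "District Y"),
]

grouped = projects_by_district(sample_answers)
for district in sorted(grouped):
    print("%s: %d project(s)" % (district, len(grouped[district])))
```

To sum funding per area you would need to extend the SPARQL query to also select the funding value for each project.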
[Python note – WordPress keeps messing with my indentation and I’m too tired to fix. I hope that doesn’t detract from your enjoyment of this blog post :)]