
Finding Linksets among Linked Data Bubbles

timrdf edited this page Jan 7, 2013 · 18 revisions

What is first

What we will cover

This page describes how to calculate VoID Linksets between a csv2rdf4lod node and all other bubbles in the Linked Data Diagram, using csv2rdf4lod-automation's one-click data dump and lodcloud's "namespace" annotations. Calculating the Linksets makes it easier to see how a bubble is connected to others, which in turn makes it easier to assert the CKAN lodcloud annotation required to get into the diagram.

Let's get to it!

To find links, we need two things: the URIs mentioned in our own bubble's data dump, and the namespace of each other bubble.

We can get a bubble's namespace by POSTing its URI to a deployed instance of lift-ckan.py (e.g. the one used in the command below), which provides a good RDF description of the contorted annotations in the CKAN data entry.

curl -H "Content-Type: text/turtle" \
  -d '<http://datahub.io/dataset/2000-us-census-rdf> a <http://purl.org/twc/vocab/datafaqs#CKANDataset> .' \
    http://aquarius.tw.rpi.edu/projects/datafaqs/services/sadi/ckan/lift-ckan

The response includes the following RDF triples (among others). The one we need is void:uriSpace.

<http://datahub.io/dataset/2000-us-census-rdf> a datafaqs:CKANDataset;
    ov:shortName "US Census (rdfabout)";
    dcterms:title "2000 U.S. Census in RDF (rdfabout.com)";
    void:sparqlEndpoint <http://www.rdfabout.com/sparql>;
    void:triples 1002848918;
    void:uriSpace "http://www.rdfabout.com/rdf/usgov/geo/" .
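As a rough sketch of picking the void:uriSpace value out of such a response, the Python below uses only a regular expression from the standard library. It is an illustration, not what lift-ckan.py or csv2rdf4lod-automation actually does. The function name `uri_space_of` is hypothetical, and a real RDF parser would be more robust than pattern matching.

```python
import re

# Abbreviated copy of the response shown above (prefix declarations omitted).
RESPONSE = '''
<http://datahub.io/dataset/2000-us-census-rdf> a datafaqs:CKANDataset;
    ov:shortName "US Census (rdfabout)";
    void:sparqlEndpoint <http://www.rdfabout.com/sparql>;
    void:uriSpace "http://www.rdfabout.com/rdf/usgov/geo/" .
'''

def uri_space_of(turtle_text):
    """Return the first void:uriSpace literal found in the text, or None."""
    match = re.search(r'void:uriSpace\s+"([^"]*)"', turtle_text)
    return match.group(1) if match else None

print(uri_space_of(RESPONSE))
```

The regular expression assumes the namespace is written as a plain double-quoted literal, as it is in the response above.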

Modeling the Linkset

When 50 URIs occur in both http://datahub.io/dataset/twc-healthdata and http://datahub.io/dataset/2000-us-census-rdf, the connection is represented in VoID as a Linkset like this:

<http://datahub.io/dataset/twc-healthdata>
    void:subset [
        a void:Linkset;
        void:target 
          <http://datahub.io/dataset/twc-healthdata>, 
          <http://datahub.io/dataset/2000-us-census-rdf>;
        void:triples 50;
    ] .
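Once the count is known, writing out the Linkset is mechanical. The following is a minimal sketch, assuming the count has already been computed. The helper name `linkset_turtle` is hypothetical and is not part of csv2rdf4lod-automation. It simply mirrors the structure of the example above and leaves out the prefix declarations.

```python
def linkset_turtle(source, target, count):
    """Render a void:Linkset between two datasets, mirroring the example above."""
    return (
        f"<{source}>\n"
        f"    void:subset [\n"
        f"        a void:Linkset;\n"
        f"        void:target\n"
        f"          <{source}>,\n"
        f"          <{target}>;\n"
        f"        void:triples {count};\n"
        f"    ] .\n"
    )

print(linkset_turtle(
    "http://datahub.io/dataset/twc-healthdata",
    "http://datahub.io/dataset/2000-us-census-rdf",
    50,
))
```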

Limitations of this approach

This approach is cheaper than comparing full data dumps because we don't need to go through the hassle of finding and retrieving the dump of every bubble, and we have less instance data to process. However, it misses connections where another bubble mentions the same URIs that we do but those URIs fall outside that bubble's own namespace.
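The namespace-matching idea and its blind spot can be illustrated with a short sketch. The URIs in `our_uris` are made up for illustration, the function name `count_in_namespace` is hypothetical, and counting distinct URIs (rather than every occurrence) is an assumption, not something this page specifies.

```python
def count_in_namespace(uris, uri_space):
    """Count distinct URIs that start with the given void:uriSpace."""
    return sum(1 for uri in set(uris) if uri.startswith(uri_space))

# Namespace of the census bubble, taken from its void:uriSpace.
CENSUS_SPACE = "http://www.rdfabout.com/rdf/usgov/geo/"

# Hypothetical URIs mentioned in our bubble's data dump.
our_uris = [
    "http://www.rdfabout.com/rdf/usgov/geo/us/ny",
    "http://www.rdfabout.com/rdf/usgov/geo/us/ca",
    # Even if the census bubble also mentioned this URI, it lies outside
    # the census namespace, so prefix matching does not count it.
    "http://dbpedia.org/resource/New_York",
]

print(count_in_namespace(our_uris, CENSUS_SPACE))  # 2
```

Only the two URIs inside the census namespace are counted. The third would be a shared mention that this approach cannot see without the other bubble's full dump.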

What is next?

  • How hard is it to get one-click data dumps for bubbles that do not use csv2rdf4lod-automation?
  • What is the disparity between the manual assertion on the CKAN entry and what was actually found?
  • How can we model the Linkset calculation so that it naturally provides justification for the resulting CKAN annotation? (SIO-qualifying the void:triples triple and saying it prov:wasDerivedFrom the analysis that produced it. Tie into Jim's aggregation thesis?)
  • Some thoughts on How to characterize a list of RDF node URIs
  • CKAN lodcloud RDF vocabulary to use add-metadata.py to submit the Linksets to CKAN.
