
Dataset datahub io lodcloud group

Tim L edited this page Jul 6, 2014 · 43 revisions

(supporting Survey 3 2014 Jul 04)

Against http://lodcloud.tw.rpi.edu/sparql

version/2014-07-04

http://datahub.io/group/lodcloud says 283 or 214 datasets, depending on whether you look at the left of the page or the middle...

294 typed as datafaqs:CKANDataset, 240 typed as void:Dataset

prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>
prefix tag:      <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

select (count(distinct ?dataset) as ?count)
where {
   graph <http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04> {
      ?dataset                    # <http://thedatahub.org/dataset/DBpedia>
         a datafaqs:CKANDataset
   }
}
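
The void:Dataset count quoted above has no query shown; presumably it is the same query with the class swapped. A minimal sketch of a helper that prints the counting query for any class, written in the SPARQL 1.1 form (count(...) as ?count). The function name count_query and the GRAPH variable are made up for illustration, and the commented curl call assumes the endpoint is still reachable and can return CSV.

```shell
# Named graph used throughout this page (version 2014-07-04).
GRAPH='http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04'

# Print a SPARQL query that counts the distinct subjects typed as class $1.
# count_query is a hypothetical helper, not part of DataFAQs or Prizms.
count_query() {
  cat <<EOF
prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>

select (count(distinct ?dataset) as ?count)
where {
   graph <$GRAPH> {
      ?dataset a $1
   }
}
EOF
}

# Usage (needs network access; assumes the endpoint is up and serves text/csv):
# curl -s -G -H 'Accept: text/csv' \
#      --data-urlencode "query=$(count_query void:Dataset)" \
#      http://lodcloud.tw.rpi.edu/sparql
```

Calling count_query datafaqs:CKANDataset prints a query equivalent to the one above, so both counts can come from one template.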

Only 49 datasets return when we ask for their void:Linkset connectivity...

prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>
prefix tag:      <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

select (count(distinct ?dataset) as ?count)
where {
   graph <http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04> {
      ?dataset                    # <http://thedatahub.org/dataset/DBpedia>
         a datafaqs:CKANDataset;
         void:subset ?linkset 
      .
      optional{ ?dataset tag:taggedWithTag ?tag }
      optional{ ?dataset void:triples      ?triples }

      ?linkset                    # <http://instances.tw.rpi.edu/id/linkset/DBpedia/e977476546bf11f68176d67246280e63>
         void:target  ?target;    # <http://thedatahub.org/dataset/aemet>
         void:triples ?overlap    # 82
      .
      ?target a datafaqs:CKANDataset .
      filter(?dataset != ?target)
   }
}

Relax it, and we get 294 again (with the void:Linkset when it's there...)

prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>
prefix tag:      <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

select (count(distinct ?dataset) as ?count)
where {
   graph <http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04> {
      ?dataset                    # <http://thedatahub.org/dataset/DBpedia>
         a datafaqs:CKANDataset
      .
      optional{ ?dataset tag:taggedWithTag ?tag }
      optional{ ?dataset void:triples      ?triples }

      optional {
         ?dataset void:subset ?linkset .
         ?linkset                 # <http://instances.tw.rpi.edu/id/linkset/DBpedia/e977476546bf11f68176d67246280e63>
            void:target  ?target; # <http://thedatahub.org/dataset/aemet>
            void:triples ?overlap # 82
         .
         ?target a datafaqs:CKANDataset .
         filter(?dataset != ?target)
      }
   }
}

Show the datasets, how big they are, and how much they overlap with another dataset (here). Order by overlap, so that those without an overlap appear at the bottom.

prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>
prefix tag:      <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

select ?dataset ?triples ?overlap ?target
where {
   graph <http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04> {
      ?dataset                    # <http://thedatahub.org/dataset/DBpedia>
         a datafaqs:CKANDataset
      .
      optional{ ?dataset tag:taggedWithTag ?tag }
      optional{ ?dataset void:triples      ?triples }

      optional {
         ?dataset void:subset ?linkset .
         ?linkset                 # <http://instances.tw.rpi.edu/id/linkset/DBpedia/e977476546bf11f68176d67246280e63>
            void:target  ?target; # <http://thedatahub.org/dataset/aemet>
            void:triples ?overlap # 82
         .
         ?target a datafaqs:CKANDataset .
         filter(?dataset != ?target)
      }
   }
}
order by desc(?overlap) desc(?triples) 

Skimming down the rows of that list that have no overlap, the first few datasets do not actually claim overlaps. The first one that does is http://datahub.io/dataset/ub-mannheim-linked-data. http://thedatahub.org/dataset/taxonconcept does, too. SO DOES http://datahub.io/dataset/dbpedia :-/

Inconsistent domain names for datahub

A major source of the problem is probably that the domain names for datahub are inconsistent (http://datahub.io vs. http://thedatahub.org).

prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>
prefix tag:      <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

select ?n
where {
   graph <http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04> {
     {?n ?p []} union 
     {[] ?p ?n}
     filter(regex(str(?n),'ub-mannheim-linked-data'))
   }
}
order by ?n
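
To see how the two domain names split, the URIs can also be tallied on the command line. A minimal sketch, assuming grep with -o support (GNU and BSD grep both have it); count_datahub_hosts is a made-up name. It reads any text on stdin, for example the results of the query above or an N-Triples dump of the graph.

```shell
# Tally how often each datahub domain appears in the text on stdin.
# Prints one line per domain: "<scheme://host> <count>".
# count_datahub_hosts is a hypothetical helper name.
count_datahub_hosts() {
  grep -o -E 'http://(thedatahub\.org|datahub\.io)' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2, $1 }'   # drop uniq's padding, put the domain first
}

# Usage (file name is only an example):
# count_datahub_hosts < results.csv
```

If both domains show up with non-zero counts, the same dataset can appear under two different URIs, which would explain joins that silently fail.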

Ugh, let's hack it:

lodcloud@lodcloud:~/prizms/lodcloud/data/source/datahub-io/lodcloud-group/version/2014-07-04$ rdf2nt.sh source/* > manual/sources.nt

perl -pi -e 's|http://thedatahub\.org|http://datahub.io|g' manual/sources.nt

cr-publish.sh manual/sources.nt
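
A quick sanity check on the hack: preview the substitution without editing the file, and confirm that no old-domain URIs remain afterwards. This is a sketch under the assumption that POSIX grep and sed are available; the function names rewrite_domain and count_old_domain are invented here and are not Prizms tools.

```shell
# Rewrite the old datahub domain to the new one, stdin to stdout.
# Non-destructive counterpart of the in-place perl one-liner.
rewrite_domain() {
  sed 's|http://thedatahub\.org|http://datahub.io|g'
}

# Print the number of lines in file $1 that still mention the old domain.
count_old_domain() {
  # grep -c prints 0 but exits non-zero when nothing matches;
  # '|| true' keeps the function usable under 'set -e'.
  grep -c -F 'http://thedatahub.org' "$1" || true
}

# Usage:
# rewrite_domain < manual/sources.nt | head   # preview the rewrite
# count_old_domain manual/sources.nt          # expect 0 after the hack
```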

Still missing the void:Linksets

The hack above consolidates all of the domains, and all of the queries above return the same results as before. But the overlaps are still missing...

We need to walk through the processing, since we know that the source further upstream has the information. Reviewing FAqT Brick and the directory convention diagram that it offers, we can see what DataFAQs accumulated about http://datahub.io/dataset/dbpedia within the FAqT Brick /home/lodcloud/prizms/lodcloud/data/source/datahub-io/lodcloud-group/version/faqt-brick by looking at:

ls __PIVOT_epoch/2014-07-04/__PIVOT_dataset/thedatahub.org/dataset/dbpedia | less
augmentation-1.rdf
dataset.ttl
get-augmentation-1.sh
get-reference-0.sh
get-reference-1.sh
get-references-1.sh
post.meta.ttl
post.nt
post.nt.rdf
post.nt.sd_name
post.nt.ttl
reference-0.rdf
reference-1.rdf
references-1.ttl
references.csv

reference-0.rdf, fetched with curl -s -L -H "Accept: application/rdf+xml, text/rdf;q=0.6, */*;q=0.1" http://thedatahub.org/dataset/dbpedia, returns the uninterpreted tag, not the VoID:

        <dct:relation>
          <rdf:Description>
            <rdfs:label>links:2000-us-census-rdf</rdfs:label>
            <rdf:value>12529</rdf:value>
          </rdf:Description>
        </dct:relation>

reference-1.rdf, fetched from the VoID file that the dataset lists (curl -s -L -H "Accept: application/rdf+xml, text/rdf;q=0.6, */*;q=0.1" http://dbpedia.org/void/Dataset), does not mention '2000-us-census-rdf' (one of the datasets that DBpedia overlaps with).

references-1.ttl, returned by the SADI service (curl -s -H 'Content-Type: text/turtle' -d @dataset.ttl http://aquarius.tw.rpi.edu/projects/datafaqs/services/sadi/core/augment-datasets/with-preferred-uri-and-ckan-meta-void), just points us to their VoID data file (which we just looked at):

@prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://thedatahub.org/dataset/dbpedia> a datafaqs:WithReferences;
    rdfs:seeAlso <http://dbpedia.org/void/Dataset> .

version/2014-07-06

After updating some of the FAqT services to handle the datahub domain name switch (thedatahub.org -> datahub.io), epoch 2014-07-06 now has the VoID Linkset. It appears in __PIVOT_epoch/2014-07-06/__PIVOT_dataset/datahub.io/dataset/dbpedia/augmentation-1.rdf, which comes from curl -s -H 'Content-Type: application/rdf+xml' -d @post.nt.rdf http://lodcloud.tw.rpi.edu/sadi-services/lift-ckan:

        <void:subset>
          <void:Linkset rdf:about="http://instances.tw.rpi.edu/id/linkset/2000-us-census-rdf/62a36c3141e989e1327191d9ddf0ffc2">
            <void:target rdf:resource="http://datahub.io/dataset/dbpedia"/>
            <void:target>
              <datafaqs:CKANDataset rdf:about="http://datahub.io/dataset/2000-us-census-rdf"/>
            </void:target>
            <void:triples rdf:datatype="http://www.w3.org/2001/XMLSchema#long"
            >12529</void:triples>
          </void:Linkset>
        </void:subset>
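
The Linkset above is the interpreted form of the links:2000-us-census-rdf label seen in reference-0.rdf. To illustrate only the naming relationship between the two (not how the lift-ckan service is implemented, which this page does not show), here is a small sketch that maps such a label to the dataset URI it refers to. The function name link_target_uri is hypothetical, and the base URI is an assumption taken from the output above.

```shell
# Map a "links:<name>" relation label to the datahub dataset URI it names.
# Returns non-zero for anything that is not a links: label.
# Illustration only; base URI assumed from the 2014-07-06 output.
link_target_uri() {
  case "$1" in
    links:?*) printf 'http://datahub.io/dataset/%s\n' "${1#links:}" ;;
    *)        return 1 ;;
  esac
}

# Usage:
# link_target_uri 'links:2000-us-census-rdf'
```

The rdf:value that accompanies the label (12529 here) is what shows up as void:triples on the Linkset.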

Redoing queries above

Rerunning the queries from 2014-07-04 against the new version, 2014-07-06, which has better VoID connectivity descriptions.

343 (was 294) typed as datafaqs:CKANDataset, 223 (was 240) typed as void:Dataset

Only 239 (was 49) datasets return when we ask for their void:Linkset connectivity...

Relax it, and we get 343 (was 294) again (with the void:Linkset when it's there...)

Show the datasets, how big they are, and how much they overlap with another dataset (here). Order by overlap, so that those without an overlap appear at the bottom. No datasets without overlaps, so we're good!

What is next
