
Dataset datahub io lodcloud group

Tim L edited this page Jul 5, 2014 · 43 revisions

(supporting Survey 3 2014 Jul 04)

All queries below run against http://lodcloud.tw.rpi.edu/sparql

http://datahub.io/group/lodcloud reports either 283 or 214 datasets, depending on whether you look on the left or in the middle of the page...

In our copy, 294 are typed as datafaqs:CKANDataset and 240 are typed as void:Dataset.

prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>
prefix tag:      <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

select (count(distinct ?dataset) as ?count)
where {
   graph <http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04> {
      ?dataset                    # <http://thedatahub.org/dataset/DBpedia>
         a datafaqs:CKANDataset
   }
}
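
Only the datafaqs:CKANDataset count is shown above; the 240 figure for void:Dataset presumably comes from the same query with the class swapped. A minimal shell sketch, assuming that is the case: `count_query` and `run_query` are hypothetical helper names, the query uses the standard SPARQL 1.1 `(count(...) as ?n)` form, and `run_query` assumes the endpoint is still reachable and accepts the standard `query` parameter over HTTP GET.

```shell
#!/bin/sh
# Named graph and endpoint, copied from the queries on this page.
GRAPH='http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04'
ENDPOINT='http://lodcloud.tw.rpi.edu/sparql'

# count_query CLASS: print a query counting distinct resources typed as CLASS.
# (hypothetical helper, not part of the original page)
count_query() {
    cat <<EOF
prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>

select (count(distinct ?dataset) as ?n)
where {
   graph <$GRAPH> {
      ?dataset a $1
   }
}
EOF
}

# run_query QUERY: send QUERY to the endpoint and ask for CSV results.
# Assumes network access; it is not called by default.
run_query() {
    curl -s -G "$ENDPOINT" \
         -H 'Accept: text/csv' \
         --data-urlencode "query=$1"
}

ckan_query=$(count_query 'datafaqs:CKANDataset')
void_query=$(count_query 'void:Dataset')

# Show the generated void:Dataset query.
printf '%s\n' "$void_query"
# run_query "$void_query"   # uncomment to execute against the endpoint
```

Generating both queries from one template keeps the graph URI in a single place, so the two counts are guaranteed to be taken over the same snapshot.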

Only 49 datasets are returned when we also require their void:Linkset connectivity...

prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>
prefix tag:      <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

select (count(distinct ?dataset) as ?count)
where {
   graph <http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04> {
      ?dataset                    # <http://thedatahub.org/dataset/DBpedia>
         a datafaqs:CKANDataset;
         void:subset ?linkset 
      .
      optional{ ?dataset tag:taggedWithTag ?tag }
      optional{ ?dataset void:triples      ?triples }

      ?linkset                    # <http://instances.tw.rpi.edu/id/linkset/DBpedia/e977476546bf11f68176d67246280e63>
         void:target  ?target;    # <http://thedatahub.org/dataset/aemet>
         void:triples ?overlap    # 82
      .
      ?target a datafaqs:CKANDataset .
      filter(?dataset != ?target)
   }
}

Relax it by making the linkset pattern optional, and we get 294 again (with the void:Linkset included when it's there...)

prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>
prefix tag:      <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

select (count(distinct ?dataset) as ?count)
where {
   graph <http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04> {
      ?dataset                    # <http://thedatahub.org/dataset/DBpedia>
         a datafaqs:CKANDataset
      .
      optional{ ?dataset tag:taggedWithTag ?tag }
      optional{ ?dataset void:triples      ?triples }

      optional {
         ?dataset void:subset ?linkset .
         ?linkset                 # <http://instances.tw.rpi.edu/id/linkset/DBpedia/e977476546bf11f68176d67246280e63>
            void:target  ?target; # <http://thedatahub.org/dataset/aemet>
            void:triples ?overlap # 82
         .
         ?target a datafaqs:CKANDataset .
         filter(?dataset != ?target)
      }
   }
}

Show the datasets, how big they are, and how much they overlap with one another. Order by overlap, descending, so that those without any overlap appear at the bottom.

prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>
prefix tag:      <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

select distinct ?dataset ?triples ?overlap ?target
where {
   graph <http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04> {
      ?dataset                    # <http://thedatahub.org/dataset/DBpedia>
         a datafaqs:CKANDataset
      .
      optional{ ?dataset tag:taggedWithTag ?tag }
      optional{ ?dataset void:triples      ?triples }

      optional {
         ?dataset void:subset ?linkset .
         ?linkset                 # <http://instances.tw.rpi.edu/id/linkset/DBpedia/e977476546bf11f68176d67246280e63>
            void:target  ?target; # <http://thedatahub.org/dataset/aemet>
            void:triples ?overlap # 82
         .
         ?target a datafaqs:CKANDataset .
         filter(?dataset != ?target)
      }
   }
}
order by desc(?overlap) desc(?triples) 

Skimming down the tail of that list (the datasets with no overlaps), the first few do not actually claim any overlaps, so their absence is correct. The first one that does claim overlaps, yet shows none here, is http://datahub.io/dataset/ub-mannheim-linked-data. http://thedatahub.org/dataset/taxonconcept does, too.

A major source of the problem is probably that the domain names used for datahub are inconsistent (http://datahub.io vs. http://thedatahub.org), so a linkset target under one domain never matches a dataset typed under the other.

prefix void:     <http://rdfs.org/ns/void#>
prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#>
prefix tag:      <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

select distinct ?n
where {
   graph <http://purl.org/twc/lodcloud/source/datahub-io/dataset/lodcloud-group/version/2014-07-04> {
     {?n ?p []} union 
     {[] ?p ?n}
     filter(regex(str(?n),'ub-mannheim-linked-data'))
   }
}
order by ?n

Ugh, let's hack it:

lodcloud@lodcloud:~/prizms/lodcloud/data/source/datahub-io/lodcloud-group/version/2014-07-04$ rdf2nt.sh source/* > manual/sources.nt

perl -pi -e 's|http://thedatahub.org|http://datahub.io|g' manual/sources.nt 

cr-publish.sh manual/sources.nt
