Ermilov's wiki.publicdata.eu CSV2RDF Application

Tim L edited this page May 13, 2013 · 43 revisions

What is first

What we will cover

Ermilov et al. presented a wiki-based approach to crowd-sourcing the enhancements of ~9k datasets listed at http://publicdata.eu (WebSci 2012 paper).

A year after its publication, how far has the crowd-sourcing come?

This page provides a summary and review of Ermilov's wiki.publicdata.eu CSV2RDF Application.

Let's get to it

How many people contributed to the "crowd-source" enhancement?

Five account URIs appear in the provenance (Iermilov and IvanErmilov look like the same author), and the two non-author accounts, both bare IP addresses, provided fewer than ten contributions.

find manual/pages -name "*.ttl" | xargs -L1 grep "wasAttributedTo" | sort -u shows only a handful of contributors:

      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:178.25.43.32>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:2001:638:902:2010:0:168:35:101>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Iermilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:IvanErmilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Soeren>;
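
To see how the contributions are spread across those accounts, the same pipeline can count statements instead of deduplicating them. This is a minimal sketch, not part of the original analysis: the function name count_attributions is made up here, and it assumes that each prov:wasAttributedTo statement corresponds to one contribution, which this page does not confirm.

```shell
# Count prov:wasAttributedTo statements per contributor URI.
# Usage: count_attributions [dir]   (dir defaults to manual/pages)
# Assumption: one statement is treated as one contribution.
count_attributions() {
  dir="${1:-manual/pages}"
  find "$dir" -name "*.ttl" -exec grep -h "prov:wasAttributedTo" {} + \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*[;.][[:space:]]*$//' \
    | sort | uniq -c | sort -rn
}
```

Running count_attributions manual/pages prints one line per account, most frequent first, with the count in the first column.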

How many datasets are covered?

2,035 of the ~9,000 datasets are mentioned in their mapping wiki.

find manual/pages -name "*.ttl" | xargs grep -h -B1 "datafaqs:CKANDataset" | grep -v "^--" | grep -v datafaqs:CKANDataset | sort -u | wc -l verified by SPARQL query:

PREFIX datafaqs: <http://purl.org/twc/vocab/datafaqs#>
PREFIX dcat:     <http://www.w3.org/ns/dcat#>

SELECT ?dataset
WHERE {
  ?dataset a datafaqs:CKANDataset, dcat:Dataset .
}

Results (abridged):

<http://publicdata.eu/dataset/-municipal-waste-generation-in-england-from-2000-01-to-2009-10>
<http://publicdata.eu/dataset/01-bve-adressen-instellingen--ministerie-van-ocw>
<http://publicdata.eu/dataset/01-bve-crebo--beroepsopleidingen-ministerie-van-ocw-2010-2011-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-bve-deelnemers-per-instelling-en-type-mbo-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-ho-adressen-hbo-instellingen-en-universiteiten--ministerie-van-ocw>
<http://publicdata.eu/dataset/01-po-hoofdvestigingen-basisonderwijs-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-po-leerlingen-basisonderwijs-naar-gewicht-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-vo-adressen-hoofdvestigingen--ministerie-van-ocw>
...
<http://publicdata.eu/dataset/years-of-life-lost-due-to-suicide>
<http://publicdata.eu/dataset/young-first-time-offenders-borough>
<http://publicdata.eu/dataset/ypla_financial_transactions_december>
<http://publicdata.eu/dataset/ypla_financial_transactions_november>
<http://publicdata.eu/dataset/ypla_financial_transactions_october>
<http://publicdata.eu/dataset/zuzuge>
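
For a sense of scale, the 2,035 mapped datasets can be set against the ~9k listed datasets mentioned in the introduction. The helper below is a hypothetical sketch (coverage_percent is not an existing tool), and the denominator is only the approximate figure quoted on this page.

```shell
# Percentage of listed datasets that are mentioned in the mapping wiki.
# Usage: coverage_percent MAPPED TOTAL
coverage_percent() {
  awk -v mapped="$1" -v total="$2" \
    'BEGIN { printf "%.1f%%\n", 100 * mapped / total }'
}
```

With the figures above, coverage_percent 2035 9000 prints 22.6%, i.e. fewer than a quarter of the listed datasets have any mapping page.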

How many existing vocabulary terms did the crowd-sourced enhancement reuse?

Fifteen terms were reused from nine vocabularies for more than 9,000 datasets. We skip the three non-CURIEs listed below because it is not clear that they are RDF terms.
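
These counts can be checked mechanically. The following is a minimal sketch, assuming the conversion:label values arrive on stdin one per line in the quoted form shown in the listing below; the function name tally_labels is made up for illustration. It treats anything starting with http:// or https:// as a non-CURIE and takes the text before the first colon of every other value as the vocabulary prefix.

```shell
# Read quoted conversion:label values on stdin and print
# "<terms> <vocabularies> <non-CURIEs>".
tally_labels() {
  sed -e 's/^[[:space:]]*"//' -e 's/";*[[:space:]]*$//' \
    | awk '
        /^https?:\/\// { non_curie++; next }   # full URLs are skipped
        /^[A-Za-z][A-Za-z0-9_-]*:/ {
          terms++
          prefix = $0; sub(/:.*/, "", prefix)  # text before the first colon
          if (!(prefix in seen)) { seen[prefix] = 1; vocabs++ }
        }
        END { print terms + 0, vocabs + 0, non_curie + 0 }'
}
```

Appending | tally_labels to the pipeline below should print 15 9 3 for the listing on this page: fifteen terms, nine vocabularies, three skipped non-CURIEs.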

find manual/pages -name "*.xml.ttl" | xargs -L1 grep "conversion:label" | sed 's/conversion:label//' | grep : | sed 's/^ *"/"/' | grep -v " " | sort -u:

"cgov:fullTimeEquivalentSalary";
"cgov:lowerBound";
"cgov:upperBound";
"dce:date";
"foaf:mbox";
"foaf:name";
"foaf:phone";
"http://dbpedia.org/resource/Category:Ministerial_departments_of_the_United_Kingdom_Government";
"http://statistics.data.gov.uk/id/local-authority/32UC";
"http://www.google.co.uk";
"org:OrganizationalUnit";
"org:organization";
"org:unitOf";
"pc:supplier";
"rdf:type";
"rdfs:comment";
"skos:Amount";
"whois:Job";

Benefits

  • http://publicdata.eu aggregates from many other European-based CKAN instances.
  • Enables community-editable mappings using an existing mechanism (MediaWiki).
  • The main CKAN dataset listing site links to the mapping wiki.
  • User-invokable reconversion.

Shortcomings

Usability Shortcomings:

Linked Data Best Practices Shortcomings:

  • curl -H "Accept: application/rdf+xml" -L http://publicdata.eu/dataset/directgov-referring-sites returns a gzipped HTML file (appending .rdf works, though: http://publicdata.eu/dataset/directgov-referring-sites.rdf).
  • The mappings are NOT expressed in RDF; they are expressed only as MediaWiki template arguments (and as Sparqlify mappings behind the scenes, but those aren't available for public inspection). Although the intent is to make them easy for a novice to read and write, that does not mean that they shouldn't be lifted behind the scenes and made available as RDF for other systems to use.
  • The mappings are NOT described with RDF, since each is just a wiki page (Special:Export can be used, but it's not findable from the page itself using linked data principles). The mapping descriptions do NOT refer back to the dataset that they enhance [using RDF], and they do NOT refer to the resulting RDF conversion [using RDF].
  • The namespace used (http://wiki.publicdata.eu/ontology/) for the RDF properties 404s.
  • The site for the converter tool (http://sparqlify.org/wiki/Main_Page) 404s.
  • The RDF conversion dump files use the N-Triples serialization but have the extension .rdf, which is generally reserved for the application/rdf+xml serialization (e.g. http://csv2rdf.aksw.org/sparqlified/f449751c-68d3-4f84-8fe3-5c3a4cb86c84_default-tranformation-configuration.rdf). This confuses even the best-of-breed RDF serialization tools.
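
The extension mismatch in the last item can be spotted by looking at a downloaded file's content rather than its name. The function below is a rough heuristic sketch, not a validator and not something the reviewed tools provide: guess_rdf_serialization is a hypothetical name, it inspects only the first non-blank line, and a Turtle file that happens to start with a plain triple would also be reported as ntriples.

```shell
# Guess the RDF serialization of a file from its first non-blank line.
# Heuristic only: prints "rdfxml", "ntriples", or "unknown".
guess_rdf_serialization() {
  first=$(grep -v '^[[:space:]]*$' "$1" | head -n 1)
  case "$first" in
    '<?xml'*|'<rdf:RDF'*) echo rdfxml ;;    # XML declaration or root element
    '<'*' .'|'_:'*' .')   echo ntriples ;;  # a triple terminated by " ."
    *)                    echo unknown ;;
  esac
}
```

A file named like the .rdf dump above that makes this function print ntriples is a sign that the extension and the serialization disagree.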

Mapping Capabilities Shortcomings:

  • It can't specify a datatype for a cell's value like conversion:range does (e.g. "85" is an xsd:integer).
  • It can't "promote" a cell value to a URI like conversion:range does (e.g. "http://www.google.co.nz" becomes <http://www.google.co.nz>).
  • It can't type a URI to a given class like conversion:range_template/conversion:subclass_of do (e.g. <http://www.google.co.nz> is a sioc:Space).
  • Its property creation strategy (put everything into http://wiki.publicdata.eu/ontology/) is not conservative enough and fosters collisions. csv2rdf4lod uses hierarchical naming based on the publishing organization, the dataset, and the version of the dataset (the so-called "SDV naming"; see the "Conversion process phase: name" page) to avoid terminology collisions while facilitating natural and incremental dataset integration.
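
The collision concern in the last item can be made concrete with a small sketch. Both function names are hypothetical, and the path layout in scoped_property_uri is invented for illustration; it only mimics the idea of scoping by source, dataset, and version, and should not be read as csv2rdf4lod's actual URI design.

```shell
# Flat strategy: every column name lands in one shared namespace.
flat_property_uri() {
  printf 'http://wiki.publicdata.eu/ontology/%s\n' "$1"
}

# Hierarchical strategy (hypothetical layout, for illustration only):
# scope the property by source, dataset, and version.
# Usage: scoped_property_uri BASE SOURCE DATASET VERSION COLUMN
scoped_property_uri() {
  printf '%s/source/%s/dataset/%s/version/%s/vocab/%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}
```

Under the flat strategy, a "name" column from any two datasets yields the same property URI, whether or not the columns mean the same thing. Under the scoped strategy, the two columns get distinct URIs that can be aligned later once they are known to agree.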

Provenance and Metadata Shortcomings:

  • (to be enumerated)
  • Can we trust the aggregation that http://publicdata.eu does from the many other European-based CKAN instances? Or, do we have to redo it to get better results?
