Ermilov's wiki.publicdata.eu CSV2RDF Application

What is first

Notes within the alternate converters list

What we will cover

Ermilov et al. presented a wiki-based approach to crowd-sourcing the enhancements of ~9k datasets listed at http://publicdata.eu (WebSci 2012 paper).

A year after its publication, how far has the crowd-sourcing come?

This page summarizes and reviews Ermilov's wiki.publicdata.eu CSV2RDF Application.

Let's get to it

How many people contributed to the "crowd-source" enhancement?

Four accounts contributed, and the two non-author accounts provided fewer than ten contributions.

find manual/pages -name "*.ttl" | xargs -L1 grep "wasAttributedTo" | sort -u shows only a handful of contributors:

      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:178.25.43.32>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:2001:638:902:2010:0:168:35:101>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Iermilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:IvanErmilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Soeren>;
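
The same tally can be sketched in Python for anyone who prefers to count contributions per account rather than only list them. This is a minimal sketch: the helper names `attributed_accounts` and `count_in_tree` are hypothetical, and the inline sample is simply the five attribution lines shown above.

```python
import re
from collections import Counter
from pathlib import Path

# Matches the object IRI of a prov:wasAttributedTo statement in Turtle text.
ATTRIBUTION = re.compile(r"prov:wasAttributedTo\s+<([^>]+)>")

def attributed_accounts(turtle_text):
    """Count prov:wasAttributedTo occurrences per account IRI."""
    return Counter(ATTRIBUTION.findall(turtle_text))

def count_in_tree(root):
    """Sum the per-account counts over every .ttl file below root (hypothetical helper)."""
    total = Counter()
    for path in Path(root).rglob("*.ttl"):
        total += attributed_accounts(path.read_text(encoding="utf-8"))
    return total

# Sample input: the attribution lines listed above.
sample = """
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:178.25.43.32>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:2001:638:902:2010:0:168:35:101>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Iermilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:IvanErmilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Soeren>;
"""

accounts = attributed_accounts(sample)
for iri in sorted(accounts):
    # Print the account name (the part after the last slash) and its count.
    print(iri.rsplit("/", 1)[-1], accounts[iri])
```

Calling `count_in_tree("manual/pages")` would apply the same count to the directory used in the shell command, assuming that directory exists locally.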

How many datasets are covered?

2,035 of the 19,000 datasets are mentioned in their mapping wiki.

PREFIX datafaqs: <http://purl.org/twc/vocab/datafaqs#>
PREFIX dcat:     <http://www.w3.org/ns/dcat#>

SELECT ?dataset
WHERE {
  ?dataset a datafaqs:CKANDataset, dcat:Dataset .
}

<http://publicdata.eu/dataset/-municipal-waste-generation-in-england-from-2000-01-to-2009-10>
<http://publicdata.eu/dataset/01-bve-adressen-instellingen--ministerie-van-ocw>
<http://publicdata.eu/dataset/01-bve-crebo--beroepsopleidingen-ministerie-van-ocw-2010-2011-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-bve-deelnemers-per-instelling-en-type-mbo-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-ho-adressen-hbo-instellingen-en-universiteiten--ministerie-van-ocw>
<http://publicdata.eu/dataset/01-po-hoofdvestigingen-basisonderwijs-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-po-leerlingen-basisonderwijs-naar-gewicht-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-vo-adressen-hoofdvestigingen--ministerie-van-ocw>
...
<http://publicdata.eu/dataset/years-of-life-lost-due-to-suicide>
<http://publicdata.eu/dataset/young-first-time-offenders-borough>
<http://publicdata.eu/dataset/ypla_financial_transactions_december>
<http://publicdata.eu/dataset/ypla_financial_transactions_november>
<http://publicdata.eu/dataset/ypla_financial_transactions_october>
<http://publicdata.eu/dataset/zuzuge>

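To put the two figures quoted in this section in proportion, the coverage can be computed directly. This is only a quick arithmetic sketch; `coverage` is a hypothetical helper, and the inputs are the 2,035 and 19,000 stated above.

```python
def coverage(mapped, listed):
    """Return the fraction of listed datasets that appear in the mapping wiki."""
    if listed <= 0:
        raise ValueError("listed must be positive")
    return mapped / listed

# Figures quoted above: 2,035 datasets mentioned out of 19,000 listed.
share = coverage(2035, 19000)
print(f"{share:.1%}")  # prints: 10.7%
```
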
How many existing vocabulary terms did the crowd-sourced enhancement reuse?

Fifteen terms were reused from nine vocabularies for more than 9,000 datasets. We skip the three non-CURIEs listed below because it is not clear that they are RDF terms.

"cgov:fullTimeEquivalentSalary";
"cgov:lowerBound";
"cgov:upperBound";
"dce:date";
"foaf:mbox";
"foaf:name";
"foaf:phone";
"http://dbpedia.org/resource/Category:Ministerial_departments_of_the_United_Kingdom_Government";
"http://statistics.data.gov.uk/id/local-authority/32UC";
"http://www.google.co.uk";
"org:OrganizationalUnit";
"org:organization";
"org:unitOf";
"pc:supplier";
"rdf:type";
"rdfs:comment";
"skos:Amount";
"whois:Job";

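The counts above can be reproduced from the listed values with a short script. This is a minimal sketch, assuming that anything starting with `http://` or `https://` is a non-CURIE to skip and that everything else has the form `prefix:local`; `split_terms` is a hypothetical helper name.

```python
from collections import defaultdict

# The eighteen values listed above, without the Turtle punctuation.
values = [
    "cgov:fullTimeEquivalentSalary", "cgov:lowerBound", "cgov:upperBound",
    "dce:date", "foaf:mbox", "foaf:name", "foaf:phone",
    "http://dbpedia.org/resource/Category:Ministerial_departments_of_the_United_Kingdom_Government",
    "http://statistics.data.gov.uk/id/local-authority/32UC",
    "http://www.google.co.uk",
    "org:OrganizationalUnit", "org:organization", "org:unitOf",
    "pc:supplier", "rdf:type", "rdfs:comment", "skos:Amount", "whois:Job",
]

def split_terms(values):
    """Group CURIEs by prefix and set aside absolute http(s) IRIs."""
    by_prefix = defaultdict(list)
    skipped = []
    for value in values:
        if value.startswith(("http://", "https://")):
            skipped.append(value)
            continue
        prefix, _, local = value.partition(":")
        by_prefix[prefix].append(local)
    return dict(by_prefix), skipped

by_prefix, skipped = split_terms(values)
term_count = sum(len(terms) for terms in by_prefix.values())
print(term_count, len(by_prefix), len(skipped))  # prints: 15 9 3
```
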
Benefits

http://publicdata.eu aggregates datasets from many other European CKAN instances.
Enables community-editable mappings using an existing mechanism (MediaWiki).
The main CKAN dataset listing site links to the mapping wiki.
User-invokable reconversion.

Shortcomings

Usability Shortcomings:

The wiki page is hard to use because it is disconnected from both the original data and the resulting data.
The community hasn't used the tool, even though it has been available for a year.
The mapping wiki pages have meaningless names (e.g. http://wiki.publicdata.eu/wiki/Csv2rdf:F449751c-68d3-4f84-8fe3-5c3a4cb86c84).

Linked Data Best Practices Shortcomings:

curl -H "Accept: application/rdf+xml" -L http://publicdata.eu/dataset/directgov-referring-sites returns a gzipped HTML file (appending .rdf works, though: http://publicdata.eu/dataset/directgov-referring-sites.rdf).
The mappings are NOT expressed in RDF; they are expressed only as MediaWiki template arguments (and as Sparqlify mappings behind the scenes, but those aren't available for public inspection). Although the intent is to make them easy for a novice to read and write, that does not mean they shouldn't be lifted behind the scenes and made available as RDF for other systems to use.
The mappings are NOT described with RDF, since each one is just a wiki page (Special:Export can be used, but it's not findable from the page itself using linked data principles). The mapping descriptions do NOT refer back to the dataset that they enhance [using RDF], and they do NOT refer to the resulting RDF conversion [using RDF].
The namespace used (http://wiki.publicdata.eu/ontology/) for the RDF properties 404s.
The site for the converter tool (http://sparqlify.org/wiki/Main_Page) 404s.
The RDF conversion dump files use the N-Triples serialization but have the extension .rdf, which is generally reserved for the application/rdf+xml serialization (e.g. http://csv2rdf.aksw.org/sparqlified/f449751c-68d3-4f84-8fe3-5c3a4cb86c84_default-tranformation-configuration.rdf). This confuses even the best-of-breed RDF serialization tools.
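
A consumer could guard against the extension mismatch described in the last item by sniffing the content of a dump instead of trusting its file name. The sketch below is a rough heuristic rather than a parser; `guess_serialization`, `extension_matches`, the extension table, and the example.org sample triple are all illustrative assumptions.

```python
import os
import re

# Conventional extensions for the two serializations discussed here.
EXPECTED_BY_EXTENSION = {
    ".rdf": "application/rdf+xml",
    ".nt": "application/n-triples",
}

# An N-Triples statement starts with an IRI or blank-node subject,
# followed by an IRI predicate, an object, and a terminating dot.
NTRIPLE_LINE = re.compile(r"^\s*(<[^>]*>|_:\S+)\s+<[^>]*>\s+.+\s*\.\s*$")

def guess_serialization(text):
    """Guess the media type from the first non-blank, non-comment line."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if stripped.startswith("<?xml") or stripped.startswith("<rdf:RDF"):
            return "application/rdf+xml"
        if NTRIPLE_LINE.match(stripped):
            return "application/n-triples"
        return None  # e.g. HTML or some other format
    return None

def extension_matches(filename, text):
    """True when the file extension agrees with the sniffed serialization."""
    extension = os.path.splitext(filename)[1].lower()
    expected = EXPECTED_BY_EXTENSION.get(extension)
    return expected is not None and expected == guess_serialization(text)

# Made-up sample triple, used only to exercise the heuristic.
nt_sample = '<http://example.org/s> <http://example.org/p> "85" .\n'
print(guess_serialization(nt_sample))
print(extension_matches("dump.rdf", nt_sample))
```

With this check, an N-Triples dump named with a .rdf extension is flagged as a mismatch, while the same content under a .nt name passes.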

Mapping Capabilities Shortcomings:

It can't specify a datatype for a cell's value like conversion:range does (e.g. "85" is an xsd:integer).
It can't "promote" a cell value to a URI like conversion:range does (e.g. "http://www.google.co.nz" becomes <http://www.google.co.nz>).
It can't type a URI to a given class like conversion:range_template/conversion:subclass_of do (e.g. <http://www.google.co.nz> is a sioc:Space).
Its property creation strategy (put everything into http://wiki.publicdata.eu/ontology/) is not conservative enough and fosters collisions. csv2rdf4lod uses hierarchical naming based on the publishing organization, the dataset, and the version of the dataset (the so-called "[SDV naming](Conversion process phase: name)") to avoid terminology collisions while facilitating natural and incremental dataset integration.
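
The four missing capabilities can be illustrated with a few lines of Python that emit N-Triples fragments. This is a sketch of the ideas only, not of how csv2rdf4lod or the wiki implements them: the function names, the example.org base, the `source`/`dataset`/`version` path layout, and the source and dataset identifiers are all hypothetical.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SIOC_SPACE = "http://rdfs.org/sioc/ns#Space"
FLAT_NAMESPACE = "http://wiki.publicdata.eu/ontology/"

def typed_literal(value, datatype):
    """Render a cell value as a literal with an explicit datatype."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{datatype}>'

def promote_to_iri(value):
    """Render a cell value as an IRI node instead of a string literal."""
    return f"<{value.strip()}>"

def type_triple(iri_node, class_iri):
    """State that a promoted resource is an instance of a class."""
    return f"{iri_node} <{RDF_TYPE}> <{class_iri}> ."

def flat_property(column):
    """Every dataset shares one namespace, so equal column names collide."""
    return FLAT_NAMESPACE + column

def scoped_property(base, source, dataset, version, column):
    """Scope a property by source, dataset, and version (illustrative layout)."""
    return f"{base}/source/{source}/dataset/{dataset}/version/{version}/vocab/{column}"

print(typed_literal("85", XSD + "integer"))
site = promote_to_iri("http://www.google.co.nz")
print(type_triple(site, SIOC_SPACE))

# Two unrelated datasets that both have a column named "value".
base = "http://example.org"  # hypothetical base IRI
a = scoped_property(base, "org-a", "salaries", "2012", "value")
b = scoped_property(base, "org-b", "waste", "2010", "value")
print(flat_property("value") == flat_property("value"))  # flat names collide
print(a == b)  # scoped names stay distinct
```

The last two lines show the collision concern: under a single shared namespace the two "value" columns become the same property, while scoped names keep them apart until someone deliberately aligns them.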

Provenance and Metadata Shortcomings:

(to be enumerated)
Can we trust the aggregation that http://publicdata.eu does from the many other European-based CKAN instances? Or, do we have to redo it to get better results?
