Secondary Derivative Datasets

Tim L edited this page Jan 14, 2014 · 136 revisions

What is first

  • cr-cron.sh automates much of the construction of secondary derived datasets.
  • Aggregating subsets of converted datasets is a less-informative sibling of producing secondary derivative datasets, since it merely repackages existing data/metadata instead of deriving novel information from the existing data.
  • The need for, and broad applicability of, secondary derived datasets became clear during twc-healthdata.

What we will cover

Let's get to it!

Enabling

pr-enable-dataset.sh enables any of the built-in secondary derived datasets that come with Prizms. Running it from within the conversion [data root](csv2rdf4lod automation data root) without any parameters shows the status of all available derived datasets.

lebot@hub:~/prizms/hub/data/source$ cr-pwd.sh 
source/

lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh 
Available datasets:
   pr-spobal-ng           is *not* enabled at hub/pr-spobal-ng/version/latest/retrieve.sh (/home/lebot/opt/prizms/bin/dataset/pr-spobal-ng.sh)
   cr-aggregate-eparams   is *not* enabled at hub/cr-aggregate-eparams/version/latest/retrieve.sh (/home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh)

pr-enable-dataset.sh leverages csv2rdf4lod-automation's Triggers mechanism to derive secondary datasets using the same mechanisms that are used to process a single dataset. pr-enable-dataset.sh enables a derived dataset by inserting a trigger within the appropriate "SDV" directory conventions.

lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh --as-latest cr-aggregate-eparams
Created hub/cr-aggregate-eparams/version/latest/retrieve.sh -> /home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh

Rerunning the overview shows that cr-aggregate-eparams is now enabled.

lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh 
Available datasets:
   pr-spobal-ng           is *not* enabled at hub/pr-spobal-ng/version/latest/retrieve.sh (/home/lebot/opt/prizms/bin/dataset/pr-spobal-ng.sh)
   cr-aggregate-eparams   is enabled at hub/cr-aggregate-eparams/version/latest/retrieve.sh (/home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh)

Now commit the pointers, so that the production user can find them (shown here for pr-neighborlod):

lebot@hub:~/prizms/hub/data/source/hub$ git add -f pr-neighborlod/src pr-neighborlod/version/retrieve.sh 
lebot@hub:~/prizms/hub/data/source/hub$ git commit -m 'enabled pr-neighborlod'
CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID

Derived datasets are created with a source identifier for "us". CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID is a [CSV2RDF4LOD environment variable](CSV2RDF4LOD environment variables) used to indicate our source identifier.
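For example (the value "hub" here is hypothetical; use your own project's source identifier), it can be set in the shell environment before any derivations run:

```shell
# Hypothetical example: declare "us" as the source identifier 'hub',
# so that derived datasets are filed under that source directory.
export CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID="hub"
echo "$CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID"
```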

Avoiding redundant versions with the version "latest"

Provide the --as-latest argument to situate the trigger in version/latest/ instead of just in version/, avoiding a redundant new version each time the trigger runs.

Transporting enabled datasets across Prizms nodes

If you mirror a Prizms node, the version-controlled soft link that pr-enable-dataset.sh creates will likely break, since it points to an absolute path on the original node. Fortunately, Prizms is able to recognize this inconsistency and use the naming convention to fix the reference automatically.

Adding a secondary derived dataset

See below.

What's available

Aggregated portions of other datasets

Some of the automated datasets only aggregate useful subsets of existing datasets -- they don't derive new information but simply repackage what exists. See Aggregating subsets of converted datasets for coverage on:

  • Aggregating DCAT metadata,
  • Aggregating DROID file metadata,
  • Aggregating Datasets' Conversion Metadata,
  • Aggregating owl:sameAs links,
  • Aggregating MetaDatasets,
  • Aggregating rdfs:isDefinedBy,
  • Aggregating Turtle-in-comments,
  • Aggregating a full dump,
  • Provenance and metadata created from retrieval, tweaking, conversion, and aggregation, and
  • Sitemaps.

cr-isdefinedby

cr-isdefinedby.sh finds all predicates and classes occurring in a Prizms node, asserting rdfs:isDefinedBy to each term's namespace and prov:wasAttributedTo to its domain. This is used in the web site to organize the terminology that occurs in the data.

  • This dataset is incremental, and thus should be enabled at the version level (not as a "latest").
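The namespace of a term can be recovered from the URI itself. A minimal sketch of that step (the helper name is illustrative, not part of cr-isdefinedby.sh), splitting at the last '#' or, failing that, the last '/':

```shell
# Hypothetical sketch: derive the namespace of a term URI by stripping
# the local name at the last '#' (hash namespaces) or '/' (slash namespaces).
namespace_of() {
  case "$1" in
    *#*) echo "${1%#*}#" ;;
    *)   echo "${1%/*}/" ;;
  esac
}
namespace_of 'http://www.w3.org/2000/01/rdf-schema#isDefinedBy'
namespace_of 'http://xmlns.com/foaf/0.1/Person'
```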

cr-linksets

cr-linksets gathers up all URIs in a Prizms node that are outside of its namespaces, to find those that fall within a LOD Cloud Diagram bubble. See Finding Linksets among Linked Data Bubbles.

  • This dataset recalculates from scratch each time it runs. If it is enabled as 'latest', no history will be kept of how this Prizms node became a more integrated part of the rest of the LOD Cloud. If it is "versioned", you will be able to observe this growth.
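The first step in both cr-linksets and pr-neighborlod is deciding whether a URI is "external", i.e. outside the node's own namespace. A minimal sketch of that check (the namespace value is illustrative):

```shell
# Hypothetical sketch: a URI is external iff it does not start with
# this node's namespace prefix (value below is illustrative).
OUR_NAMESPACE='http://lod.example.org/'
is_external() {
  case "$1" in
    "$OUR_NAMESPACE"*) echo false ;;
    *)                 echo true  ;;
  esac
}
is_external 'http://dbpedia.org/resource/Troy,_New_York'
```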

cr-sitemap

cr-sitemap.sh produces a sitemap for robots.txt, so that automated agents can navigate the Prizms node data site. See Sindice at Ping the Semantic Web.

pr-spobal-ng

Deriving SPO Balance

pr-neighborlod

pr-neighborlod gathers up all URIs in a Prizms node that are outside of its namespaces, associates each with its domain, and accumulates the RDF obtained by dereferencing it as Linked Data.

A more elaborate analysis would record whether the external URI was dereferenceable as RDF:

:new prov:specializationOf :external ;
     dcterms:date "2013-09-13T12:28:31+00:00"^^xsd:dateTime ;
     a :NotDereferencable .

cr-pingback

cr-pingback.sh: see Ping the Semantic Web.

In the works

The following datasets have been created for special applications and need to be generalized to suit any Prizms node.

Deriving Between The Edge (BTE) Descriptions

https://github.com/timrdf/vsr/wiki/Characterizing-a-list-of-RDF-node-URIs#bte-vocabulary

This is currently done specifically for SVN paths in opendap.tw. It doesn't make sense to explode the BTE for the entire Prizms node, so we need to figure out a good general-case subset of URIs to process.

SPARQL CONSTRUCTs

  • SPARQL CONSTRUCTs à la the WCL property chain use case.

pr-sparql-log

An RDF dataset derived from grep 'GET /sparql' /var/log/apache2/access.log. This isn't implemented yet, but it was inspired by Mariano Rico's email to dbpedia. There are some privacy concerns here...

Questions we could answer:

I'm really not in the mood to dig into parsing apache logs...

What is the access.log pattern? My directive is CustomLog /var/log/apache2/access.log combined which follows log pattern "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"", with meanings:

  • %h Remote host
  • %l Remote logname (from identd, if supplied). This will return a dash unless mod_ident is present and IdentityCheck is set On.
  • %u Remote user (from auth; may be bogus if return status (%>s) is 401)
  • %t Time the request was received (standard English format)
  • \"%r\" First line of request
  • %>s Status. For requests that got internally redirected, %s is the status of the original request; %>s is the status of the final request.
  • %b Size of response in bytes, excluding HTTP headers. In CLF format, i.e. a '-' rather than a 0 when no bytes are sent.
  • \"%{Referer}i\" The contents of the Referer: header line in the request sent to the server.
  • \"%{User-agent}i\" The contents of the User-agent: header line in the request sent to the server.

An example log message:

192.168.1.62 - - [10/Jan/2014:02:33:11 +0000] "GET /sparql?show_inline=0&named_graph=&output=rdf&query=... HTTP/1.1" 200 468 "-" "LODSPeaKr version 20130612"
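A minimal sketch of the kind of extraction pr-sparql-log would need, run against the example line above. With the combined format, whitespace-splitting puts the client host in field 1 and the response size in field 10 (this is only an illustration, not the eventual implementation):

```shell
# Hypothetical sketch: pull the client IP and response size (bytes) out of
# a combined-format access.log line that requested /sparql.
line='192.168.1.62 - - [10/Jan/2014:02:33:11 +0000] "GET /sparql?show_inline=0&named_graph=&output=rdf&query=... HTTP/1.1" 200 468 "-" "LODSPeaKr version 20130612"'
echo "$line" | grep 'GET /sparql' | awk '{print $1, $10}'
```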

Implementing a derived dataset

pr-enable-dataset.sh offers to install any retrieval trigger that it finds in either of its search locations (e.g., Prizms' bin/dataset/ or csv2rdf4lod-automation's bin/secondary/, as shown in the listings above).

To add your own secondary dataset,

  • Create a retrieval trigger according to the Triggers and SDV organization conventions.
    • Accept arguments [-n] [version-identifier] for dry run and dataset version to use, respectively.
  • Include the [tic](tic turtle in comments) metadata #3> <> a conversion:RetrievalTrigger in your retrieval trigger.
  • If the trigger is idempotent, i.e. smart enough not to repeat itself mindlessly when called repeatedly, use #3> <> a conversion:RetrievalTrigger, conversion:Idempotent;
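The argument convention above can be sketched as follows (the function name is illustrative and the trigger body is omitted; only the [-n] / version-identifier handling and the tic metadata are shown):

```shell
#!/bin/bash
#3> <> a conversion:RetrievalTrigger, conversion:Idempotent .
#
# Hypothetical skeleton of a retrieval trigger's argument handling:
# accepts [-n] for a dry run and an optional version identifier.
parse_trigger_args() {
  local dryrun=false
  if [ "$1" = "-n" ]; then dryrun=true; shift; fi
  local version="${1:-latest}"
  echo "$dryrun $version"
}
parse_trigger_args -n 2014-Jan-14   # dry run against an explicit version
```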

Once a dataset is enabled, cr-cron.sh uses cr-retrieve.sh to pull any retrieval triggers that are in the data root (invoking cr-retrieve.sh -w --skip-if-exists).

Design principles: Monotonicity / Idempotency.

What is next
