Secondary Derivative Datasets

Tim L edited this page Jan 14, 2014 · 136 revisions

What is first

  • cr-cron.sh automates much of the construction of secondary derived datasets.
  • Aggregating subsets of converted datasets is a less-informative sibling of producing secondary derivative datasets, since it merely repackages existing data/metadata instead of deriving novel information from the existing data.
  • The need for, and broad applicability of, secondary derived datasets became clear during twc-healthdata.

What we will cover

Let's get to it!

Enabling

pr-enable-dataset.sh enables any of the built-in secondary derived datasets that come with Prizms. Running it from within the conversion [data root](csv2rdf4lod automation data root) without any parameters shows the status of all available derived datasets.

lebot@hub:~/prizms/hub/data/source$ cr-pwd.sh 
source/

lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh 
Available datasets:
   pr-spobal-ng           is *not* enabled at hub/pr-spobal-ng/version/latest/retrieve.sh (/home/lebot/opt/prizms/bin/dataset/pr-spobal-ng.sh)
   cr-aggregate-eparams   is *not* enabled at hub/cr-aggregate-eparams/version/latest/retrieve.sh (/home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh)

pr-enable-dataset.sh leverages csv2rdf4lod-automation's Triggers mechanism to derive secondary datasets using the same mechanisms that are used to process a single dataset. pr-enable-dataset.sh enables a derived dataset by inserting a trigger within the appropriate "SDV" directory conventions.

lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh --as-latest cr-aggregate-eparams
Created hub/cr-aggregate-eparams/version/latest/retrieve.sh -> /home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh

Rerunning the overview shows that cr-aggregate-eparams is now enabled.

lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh 
Available datasets:
   pr-spobal-ng           is *not* enabled at hub/pr-spobal-ng/version/latest/retrieve.sh (/home/lebot/opt/prizms/bin/dataset/pr-spobal-ng.sh)
   cr-aggregate-eparams   is enabled at hub/cr-aggregate-eparams/version/latest/retrieve.sh (/home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh)

Now commit the pointers, so that the production user can find them (shown here for pr-neighborlod):

lebot@hub:~/prizms/hub/data/source/hub$ git add -f pr-neighborlod/src pr-neighborlod/version/retrieve.sh 
lebot@hub:~/prizms/hub/data/source/hub$ git commit -m 'enabled pr-neighborlod'
CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID

Derived datasets are created with a source identifier for "us". CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID is a [CSV2RDF4LOD environment variable](CSV2RDF4LOD environment variables) used to indicate our source identifier.
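For example (the value "hub" here is hypothetical; use your own project's source identifier), it can be set in the shell environment before any derivations run:

```shell
# Hypothetical example: declare "us" as the source identifier 'hub',
# so that derived datasets are filed under that source directory.
export CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID="hub"
echo "$CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID"
```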

Avoiding redundant versions with the version "latest"

Provide the --as-latest argument to situate the trigger in version/latest/ instead of just in version/, avoiding a redundant new version each time the trigger runs.

Transporting enabled datasets across Prizms nodes

If you mirror a Prizms node, the version-controlled soft link that pr-enable-dataset.sh creates will likely break, since it points to an absolute path on the original node. Fortunately, Prizms is able to recognize this inconsistency and use the naming convention to fix the reference automatically.

Adding a secondary derived dataset

See below.

What's available

Aggregated portions of other datasets

Some of the automated datasets only aggregate useful subsets of existing datasets -- they don't derive new information but simply repackage what exists. See Aggregating subsets of converted datasets for coverage on:

  • Aggregating DCAT metadata,
  • Aggregating DROID file metadata,
  • Aggregating Datasets' Conversion Metadata,
  • Aggregating owl:sameAs links,
  • Aggregating MetaDatasets,
  • Aggregating rdfs:isDefinedBy,
  • Aggregating Turtle-in-comments,
  • Aggregating a full dump,
  • Provenance and metadata created from retrieval, tweaking, conversion, and aggregation, and
  • Sitemaps.

cr-isdefinedby

cr-isdefinedby.sh finds all predicates and classes occurring in a Prizms node, asserting rdfs:isDefinedBy to each term's namespace and prov:wasAttributedTo to its domain. This is used in the web site to organize the terminology that occurs in the data.

  • This dataset is incremental, and thus should be enabled at the version level (not as a "latest").
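The namespace of a term can be recovered from the URI itself. A minimal sketch of that step (the helper name is illustrative, not part of cr-isdefinedby.sh), splitting at the last '#' or, failing that, the last '/':

```shell
# Hypothetical sketch: derive the namespace of a term URI by stripping
# the local name at the last '#' (hash namespaces) or '/' (slash namespaces).
namespace_of() {
  case "$1" in
    *#*) echo "${1%#*}#" ;;
    *)   echo "${1%/*}/" ;;
  esac
}
namespace_of 'http://www.w3.org/2000/01/rdf-schema#isDefinedBy'
namespace_of 'http://xmlns.com/foaf/0.1/Person'
```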

cr-linksets

cr-linksets gathers up all URIs in a Prizms node that are outside of its namespaces, to find those that fall within a LOD Cloud Diagram bubble. See Finding Linksets among Linked Data Bubbles.

  • This dataset recalculates from scratch each time it runs. If it is enabled as 'latest', no history will be kept of how this Prizms node became a more integrated part of the rest of the LOD Cloud. If it is "versioned", you will be able to observe this growth.
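The first step in both cr-linksets and pr-neighborlod is deciding whether a URI is "external", i.e. outside the node's own namespace. A minimal sketch of that check (the namespace value is illustrative):

```shell
# Hypothetical sketch: a URI is external iff it does not start with
# this node's namespace prefix (value below is illustrative).
OUR_NAMESPACE='http://lod.example.org/'
is_external() {
  case "$1" in
    "$OUR_NAMESPACE"*) echo false ;;
    *)                 echo true  ;;
  esac
}
is_external 'http://dbpedia.org/resource/Troy,_New_York'
```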

cr-sitemap

cr-sitemap.sh produces a sitemap for robots.txt, so that automated agents can navigate the Prizms node data site. See Sindice at Ping the Semantic Web.

pr-spobal-ng

Deriving SPO Balance

pr-neighborlod

pr-neighborlod gathers up all URIs in a Prizms node that are outside of its namespaces, associates each with its domain, and accumulates the RDF obtained by dereferencing it as Linked Data.

A more elaborate analysis would record whether the external URI was dereferenceable as RDF:

:new prov:specializationOf :external ;
     dcterms:date "2013-09-13T12:28:31+00:00"^^xsd:dateTime ;
     a :NotDereferencable .

cr-pingback

cr-pingback.sh: see Ping the Semantic Web.

In the works

The following datasets have been created for special applications and need to be generalized to suit any Prizms node.

Deriving Between The Edge (BTE) Descriptions

https://github.com/timrdf/vsr/wiki/Characterizing-a-list-of-RDF-node-URIs#bte-vocabulary

This is currently done specifically for SVN paths in opendap.tw. It doesn't make sense to explode the BTE for the entire Prizms node, so we need to figure out a good general-case subset of URIs to process.

SPARQL CONSTRUCTs

  • SPARQL CONSTRUCTs à la the WCL property chain use case.

pr-sparql-log

An RDF dataset derived from grep 'GET /sparql' /var/log/apache2/access.log. This isn't implemented yet, but it was inspired by Mariano Rico's email to dbpedia. There are some privacy concerns here...

Questions we could answer:

I'm really not in the mood to dig into parsing apache logs...

What is the access.log pattern? My directive is CustomLog /var/log/apache2/access.log combined which follows log pattern "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"", with meanings:

  • %h Remote host
  • %l Remote logname (from identd, if supplied). This will return a dash unless mod_ident is present and IdentityCheck is set On.
  • %u Remote user (from auth; may be bogus if return status (%>s) is 401)
  • %t Time the request was received (standard English format)
  • \"%r\" First line of request
  • %>s Status. For requests that got internally redirected, %s is the status of the original request; %>s is the status of the final request.
  • %b Size of response in bytes, excluding HTTP headers. In CLF format, i.e. a '-' rather than a 0 when no bytes are sent.
  • \"%{Referer}i\" The contents of the Referer: header line in the request sent to the server.
  • \"%{User-agent}i\" The contents of the User-agent: header line in the request sent to the server.

An example log message:

192.168.1.62 - - [10/Jan/2014:02:33:11 +0000] "GET /sparql?show_inline=0&named_graph=&output=rdf&query=... HTTP/1.1" 200 468 "-" "LODSPeaKr version 20130612"
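A minimal sketch of the kind of extraction pr-sparql-log would need, run against the example line above. With the combined format, whitespace-splitting puts the client host in field 1 and the response size in field 10 (this is only an illustration, not the eventual implementation):

```shell
# Hypothetical sketch: pull the client IP and response size (bytes) out of
# a combined-format access.log line that requested /sparql.
line='192.168.1.62 - - [10/Jan/2014:02:33:11 +0000] "GET /sparql?show_inline=0&named_graph=&output=rdf&query=... HTTP/1.1" 200 468 "-" "LODSPeaKr version 20130612"'
echo "$line" | grep 'GET /sparql' | awk '{print $1, $10}'
```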

Implementing a derived dataset

pr-enable-dataset.sh offers to install any retrieval trigger that it finds in either of its search locations (e.g., Prizms' bin/dataset/ or csv2rdf4lod-automation's bin/secondary/, as shown in the listings above).

To add your own secondary dataset,

  • Create a retrieval trigger according to the Triggers and SDV organization conventions.
    • Accept arguments [-n] [version-identifier] for dry run and dataset version to use, respectively.
  • Include the [tic](tic turtle in comments) metadata #3> <> a conversion:RetrievalTrigger in your retrieval trigger.
  • If the trigger is idempotent, i.e. smart enough not to repeat itself mindlessly when called repeatedly, use #3> <> a conversion:RetrievalTrigger, conversion:Idempotent;
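The argument convention above can be sketched as follows (the function name is illustrative and the trigger body is omitted; only the [-n] / version-identifier handling and the tic metadata are shown):

```shell
#!/bin/bash
#3> <> a conversion:RetrievalTrigger, conversion:Idempotent .
#
# Hypothetical skeleton of a retrieval trigger's argument handling:
# accepts [-n] for a dry run and an optional version identifier.
parse_trigger_args() {
  local dryrun=false
  if [ "$1" = "-n" ]; then dryrun=true; shift; fi
  local version="${1:-latest}"
  echo "$dryrun $version"
}
parse_trigger_args -n 2014-Jan-14   # dry run against an explicit version
```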

Once a dataset is enabled, cr-cron.sh uses cr-retrieve.sh to pull any retrieval triggers that are in the data root (invoking cr-retrieve.sh -w --skip-if-exists).

Design principles: Monotonicity / Idempotency.

What is next
