Publishing LOGD's International Open Government Data Search data

timrdf edited this page Oct 3, 2011 · 91 revisions

The International Open Government Data Search (IOGDS) project developed their own publishing scripts instead of using the ones provided by csv2rdf4lod-automation.

This page is a collection of notes about what they needed that csv2rdf4lod-automation didn't provide, so that we can incorporate those needs back into the rest of the core automation. The issues that IOGDS has raised against csv2rdf4lod-automation are listed here.

These are notes on reverse engineering the code that is lying around - 
it is not intended to be an authoritative explanation of how IOGDS was constructed. 
It is scraps of evidence put together by an outsider.

(Some documentation: http://logd.tw.rpi.edu/lab/project/logd_internaltional_ogd_catalog)

What they did use

IOGDS used the directory conventions of the [data root](csv2rdf4lod automation data root), they used the enhancement parameters to specify how to transform their CSV scrapings to RDF, and they used the conversion trigger to invoke the core converter. This got them to the point of having per-file RDF conversion results in manual/ and their aggregations in publish/ (for all 80ish of their datasets).

What they did not use

IOGDS did not use the conversion cockpits' publish/bin/publish.sh to publish the converted RDF into named graphs named after the VoID datasets' URIs, nor did they use the Metadataset conventions described in Aggregating subsets of converted datasets (#237 and #238). Instead, they created a stand-alone PHP script that they placed in the [data root](csv2rdf4lod automation data root): source/logd-iogdc-exec.php.

So, the step that they recreated on their own was:

 conversion results on disk -> conversion results in triple store named graph

Which they achieved by running the following commands:

gemini$ cd /work/data-gov/csv2rdf4lod-automation/data/source
gemini$ ./logd-iogdc-exec.php load

When invoking the above, the following parameters are set:

    [time-start]                   => 1317517576
    [time]                         => 2011-10-01T21:06:16-04:00
    [dir-pwd]                      => /mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source
    [dir-temp]                     => /tmp/data-gov/iogdc-dump-ttl
    [filename-temp-all]            => iogdc-dump-all.tar.gz
    [uri-base]                     => http://logd.tw.rpi.edu
    [enhancement-id]               => 1
    [uri-metadata-graph]           => http://purl.org/twc/vocab/conversion/MetaDataset
    [uri-metadata-graph-test]      => http://purl.org/twc/vocab/conversion/MetaDataset-test
    [filename-metadata-graph]      => /tmp/data-gov/iogdc-dump-ttl/metadata-graph.ttl
    [filename-metadata-graph-test] => /tmp/data-gov/iogdc-dump-ttl/metadata-graph-test.ttl
    [namespace-dgtwc]              => http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#
    [uri-metadata-logd]            => http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#metadata-logd
    [uri-metadata-logd-test]       => http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#metadata-logd-test
    [filename-metadata-logd]       => /tmp/data-gov/iogdc-dump-ttl/metadata-logd.ttl
    [filename-metadata-logd-test]  => /tmp/data-gov/iogdc-dump-ttl/metadata-logd-test.ttl
    [option]                       => load
    [dir-start]                    => /mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source

(This is in the old [data root](csv2rdf4lod automation data root), which has been superseded by /srv/logd/data/source)

It then determines a list of version directories that should be part of the load:

  0 => '/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/portalu-de/catalog/version/2011-Sep-13',
  1 => '/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/datanest-fair-play-sk/catalog/version/2011-Sep-13',
  2 => '/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/ottawa-ca/catalog/version/2011-Sep-13',
  ...
  ...
  81 => '/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/datagm-org-uk/catalog/version/2011-Sep-14',
  82 => '/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/data-vic-gov-au/catalog/version/2011-Sep-15',
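
The enumeration step can be sketched in shell. This is a guess at the selection logic, not the script's actual code: the data root defaults to the dir-pwd value from the parameter dump above, and the directory depth is inferred from the listed paths.

```shell
# Hypothetical reconstruction of the version-directory enumeration.
DATA_ROOT=${DATA_ROOT:-/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source}
# Each entry is source-id/dataset-id/version/version-id, i.e. four levels deep:
find "$DATA_ROOT" -mindepth 4 -maxdepth 4 -type d -path '*/version/*' 2>/dev/null | sort
```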

It then switches on the load parameter and says that it will:

load conversion results into dump dir and then put all into one named graph in triple store - http://purl.org/twc/vocab/conversion/MetaDataset

It's using the deprecated /opt/virtuoso/scripts/vdelete (replaced in May 2011 by $CSV2RDF4LOD_HOME/bin/util/virtuoso/vdelete to eliminate the need for sudo, add logging, and parametrize the Virtuoso configuration instead of hard-coding the bindings):

run_command ( "sudo /opt/virtuoso/scripts/vdelete " . $params["load-uri-target"] );

Fortunately, we can eliminate the hard-coded requirements (there's more than meets the eye - the script above also hard-codes all of the Virtuoso parameters...) and switch over to the latest vdelete without a hitch:

run_command ( '$CSV2RDF4LOD_HOME/bin/util/virtuoso/vdelete ' . $params["load-uri-target"] );

Now, we can switch to a development triple store without this php script even knowing, by switching the CSV2RDF4LOD environment variables (portion of cr-vars.sh's output shown):

CSV2RDF4LOD_PUBLISH_VIRTUOSO_PORT       1112
CSV2RDF4LOD_PUBLISH_VIRTUOSO_INI_PATH   /srv/logd/config/triple-store/virtuoso/development.ini
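
In shell, the switch might look like this - a sketch using only the two variables above, with the development.ini path taken from the cr-vars.sh output:

```shell
# Point the csv2rdf4lod publishing utilities at the development Virtuoso
# instance by overriding the two variables before invoking the PHP script.
export CSV2RDF4LOD_PUBLISH_VIRTUOSO_PORT=1112
export CSV2RDF4LOD_PUBLISH_VIRTUOSO_INI_PATH=/srv/logd/config/triple-store/virtuoso/development.ini
# vload/vdelete read these, so logd-iogdc-exec.php needs no changes.
```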

It deletes the named graph http://purl.org/twc/vocab/conversion/MetaDataset:

[load-uri-target] => http://purl.org/twc/vocab/conversion/MetaDataset

It then runs through all of the version directories listed (80ish) and gears up to call run_case_task by feeding it the following three values:

/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/portalu-de/catalog/version/2011-Sep-13
/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source
<all of $params, which contains the above two values>

run_case_task hops into the conversion cockpit, unzips the conversion, copies it to a tmp directory, and loads it into the triple store:

/tmp/data-gov/iogdc-dump-ttl/portalu-de-catalog-2011-Sep-13.ttl -> http://purl.org/twc/vocab/conversion/MetaDataset
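
In shell, the per-directory work might look like the sketch below. The publish/*.ttl.gz layout and the exact file naming are assumptions read off the description above; the actual logic lives in the PHP script.

```shell
# Sketch of what run_case_task appears to do for one version directory.
load_version() {
  version_dir=$1   # e.g. .../source/portalu-de/catalog/version/2011-Sep-13
  tmp_dir=$2       # e.g. /tmp/data-gov/iogdc-dump-ttl
  # Flatten source-id/dataset-id/version-id into the temp file name:
  name=$(echo "$version_dir" | awk -F/ '{print $(NF-3)"-"$(NF-2)"-"$NF}')
  mkdir -p "$tmp_dir"
  # Unzip the published conversion into the temp directory...
  gunzip -c "$version_dir"/publish/*.ttl.gz > "$tmp_dir/$name.ttl"
  # ...then load it into the single named graph (not run in this sketch):
  # vload ttl "$tmp_dir/$name.ttl" http://purl.org/twc/vocab/conversion/MetaDataset
}
```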

... with the deprecated /opt/virtuoso/scripts/vload (replaced in May 2011 by $CSV2RDF4LOD_HOME/bin/util/virtuoso/vload to eliminate the need for sudo, eliminate needless file copying before the load, add logging, and parametrize the Virtuoso configuration instead of hard-coding the bindings):

run_command ( "sudo /opt/virtuoso/scripts/vload ttl "
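
Assuming the vload switch mirrors the vdelete one shown earlier, the replacement would be the $CSV2RDF4LOD_HOME/bin/util/virtuoso counterpart. The default home directory below is a placeholder, and the file/graph pair is the example load from above; the sketch only prints the invocation rather than executing it.

```shell
# Dry-run sketch: print the modern vload invocation rather than executing it.
vload="${CSV2RDF4LOD_HOME:-/opt/csv2rdf4lod-automation}/bin/util/virtuoso/vload"
echo "$vload" ttl \
  /tmp/data-gov/iogdc-dump-ttl/portalu-de-catalog-2011-Sep-13.ttl \
  http://purl.org/twc/vocab/conversion/MetaDataset
```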

Fulfilling OpenSearch queries (what S2S needed) with SPARQL queries (what LOGD had)

The S2S Framework needs an OpenSearch web service to obtain data, which is provided by (http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch.php <- gemini:/var/www/html/logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch.php <- google svn). The OpenSearch XML (http://logd.tw.rpi.edu/ws/iogdc/1.1/opensearch.xml <- gemini:/var/www/html/logd.tw.rpi.edu/ws/iogdc/1.1/opensearch.xml <- google svn) describes the service, which accepts OpenSearch requests and fulfills them by executing SPARQL queries against the LOGD triple store.
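
For example, the only request type visible in the embedding Drupal page (quoted at the end of this page) is TotalDataset, so a request to the service presumably looked like the URL built below. The host is no longer assumed reachable, so this sketch only constructs and prints the request rather than sending it.

```shell
# Build an example OpenSearch request URL; request=TotalDataset is the one
# parameter observed in the embedding Drupal page.
base='http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch.php'
url="$base?request=TotalDataset"
echo "curl -s '$url'"
```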

http://logd.tw.rpi.edu/ws/iogdc/1.1 is version controlled in http://data-gov-wiki.googlecode.com/svn/trunk/web/logd.tw.rpi.edu/ws/iogdc/1.1 (this is a little prettier for navigation) and can be obtained by:

svn checkout http://data-gov-wiki.googlecode.com/svn/trunk/web/logd.tw.rpi.edu/ws/iogdc/1.1

The forward-facing OGDSearch.php depends on phpOGDSearch.php and phpWebUtil.php.

phpOGDSearch.php specifies the endpoint that it queries:

const CONFIG_SPARQL_ENDPOINT= "http://gemini.tw.rpi.edu:8890/sparql";
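
Since this is a standard Virtuoso SPARQL endpoint, the same store can also be queried directly over the SPARQL protocol, with the query sent in the `query` parameter. The ASK query below is illustrative, not one of the service's actual queries, and the host is long gone, so this too is a dry run that only prints the command.

```shell
# Dry-run sketch of hitting the endpoint named in CONFIG_SPARQL_ENDPOINT.
endpoint='http://gemini.tw.rpi.edu:8890/sparql'
query='ASK FROM <http://purl.org/twc/vocab/conversion/MetaDataset> { ?s ?p ?o }'
# -G + --data-urlencode sends the query as an encoded GET parameter.
echo curl -s -G "$endpoint" --data-urlencode "query=$query"
```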

Testing S2S IOGDS demo with development data

There is an S2S SearchService instance that points to the OpenSearch description document (a.k.a. OpenSearch XML). Looking at these service descriptions, one can see that the service http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch-test.php is called to fulfill IOGDS's OpenSearch queries.

The URI for the IOGDC test S2S SearchService instance is http://logd.tw.rpi.edu/s2s/1/1/TestLogdIntlSearchService, which is not dereferenceable; it is only used to identify the service when calling the init(serviceURI) JavaScript function.

Diffing http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch.php and http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch-test.php:

@gemini:/var/www/html/logd.tw.rpi.edu/ws/iogdc/1.1$ diff OGDSearch.php OGDSearch-test.php
74,75c74,75
< $svc->params_config[OGDSearch::CONFIG_FIELD_TITLE] = "OGDSearch";
< $svc->params_config[OGDSearch::CONFIG_FIELD_URI_METADATASET] = "<http://purl.org/twc/vocab/conversion/MetaDataset>";
---
> $svc->params_config[OGDSearch::CONFIG_FIELD_TITLE] = "OGDSearch-test";
> $svc->params_config[OGDSearch::CONFIG_FIELD_URI_METADATASET] = "<http://purl.org/twc/vocab/conversion/MetaDataset-test>";

And then there was Drupal...

The "production demonstration" is currently at http://logd.tw.rpi.edu/demo/international_dataset_catalog_search.

Editing it (http://logd.tw.rpi.edu/node/9903/edit), we can see how the S2S widget scripts are included:

...
<script src="/s2s/scripts/core/logd-S2SWidget.js" type="text/javascript"></script>
...
<script src="http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch-test.php?request=TotalDataset"></script>
...
