
Provenance

timrdf edited this page May 25, 2011 · 83 revisions

Measure of care and attention

http://altmetrics.org/workshop2011/

How many times has a dataset been converted (results)? The number of converter invocations correlates with the amount of human care and attention paid to the results. This must, of course, be viewed in light of the distribution (and minimum) for all other datasets. The integral from −∞ to the point where dataset d sits on the curve can serve as a measure of how much care it has been given relative to the rest (cf. percentile). The winner for LOGD is clearly the NITRD conversion, since there were 12 tables, each requiring cell-based conversion and some converter enhancements (and debugging) to accomplish.

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?version ?logs
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset>  {
    ?version conversion:num_invocation_logs ?logs
  }
}ORDER BY DESC(?logs)
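
The percentile-style measure described above can be sketched in Python. This is only a minimal illustration, assuming the query results have already been loaded into a dict mapping each dataset version to its invocation-log count. The function name `care_percentile` and the sample counts are hypothetical and are not LOGD results.

```python
def care_percentile(logs_by_version, version):
    """Fraction of datasets whose invocation-log count is <= this one's.

    This is the empirical CDF evaluated at the dataset's count, i.e. the
    discrete analogue of integrating the distribution up to that point.
    """
    counts = list(logs_by_version.values())
    target = logs_by_version[version]
    at_or_below = sum(1 for c in counts if c <= target)
    return at_or_below / len(counts)


# Hypothetical counts, for illustration only (not actual LOGD results).
example = {"a": 1, "b": 1, "c": 3, "d": 12}
print(care_percentile(example, "d"))
print(care_percentile(example, "a"))
```

A value near 1 would suggest that a dataset received more conversion attention than almost all others; ties count toward the dataset's own percentile.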

Grouping the previous results by the source (results). Clearly, LOGD still spends most of its attention on data.gov data... (note: nitrd is missing for some reason):

PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
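# Sum the invocation logs per contributor. The alias must not reuse ?logs,
# which is already bound in the graph pattern.
SELECT ?contributor (SUM(?logs) AS ?total_logs)
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset>  {
    ?version conversion:num_invocation_logs ?logs;
             dcterms:contributor            ?contributor .
  }
}GROUP BY ?contributor ORDER BY DESC(?total_logs)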

Proof Markup Language

PML predicate use distribution (results):

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?p (count(*) as ?count)
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
    [] ?p []
  }
  # str() is needed because regex() operates on strings, not IRIs
  filter(regex(str(?p),'^http://inference-web.org/2.0.*'))
} group by ?p order by desc(?count)

PML class use distribution (results):

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?type (count(*) as ?count)
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
    [] a ?type
  }
  # str() is needed because regex() operates on strings, not IRIs
  filter(regex(str(?type),'^http://inference-web.org/2.0.*'))
} group by ?type order by desc(?count)

Attribution

Where am I mentioned in LOGD's csv2rdf4lod instance? (results):

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?s ?p ?o
WHERE {
  {GRAPH ?g {
    ?s ?p <http://tw.rpi.edu/instances/TimLebo> 
  }}
  UNION 
  {GRAPH ?g {
    <http://tw.rpi.edu/instances/TimLebo> ?p ?o 
  }}
} order by ?s ?o ?p

Identifying a dataset's source data using retrieval and conversion provenance

(example being developed)

Make an OWL property chain to assert dcterms:source:

  • File C was downloaded and used to create a void:subset T of conversion:Dataset D.
  • T dcterms:source C .
  • D dcterms:source C .

dcterms:source owl:propertyChainAxiom (
       :todo
     ) .

Crediting data catalog for discovery of a dataset

This can be done by looking at the beginning of the irw:redirects chain in the provenance recorded by pcurl.sh.

Using file hashes

Timestamped file hashes are used to describe instances of pmlp:Information.

By comparing the timestamped hash of the source/ file at justify.sh time (source -> manual) to the one recorded at pcurl.sh time, we can identify inconsistencies in the source/ file.
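
The comparison described above can be sketched in Python. This is a minimal illustration and not the csv2rdf4lod implementation: the choice of SHA-256 is an assumption (these notes do not say which hash algorithm the scripts use), and the function names `timestamped_hash` and `changed_between` are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def timestamped_hash(path):
    """Return (UTC timestamp, SHA-256 hex digest) for the file at path.

    SHA-256 is an assumption made for this sketch.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large source files need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return datetime.now(timezone.utc), digest.hexdigest()


def changed_between(earlier, later):
    """True if the digest recorded later differs from the earlier one."""
    return earlier[1] != later[1]
```

One record would be taken when the file is retrieved and another when it is used as input to a manual step. A differing digest would indicate that the source/ file was modified in between.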

Data "freshness"

(from Andrea Splendiani on public-semweb-lifesci@w3.org)

we would need an extra information: how fresh the information is.

Do you know if there is any standard metadata to indicate the last refresh of the endpoint content ? Technically speaking this kind of information should be associated to data as provenance. In practice however, 90% of utility can be reached by having some state information for each big graph in the endpoint, corresponding to major data sources.

In practice it would be nice to have a standard dictionary so that we can ask to the triplestore: list of graphs/datasets.

for each of these (or for endpoint itself if this holds information which is "coherent" source-wise):

  • update frequency
  • last update
  • data source (type and in case link).

Matthew Gamble mentions the myriad proposals for capturing this metadata: http://www.w3.org/wiki/DatasetDynamics
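
The per-graph state information requested above (update frequency, last update, data source) could be modeled roughly as follows. This is a sketch only and does not correspond to any of the proposed vocabularies. The class name `GraphFreshness`, its field names, and the staleness rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class GraphFreshness:
    graph: str                         # named graph or dataset identifier
    update_frequency: timedelta        # how often the graph is expected to refresh
    last_update: datetime              # when the graph content was last refreshed
    source_type: str                   # kind of data source, e.g. "csv"
    source_link: Optional[str] = None  # link to the source, if there is one

    def is_stale(self, now: datetime) -> bool:
        """True if more than one update interval has passed since last_update."""
        return now - self.last_update > self.update_frequency


# Hypothetical values, for illustration only.
weekly = GraphFreshness(
    graph="http://example.org/graph/a",
    update_frequency=timedelta(days=7),
    last_update=datetime(2011, 5, 1),
    source_type="csv",
)
print(weekly.is_stale(datetime(2011, 5, 25)))
```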

Heterogeneous provenance

Incorporating first-party provenance into the csv2rdf4lod workflow - how do they connect?

Examples of first parties that provide some provenance (or could, with an email, some hot cocoa, and perhaps a puppy):

  • NCBI eg 20 row example
  • Flu db is aggregating
  • CHSI aggregating
  • Impact teen

A data dictionary may mention that a value came from a sensor, that something was aggregated, etc.

Questions to answer:

  • How does a publisher express it?
  • How would a third party annotate isolated data to claim its source?

Spring 2010 Advanced Semantic Technologies Project - triple-level provenance - Tech Report

http://www.geonames.org/data-sources.html lists a couple of dozen sources.

JWS section

Collecting, converting, enhancing, and republishing data raises important concerns about the integrity of the resulting data products and the applications that use them. For TWC LOGD, this is especially true as a university aggregating a variety of governmental data from both government and non-government sources -- each with their own degree of authority and trustworthiness. To address this, we have incorporated provenance capture to facilitate transparency by allowing data consumers to inspect and query the lineage of the data products we produce. The ability to accurately cite original data source organizations is a direct consequence of this transparency, allowing improved credibility for derived data products and applications. This additional metadata also gives credit where credit is due to those who have spent a lot of time and energy creating the original data [EOS, TRANSACTIONS, AMERICAN GEOPHYSICAL UNION 91 p. 297-298].

Provenance in the LOGD workflow begins with the naming of the dataset. Short identifiers for the “source”, “dataset”, and “version” are central to the construction of the dataset’s URI and implicitly place it within the provenance context of who, what, and which. The URLs from which the government data files are retrieved are captured at the time of retrieval and encoded using the Proof Markup Language (PML) [5]. Although relatively simple, this information is critically valuable for associating any subsequent data products with their authoritative sources. Through the rest of the LOGD workflow, data products are organized into those produced automatically (and repeatably) and those influenced by manual effort (and less repeatably), with their causal associations captured and encoded using PML. The development of a converter capable of interpreting the tabular structure of CSV formats according to declarative parameters [http://doi.acm.org/10.1145/1839707.1839755] was essential for minimizing the amount of manual modification of original government files.

In addition to capturing the provenance among holistic files, the csv2rdf4lod converter provides provenance at the granularity of individual triples. This ability was motivated by previous analysis of user-based trust in semantic-based mashups [http://dx.doi.org/10.1007/978-3-642-17819-1_21]. This allows inquiry and inspection at the assertion level, such as “How do you know that the UK gave Ethiopia $107,958,576 USD for Education in 2007/8?” The following figure shows one web application leveraging this granular provenance. Clicking the text “oh yeah?” in the table invokes a SPARQL DESCRIBE query on the triple’s subject and predicate, causing provenance fragments from the original CSV’s rows and columns to be combined to identify the original spreadsheet’s URL, the cell that caused the triple, the interpretation parameters applied, and the author of the parameters.

Attribution example

From http://hints.cancer.gov/dataset.jsp, http://hints.cancer.gov/agreement.jsp?selected=2007SAS:

HINTS Data Terms of Use

It is of utmost importance to ensure the confidentiality of survey participants. Every effort has been made to exclude identifying information on individual respondents from the computer files. Some demographic information such as sex, race, etc., has been included for research purposes. NCI expects that users of the data set will adhere to the strictest standards of ethical conduct for the analysis and reporting of nationally collected survey data. It is mandatory that all research results be presented/published in a manner that protects the integrity of the data and ensures the confidentiality of participants.

In order for the Health Information National Trends Survey (HINTS) to provide a public-use or another version of data to you, it is necessary that you agree to the following provisions.

   1. You will not present/publish data in which an individual can be identified. Publication of small cell sizes should be avoided.
   2. You will not attempt to link nor permit others to link the data with individually identified records in another database.
   3. You will not attempt to learn the identity of any person whose data are contained in the supplied file(s).
   4. If the identity of any person is discovered inadvertently, then the following should be done;
         1. no use will be made of this knowledge,
         2. the HINTS Program staff will be notified of the incident,
         3. no one else will be informed of the discovered identity.
   5. You will not release nor permit others to release the data in full or in part to any person except with the written approval of the HINTS Program staff.
   6. If accessing the data from a centralized location on a time sharing computer system or LAN, you will not share your logon name and password with any other individuals. You will also not allow any other individuals to use your computer account after you have logged on with your logon name and password.
   7. For all software provided by the HINTS Program, you will not copy, distribute, reverse engineer, profit from its sale or use, or incorporate it in any other software system.
   8. The source of information should be cited in all publications. The appropriate citation is associated with the data file used. Please see Suggested Citations in the Download HINTS Data section of this Web site, or the Readme.txt associated with the ASCII text version of the HINTS data.
   9. Analyses of large HINTS domains usually produce reliable estimates, but analyses of small domains may yield unreliable estimates, as indicated by their large variances. The analyst should pay particular attention to the standard error and coefficient of variation (relative standard error) for estimates of means, proportions, and totals, and the analyst should report these when writing up results. It is important that the analyst realizes that small sample sizes for particular analyses will tend to result in unstable estimates.
  10. You may receive periodic e-mail updates from the HINTS administrators.

http://projects.iq.harvard.edu/datacitation_workshop/pages/attendees

DFID moving example

Sure, the first provenance broke, but subsequent versions of the same dataset were retrieved using the newer URL. TODO: figure out how to recognize this and recover from it. http://logd.tw.rpi.edu/source/dfid-gov-uk/dataset_page/statistics-on-international-development-2009

Formerly known as

http://oas.samhsa.gov/WebOnly.htm#NSDUHtabs

Detailed Tables
National Survey on Drug Use & Health
formerly called the Household Survey on Drug Abuse (NHSDA) 

Workshops
