
Assisting vocabulary selection

timrdf edited this page Jan 25, 2012 · 28 revisions

How does DataFAQs play a role in vocabulary selection? Would DataFAQs be used as part of an iterative process?

Yes. And Yes.

The vocabulary that one chooses to model a domain is critically important. Although many vocabularies may adequately describe the topic at hand, some vocabularies have more practical value than others.

To take an example from our most recent conversion, consider two alternative RDF forms of the same tabular row:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

@prefix local_vocab: 
  <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/vocab/> .
@prefix e1: 
  <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/vocab/enhancement/1/> .
@prefix biographical-directory-of-the-united-states-congress: 
  <http://localhost/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/> .
@prefix value_of_state: 
  <http://localhost/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/value-of/state/> .
@prefix :      
  <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04/> .


:congressperson_49 

  dcterms:isReferencedBy 
   <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04> ;
   void:inDataset 
    <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04> ;

   a local_vocab:Congressperson , foaf:Person ;

   foaf:firstName   "John" ;
   foaf:family_name "BULL" ;

   e1:congress   biographical-directory-of-the-united-states-congress:congress_0 ;
   foaf:memberOf biographical-directory-of-the-united-states-congress:congress_0 ; # sic
   foaf:workInfoHomepage <http://bioguide.congress.gov/scripts/biodisplay.pl?index=B001047> , 
                         <http://bioguide.congress.gov/scripts/guidedisplay.pl?index=B001047> , 
                         <http://bioguide.congress.gov/scripts/bibdisplay.pl?index=B001047> ;

   con:preferredURI      biographical-directory-of-the-united-states-congress:B001047 ;
   prov:specializationOf biographical-directory-of-the-united-states-congress:B001047 ;

   e1:doc "2012-01-04T02:12:01" ;
   dbpediaprop:state value_of_state:SC; 
.

value_of_state:SC 
   dcterms:identifier "SC" ;
   rdfs:label         "SC" ;
   owl:sameAs dbpedia:South_Carolina , 
             <http://sws.geonames.org/4597040/> , 
             govtrackusgov:SC .

Many semantic web developers would agree that some of the modeling above is slightly better than the modeling that follows:

@prefix : 
  <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04/> .
@prefix raw: 
  <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/vocab/raw/> .

:thing_49 
  dcterms:isReferencedBy 
  <http://localhost/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04> ;
  void:inDataset 
  <http://localhost/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04> ;

   raw:first_name "John" ;
   raw:last_name  "BULL" ;
   raw:congress   "0" ;
   raw:p_url      "http://bioguide.congress.gov/scripts/biodisplay.pl?index=B001047" ;
   raw:doc        "2012-01-04T02:12:01" ;
   raw:state      "SC" ;
   raw:death      "1802" ;
   raw:birth      "1740c" ;
   raw:party      " " ;
   raw:position   "ContCong" ;
   raw:c_yr       "" ;
   ov:csvRow      "49"^^xsd:integer .

But what, exactly, is better about it? Well, lots of things. Different people care about different aspects of the difference shown above. Some claims about quality may include:

  • foaf:firstName is way better than raw:first_name because 400 systems recognize it and display it.
  • raw:p_url, both as a URI and as a label, is incomprehensible to anyone who did not build this database. And its value is a literal, which means that RDF agents will not know that it can be resolved on the web. Using foaf:workInfoHomepage is way better because it already exists to associate a person with their work homepages. And systems recognize foaf already. And people know foaf already.
  • e1:congress is way better than raw:congress because its value is a URI that can be further described. Being stuck with raw:congress's value "0" is very uninformative. What do I do with zero? At the very least, we can type biographical-directory-of-the-united-states-congress:congress_0 and start describing its temporal interval, etc.
  • ACK! Someone started using foaf:memberOf, when that URI is not defined in the foaf namespace! That violates Linked Data principles. On the other hand, it's pretty obvious what it is -- it's the inverse of foaf:member, and we can use it and have systems recognize it even without the FOAF Elite defining it in their vocabulary. Practicality can trump principles, depending on who you ask.
  • We might not know what local_vocab:Congressperson is, but at least we know it's a kind of person (foaf:Person). We can work with that.
  • dbpediaprop:state value_of_state:SC is way better than raw:state "SC" because lots of people run to dbpedia for example data, so more people will start using dbpediaprop:state. But when more people start using it without clear, established rules, they'll use it inconsistently. So the relation will acquire many meanings and run the risk of becoming meaningless.
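
Some of the claims above are mechanical enough to check in code. The following is a minimal sketch in Python, not DataFAQs' actual interface: the tuple representation of triples, the helper names (`url_valued_literals`, `undefined_foaf_predicates`), and the deliberately incomplete `FOAF_DEFINED` set are all assumptions made for illustration.

```python
# Hypothetical sketch: a triple is (subject, predicate, object), and each
# object is tagged as ("uri", value) or ("literal", value).

FOAF = "http://xmlns.com/foaf/0.1/"
RAW = ("http://logd.tw.rpi.edu/source/congress-gov/dataset/"
       "biographical-directory-of-the-united-states-congress/vocab/raw/")
BIO = ("http://localhost/source/congress-gov/dataset/"
       "biographical-directory-of-the-united-states-congress/")
HOMEPAGE = "http://bioguide.congress.gov/scripts/biodisplay.pl?index=B001047"

# Assumption: a small, incomplete sample of terms that FOAF defines.
FOAF_DEFINED = {FOAF + term for term in (
    "Person", "firstName", "family_name", "member", "workInfoHomepage")}


def uri(value):
    return ("uri", value)


def lit(value):
    return ("literal", value)


def url_valued_literals(triples):
    """Predicates whose value looks like a web address but is a literal."""
    return [p for (_, p, (kind, value)) in triples
            if kind == "literal" and value.startswith(("http://", "https://"))]


def undefined_foaf_predicates(triples):
    """Predicates in the FOAF namespace that are not in FOAF_DEFINED."""
    return sorted({p for (_, p, _) in triples
                   if p.startswith(FOAF) and p not in FOAF_DEFINED})


# A few triples from each of the two modelings shown earlier.
enhanced = [
    (":congressperson_49", FOAF + "firstName", lit("John")),
    (":congressperson_49", FOAF + "memberOf", uri(BIO + "congress_0")),
    (":congressperson_49", FOAF + "workInfoHomepage", uri(HOMEPAGE)),
]
raw = [
    (":thing_49", RAW + "first_name", lit("John")),
    (":thing_49", RAW + "congress", lit("0")),
    (":thing_49", RAW + "p_url", lit(HOMEPAGE)),
]

print(url_valued_literals(raw))             # flags raw:p_url
print(undefined_foaf_predicates(enhanced))  # flags foaf:memberOf
```

Each function encodes one stakeholder's concern and nothing more. Whether a flagged foaf:memberOf counts as a defect or as an acceptable shortcut is still a judgment left to whoever reads the result.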

DataFAQs: the evaluation framework that gives you a voice.

DataFAQs is not designed to declare the authoritative quality of the datasets it encounters. Instead, it is a framework that allows interested stakeholders to express, survey, and understand the aspects of quality that they and others value. This increased community understanding -- accelerated by automated, asynchronous feedback -- provides the basis for stakeholders to make better, more informed decisions about the vocabulary that they use. Those decisions are based on concrete, qualitative information that is provided by the community, for the community. DataFAQs just connects all of the dots, accumulates perspectives on datasets, and allows you to explore what the community thinks about your dataset.
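
The idea of accumulating several perspectives on the same dataset can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not a description of how DataFAQs is implemented: the `Triple` type, the two evaluators (`uses_foaf`, `describable_objects`), the stakeholder labels, and `collect_perspectives` are all hypothetical names.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    s: str
    p: str          # prefixed name, e.g. "foaf:firstName"
    o: str
    o_is_uri: bool  # True when the object is a resource, False for a literal


def uses_foaf(triples):
    """One stakeholder's concern: is any FOAF predicate used at all?"""
    return any(t.p.startswith("foaf:") for t in triples)


def describable_objects(triples):
    """Another concern: how many values are URIs that can be described further?"""
    return sum(1 for t in triples if t.o_is_uri)


def collect_perspectives(datasets, evaluators):
    """Run every evaluator over every dataset; no overall grade is computed."""
    return {name: {who: evaluate(triples) for who, evaluate in evaluators.items()}
            for name, triples in datasets.items()}


datasets = {
    "enhanced": [
        Triple(":congressperson_49", "foaf:firstName", "John", False),
        Triple(":congressperson_49", "e1:congress",
               "biographical-directory-of-the-united-states-congress:congress_0", True),
        Triple(":congressperson_49", "dbpediaprop:state", "value_of_state:SC", True),
    ],
    "raw": [
        Triple(":thing_49", "raw:first_name", "John", False),
        Triple(":thing_49", "raw:congress", "0", False),
        Triple(":thing_49", "raw:state", "SC", False),
    ],
}

# Hypothetical stakeholders, each contributing one evaluator.
evaluators = {
    "ui_developer": uses_foaf,
    "data_linker": describable_objects,
}

report = collect_perspectives(datasets, evaluators)
for name, views in report.items():
    print(name, views)
```

The report keeps each stakeholder's result separate instead of collapsing them into a single score, which mirrors the point that the framework does not declare authoritative quality.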

DataFAQs can and will be used to assist vocabulary selection.

It is important to remember that DataFAQs is not only a resource that provides "grades" for datasets that you point it to. More importantly, it is a framework that allows any stakeholder to reflect their needs, interests, or preferences when it comes to the quality of any dataset.
