Characterizing table completeness
timrdf edited this page Sep 23, 2011 · 27 revisions
$ java edu.rpi.tw.data.csv.util.BinaryTable
usage: BinaryTable <file> [--comment-character char] [--header-line headerLineNumber] [--delimiter delimiter]
[--column-stop colNumber]
see https://github.com/timrdf/csv2rdf4lod-automation/wiki/Characterizing-table-completeness
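Column numbers run along the top (last digit only), pattern occurrence frequencies along the right, and a completeness indication along the bottom (| indicates that every row has a value in this column; _ indicates that some cells in this column are missing values). Within each pattern, . marks a cell that has a value and a blank marks a cell that is missing one.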
The following sample output comes from running the tool against the geonames US zip codes. Among other things, it says "41,940 rows have values for all cells except cells 8, 9, and 12" and "three rows have values for all cells except cells 6, 7, 8, and 9".
bash-3.2$ java edu.rpi.tw.data.csv.util.BinaryTable source/US.txt --header-line 0 --delimiter '\t'
123456789012
.....    ...| 3
.......  .. | 41940
..... .  .. | 1
... .    .. | 4
.......  ...| 147
... ...  .. | 10
.....    .. | 1408
......   .. | 84
... . .  .. | 5
|||_|____||_
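The way such an output can be derived from a delimited file is sketched below. This is a minimal illustration, not the source of edu.rpi.tw.data.csv.util.BinaryTable: the class name CompletenessSketch and its methods pattern, patternCounts, and footer are hypothetical, and it assumes that a cell counts as "missing" when it is empty or whitespace only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the implementation of edu.rpi.tw.data.csv.util.BinaryTable.
class CompletenessSketch {

    // One character per column: '.' if the cell has a value, ' ' if it is empty.
    // Assumption: whitespace-only cells are treated as empty.
    static String pattern(String[] cells, int width) {
        StringBuilder sb = new StringBuilder(width);
        for (int i = 0; i < width; i++) {
            boolean hasValue = i < cells.length && !cells[i].trim().isEmpty();
            sb.append(hasValue ? '.' : ' ');
        }
        return sb.toString();
    }

    // Counts how many rows share each pattern, in order of first appearance.
    static Map<String, Integer> patternCounts(List<String> rows, String delimiter, int width) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String row : rows) {
            // A limit of -1 keeps trailing empty cells, which split() would otherwise drop.
            String[] cells = row.split(Pattern.quote(delimiter), -1);
            counts.merge(pattern(cells, width), 1, Integer::sum);
        }
        return counts;
    }

    // '|' if every observed pattern has a value in that column, '_' otherwise.
    static String footer(Iterable<String> patterns, int width) {
        StringBuilder sb = new StringBuilder(width);
        for (int i = 0; i < width; i++) {
            boolean complete = true;
            for (String p : patterns) {
                if (p.charAt(i) != '.') {
                    complete = false;
                    break;
                }
            }
            sb.append(complete ? '|' : '_');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Four made-up, tab-delimited rows with three columns each.
        List<String> rows = Arrays.asList("a\tb\tc", "a\t\tc", "d\t\tf", "g\th\t");
        int width = 3;
        Map<String, Integer> counts = patternCounts(rows, "\t", width);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "| " + e.getValue());
        }
        System.out.println(footer(counts.keySet(), width));
    }
}
```

For the four made-up rows in main, only the first column is filled in every row, so the footer line is `|__`; the middle pattern (value, blank, value) is shared by two rows and the other two patterns occur once each.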
- Script: cr test conversion.sh uses this to gray out columns that are missing values, which helps design the enhancement parameters for geonames.