Characterizing table completeness
timrdf edited this page Sep 23, 2011 · 27 revisions
$ java edu.rpi.tw.data.csv.util.BinaryTable
usage: BinaryTable <file> [--comment-character char] [--header-line headerLineNumber] [--delimiter delimiter]
[--column-stop colNumber]
see https://github.com/timrdf/csv2rdf4lod-automation/wiki/Characterizing-table-completeness
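Column numbers run along the top (last digit only), the number of rows matching each pattern is on the right, and a completeness indicator for each column runs along the bottom (|: all values present, _: some missing). Within a pattern, a dot marks a cell that has a value and a blank marks a missing one.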
The following sample output was produced from the geonames US zip code file. Among other things, it says that 41,940 rows have values for all cells except cells 8, 9 and 12, and that three rows have values for all cells except cells 6, 7, 8 and 9.
bash-3.2$ java edu.rpi.tw.data.csv.util.BinaryTable source/US.txt --header-line 0 --delimiter '\t'
123456789012
..... ...| 3
....... .. | 41940
..... . .. | 1
... . .. | 4
....... ...| 147
... ... .. | 10
..... .. | 1408
...... .. | 84
... . . .. | 5
|||_|____||_
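The same kind of characterization can be approximated in a few lines of Python. This is only a minimal sketch of the idea, not the BinaryTable implementation: the function names `characterize` and `render` and the three-column sample data are hypothetical, and it assumes a cell counts as missing when it is empty or whitespace-only.

```python
import csv
import io
from collections import Counter

def characterize(rows, width):
    """Count presence/absence patterns and build the completeness footer."""
    patterns = Counter()
    for row in rows:
        # Pad short rows so every pattern has the same width.
        cells = list(row) + [""] * (width - len(row))
        # '.' marks a cell with a value, ' ' marks a missing one (assumption).
        patterns["".join("." if c.strip() else " " for c in cells[:width])] += 1
    # A column is complete ('|') only if no observed pattern has a blank there.
    footer = "".join(
        "|" if all(p[i] == "." for p in patterns) else "_"
        for i in range(width)
    )
    return patterns, footer

def render(patterns, footer):
    """Lay out header, patterns with counts, and footer as plain text."""
    header = "".join(str((i + 1) % 10) for i in range(len(footer)))
    lines = [header]
    lines += [f"{p}| {n}" for p, n in patterns.items()]
    lines.append(footer)
    return "\n".join(lines)

# Hypothetical tab-delimited sample: the middle cell is missing in two rows.
sample = "a\tb\tc\nd\t\tf\ng\t\ti\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
patterns, footer = characterize(rows, width=3)
print(render(patterns, footer))
```

For the sample, one row matches the all-present pattern, two rows match the pattern with the middle cell missing, and the footer marks only the middle column as incomplete. A real geonames file may need different quoting settings for `csv.reader`.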
- Script: cr test conversion.sh uses this to gray out columns that are missing values, which helps design the enhancement parameters for geonames.