Characterizing table completeness

timrdf edited this page Sep 23, 2011 · 27 revisions
$ java edu.rpi.tw.data.csv.util.BinaryTable
usage: BinaryTable <file> [--comment-character char] [--header-line headerLineNumber] [--delimiter delimiter]
                          [--column-stop colNumber]
see https://github.com/timrdf/csv2rdf4lod-automation/wiki/Characterizing-table-completeness

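Reading the output: column numbers run along the top (last digit only, so column 10 prints as 0), each row is a presence pattern (.: value present, blank: value missing), the number of rows matching that pattern is on the right, and the completeness indication for each column runs along the bottom (|: all there, _: some missing).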

The following sample output comes from running the tool against the GeoNames US zip codes file. One of the things this says is "41,940 rows have values for all cells except cells 8, 9, and 12. Three rows have values for all cells except cells 6, 7, 8, and 9."

bash-3.2$ java edu.rpi.tw.data.csv.util.BinaryTable source/US.txt --header-line 0 --delimiter '\t'
123456789012
.....    ...| 3
.......  .. | 41940
..... .  .. | 1
... .    .. | 4
.......  ...| 147
... ...  .. | 10
.....    .. | 1408
......   .. | 84
... . .  .. | 5
|||_|____||_
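
The report above can be approximated in a few lines. The following is a minimal sketch in Python, not the BinaryTable implementation: the function names `completeness_report` and `render` are hypothetical, and it assumes that a cell counts as missing when it is empty or whitespace-only and that a column is marked `|` only when every row has a value there.

```python
from collections import Counter
import csv
import io


def completeness_report(rows, num_columns):
    """Tally presence patterns ('.' = value, ' ' = missing) and per-column completeness."""
    patterns = Counter()
    for row in rows:
        # Pad short rows so every pattern is exactly num_columns characters wide.
        cells = list(row) + [""] * (num_columns - len(row))
        pattern = "".join("." if cell.strip() else " " for cell in cells[:num_columns])
        patterns[pattern] += 1
    # Assumption: a column is complete ('|') only if no pattern has a gap there.
    footer = "".join(
        "|" if all(p[i] == "." for p in patterns) else "_"
        for i in range(num_columns)
    )
    return patterns, footer


def render(patterns, footer):
    """Lay the tallies out like the sample: header, one line per pattern, footer."""
    num_columns = len(footer)
    # The header shows only the last digit of each 1-based column number.
    lines = ["".join(str((i + 1) % 10) for i in range(num_columns))]
    for pattern, count in patterns.items():
        lines.append(f"{pattern}| {count}")
    lines.append(footer)
    return "\n".join(lines)


# Made-up, tab-delimited sample data with three columns.
sample = "a\tb\tc\nd\t\tf\ng\t\ti\nj\tk\t\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
patterns, footer = completeness_report(rows, 3)
print(render(patterns, footer))
```

Grouping rows by pattern rather than printing one line per row is what keeps the report short: the 43,000-odd rows in the zip code file collapse into nine distinct patterns.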

See also