Using the DBLP dataset (~490 million triples), the resulting file sizes are:
ls -alh
-rw-r----- 1 gihanson TRMCMES1 3.8G Feb 25 13:16 dblp.hdt
-rw------- 1 gihanson TRMCMES1 5.9G Apr 14 12:06 dblp-rust.hdt
-rw-r----- 1 gihanson TRMCMES1 21G Apr 10 17:03 dblp.ttl
Dumps from HDT crate
Rust
[DEBUG hdt::triples] Building wavelet matrix...
[DEBUG hdt::triples] Building OPS index...
[DEBUG hdt::triples] Built wavelet matrix with length 374826687
[DEBUG hdt::triples] built OPS index
[DEBUG hdt::hdt] HDT size in memory 6.7 GiB, details:
[DEBUG hdt::hdt] Hdt {
dict: FourSectDict {
shared: total size 1.3 GiB, 57601109 strings, sequence 13.3 MiB with 3600071 entries, 31 bits per entry, packed data 1.3 GiB,
subjects: total size 290.6 KiB, 41808 strings, sequence 6.1 KiB with 2614 entries, 19 bits per entry, packed data 284.6 KiB,
predicates: total size 1.2 KiB, 90 strings, sequence 16 B with 7 entries, 11 bits per entry, packed data 1.2 KiB,
objects: total size 1.4 GiB, 58079572 strings, sequence 13.4 MiB with 3629975 entries, 31 bits per entry, packed data 1.4 GiB,
},
triples: total size 4.0 GiB
adjlist_z AdjList {
sequence: 1.7 GiB with 504589029 entries, 29 bits per entry,
bitmap: 78.0 MiB,
}
op_index total size 1.8 GiB {
sequence: 1.7 GiB with 29 bits,
bitmap: 75.2 MiB
}
wavelet_y 410.5 MiB,
}
C++
[DEBUG hdt::triples] Building wavelet matrix...
[DEBUG hdt::triples] Building OPS index...
[DEBUG hdt::triples] Built wavelet matrix with length 374826687
[DEBUG hdt::triples] built OPS index
[DEBUG hdt::hdt] HDT size in memory 5.6 GiB, details:
[DEBUG hdt::hdt] Hdt {
dict: FourSectDict {
shared: total size 259.6 MiB, 57601109 strings, sequence 12.0 MiB with 3600071 entries, 28 bits per entry, packed data 247.6 MiB,
subjects: total size 290.6 KiB, 41808 strings, sequence 6.1 KiB with 2614 entries, 19 bits per entry, packed data 284.6 KiB,
predicates: total size 1.2 KiB, 90 strings, sequence 16 B with 7 entries, 11 bits per entry, packed data 1.2 KiB,
objects: total size 1.4 GiB, 58079572 strings, sequence 13.4 MiB with 3629975 entries, 31 bits per entry, packed data 1.4 GiB,
},
triples: total size 4.0 GiB
adjlist_z AdjList {
sequence: 1.7 GiB with 504589029 entries, 29 bits per entry,
bitmap: 78.0 MiB,
}
op_index total size 1.8 GiB {
sequence: 1.7 GiB with 29 bits,
bitmap: 75.2 MiB
}
wavelet_y 410.5 MiB,
}
Analysis
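This appears to be a bug in the shared dictionary section. Both dumps report the same 57601109 strings and 3600071 sequence entries, yet the packed data is 1.3 GiB in the Rust dump versus 247.6 MiB in the C++ dump: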
shared: total size 1.3 GiB, 57601109 strings, sequence 13.3 MiB with 3600071 entries, 31 bits per entry, packed data 1.3 GiB,
versus
shared: total size 259.6 MiB, 57601109 strings, sequence 12.0 MiB with 3600071 entries, 28 bits per entry, packed data 247.6 MiB,
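To put the two `shared:` lines on a common scale, the following is a minimal sketch that converts the reported sizes to bytes and derives the average packed bytes per string. It also checks how many bits a byte offset into the packed data would need, assuming (this is an assumption about what the log reports, not something stated above) that each sequence entry is an offset into the packed buffer. The helper names `to_bytes` and `summarize` are hypothetical, and the inputs are the rounded figures from the logs.

```python
import math

# Binary unit multipliers as used in the debug output.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}

# String count reported in both "shared:" lines.
STRINGS = 57_601_109


def to_bytes(size):
    """Convert a size string such as '247.6 MiB' to a byte count."""
    value, unit = size.split()
    return float(value) * UNITS[unit]


def summarize(packed):
    """Return (average packed bytes per string, bits needed per byte offset)."""
    n = to_bytes(packed)
    avg_bytes = n / STRINGS
    # Assumption: sequence entries are byte offsets into the packed data,
    # so the entry width is the number of bits needed to address n bytes.
    offset_bits = math.ceil(math.log2(n))
    return avg_bytes, offset_bits


if __name__ == "__main__":
    # Packed sizes copied from the Rust and C++ "shared:" lines.
    for label, packed in (("Rust", "1.3 GiB"), ("C++", "247.6 MiB")):
        avg_bytes, offset_bits = summarize(packed)
        print(f"{label}: {avg_bytes:.1f} bytes/string, {offset_bits} offset bits")
```

With the logged values this gives roughly 24 bytes per string for the Rust dump against roughly 4.5 for the C++ dump, about a fivefold difference. Under the offset assumption, the 31 versus 28 bits per entry would simply follow from the larger packed data rather than being a separate discrepancy.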