Rust HDT file sizes larger than C++ Version #1

@GregHanson

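Using the DBLP dataset (~490 million triples), the HDT file written by the Rust crate (dblp-rust.hdt) is considerably larger than the one written by the C++ implementation (dblp.hdt). Resulting file sizes: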

ls -alh
-rw-r-----  1 gihanson TRMCMES1 3.8G Feb 25 13:16 dblp.hdt
-rw-------  1 gihanson TRMCMES1 5.9G Apr 14 12:06 dblp-rust.hdt
-rw-r-----  1 gihanson TRMCMES1  21G Apr 10 17:03 dblp.ttl

Debug dumps from the HDT crate for each file

Rust

[DEBUG hdt::triples] Building wavelet matrix...
[DEBUG hdt::triples] Building OPS index...
[DEBUG hdt::triples] Built wavelet matrix with length 374826687
[DEBUG hdt::triples] built OPS index
[DEBUG hdt::hdt] HDT size in memory 6.7 GiB, details:
[DEBUG hdt::hdt] Hdt {
        dict: FourSectDict {
            shared: total size 1.3 GiB, 57601109 strings, sequence 13.3 MiB with 3600071 entries, 31 bits per entry, packed data 1.3 GiB,
            subjects: total size 290.6 KiB, 41808 strings, sequence 6.1 KiB with 2614 entries, 19 bits per entry, packed data 284.6 KiB,
            predicates: total size 1.2 KiB, 90 strings, sequence 16 B with 7 entries, 11 bits per entry, packed data 1.2 KiB,
            objects: total size 1.4 GiB, 58079572 strings, sequence 13.4 MiB with 3629975 entries, 31 bits per entry, packed data 1.4 GiB,
        },
        triples: total size 4.0 GiB
        adjlist_z AdjList {
            sequence: 1.7 GiB with 504589029 entries, 29 bits per entry,
            bitmap: 78.0 MiB,
        }
        op_index total size 1.8 GiB {
            sequence: 1.7 GiB with 29 bits,
            bitmap: 75.2 MiB
        }
        wavelet_y 410.5 MiB,
    }

C++

[DEBUG hdt::triples] Building wavelet matrix...
[DEBUG hdt::triples] Building OPS index...
[DEBUG hdt::triples] Built wavelet matrix with length 374826687
[DEBUG hdt::triples] built OPS index
[DEBUG hdt::hdt] HDT size in memory 5.6 GiB, details:
[DEBUG hdt::hdt] Hdt {
        dict: FourSectDict {
            shared: total size 259.6 MiB, 57601109 strings, sequence 12.0 MiB with 3600071 entries, 28 bits per entry, packed data 247.6 MiB,
            subjects: total size 290.6 KiB, 41808 strings, sequence 6.1 KiB with 2614 entries, 19 bits per entry, packed data 284.6 KiB,
            predicates: total size 1.2 KiB, 90 strings, sequence 16 B with 7 entries, 11 bits per entry, packed data 1.2 KiB,
            objects: total size 1.4 GiB, 58079572 strings, sequence 13.4 MiB with 3629975 entries, 31 bits per entry, packed data 1.4 GiB,
        },
        triples: total size 4.0 GiB
        adjlist_z AdjList {
            sequence: 1.7 GiB with 504589029 entries, 29 bits per entry,
            bitmap: 78.0 MiB,
        }
        op_index total size 1.8 GiB {
            sequence: 1.7 GiB with 29 bits,
            bitmap: 75.2 MiB
        }
        wavelet_y 410.5 MiB,
    }
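
The sequence sizes reported in both dumps can be reproduced from the entry count and the bits per entry. The sketch below is only a sanity check on the numbers above, not code from the hdt crate: `packed_sequence_bytes` is a hypothetical helper, and rounding up to whole 64-bit words is an assumption (it happens to match the 16 B reported for the 7-entry, 11-bit predicates sequence).

```rust
/// Hypothetical helper: bytes used by a bit-packed integer sequence,
/// assuming storage is rounded up to whole 64-bit words.
fn packed_sequence_bytes(entries: u64, bits_per_entry: u64) -> u64 {
    let total_bits = entries * bits_per_entry;
    let words = (total_bits + 63) / 64; // round up to a full word
    words * 8
}

fn main() {
    const MIB: f64 = 1024.0 * 1024.0;

    // Shared section offsets: same entry count, different bit widths.
    let rust_written = packed_sequence_bytes(3_600_071, 31);
    let cpp_written = packed_sequence_bytes(3_600_071, 28);

    println!("31 bits per entry: {:.1} MiB", rust_written as f64 / MIB);
    println!("28 bits per entry: {:.1} MiB", cpp_written as f64 / MIB);
    println!("predicates: {} B", packed_sequence_bytes(7, 11));
}
```

Under that assumption the two shared sequences come out at about 13.3 MiB and 12.0 MiB, so the sequence itself accounts for only a small part of the size difference.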

Analysis

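This appears to be a bug in how the shared dictionary section is written. The subjects, predicates, and objects sections and the triples are identical in both dumps, and both files hold the same 57601109 shared strings, yet the Rust-written file needs 1.3 GiB of packed data where the C++-written file needs 247.6 MiB: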

            shared: total size 1.3 GiB, 57601109 strings, sequence 13.3 MiB with 3600071 entries, 31 bits per entry, packed data 1.3 GiB,

versus

            shared: total size 259.6 MiB, 57601109 strings, sequence 12.0 MiB with 3600071 entries, 28 bits per entry, packed data 247.6 MiB,
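
To put the two lines in perspective, the following sketch works out the average packed bytes per string and the bit width needed to address the packed data. Both helpers are hypothetical and not part of the hdt crate. It also assumes that the sequence stores byte offsets into the packed data, so the bits per entry follow from the packed size. The 1.3 GiB figure is rounded in the dump, so the result is approximate.

```rust
const MIB: f64 = 1024.0 * 1024.0;
const GIB: f64 = 1024.0 * MIB;

/// Hypothetical helper: average packed bytes per dictionary string.
fn bytes_per_string(packed_bytes: f64, strings: u64) -> f64 {
    packed_bytes / strings as f64
}

/// Hypothetical helper: minimum number of bits needed to represent `max_value`.
fn bits_needed(max_value: u64) -> u32 {
    (u64::BITS - max_value.leading_zeros()).max(1)
}

fn main() {
    let strings = 57_601_109;

    // Packed data sizes of the shared section, taken from the two dumps.
    let rust_written = 1.3 * GIB;
    let cpp_written = 247.6 * MIB;

    println!(
        "Rust-written: {:.1} bytes per string, {} bits per offset",
        bytes_per_string(rust_written, strings),
        bits_needed(rust_written as u64)
    );
    println!(
        "C++-written:  {:.1} bytes per string, {} bits per offset",
        bytes_per_string(cpp_written, strings),
        bits_needed(cpp_written as u64)
    );
}
```

This gives roughly 24 bytes per string for the Rust-written file against about 4.5 for the C++-written one. The derived bit widths (31 and 28) match the dumps, which suggests the larger bits per entry is a consequence of the larger packed data rather than a separate problem.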
