You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
perf(spark): use 256-entry byte-pair table in hex encoding (#21836)
Refs #15986.
**Why:** `spark_hex` walked one nibble at a time — two `HEX_CHARS[i]`
lookups and two `Vec::push` calls per input byte. The hot loop flattens
into one indexed load and one `extend_from_slice` per byte with a
precomputed table.
**What changed:** added `HEX_LOOKUP_LOWER` / `HEX_LOOKUP_UPPER` as
`[[u8; 2]; 256]` const tables built at compile time. Bytes path now does
a single lookup + 2-byte extend per input byte. The int64 path consumes
two nibbles per iteration via the same table, with a fall-through for
the high nibble. Behaviour for `0`, `i64::MAX`, `i64::MIN`, `-1`
preserved.
**Tests:** extended `test_hex_int64` to cover edge values; new
`test_hex_lookup_table_covers_all_bytes` cross-checks every entry
against `format!("{:02X/x}")`; new
`test_spark_hex_binary_round_trip_all_bytes` feeds all 256 byte values
through `spark_hex` and verifies the result.
`cargo test -p datafusion-spark --lib hex` → 8 pass. `cargo clippy
--all-features --all-targets` clean. `cargo bench --no-run` builds —
existing `benches/hex.rs` already covers
Int64/Utf8/Utf8View/LargeUtf8/Binary/LargeBinary plus dict paths.
**Not in this PR:** the #15947 review also flagged Utf8View output and
dictionary-key reuse — those felt worth their own PRs to keep this
focused on the per-byte hot path.
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
0 commit comments