perf: fast-path inline strings in ByteViewGroupValueBuilder::vectorized_append #21794

EeshanBembi wants to merge 3 commits into apache:main from perf/bytes-view-v2
Conversation
perf: fast-path inline strings in ByteViewGroupValueBuilder::vectorized_append

When the input StringView/BinaryView array has no data buffers (all values ≤12 bytes, stored inline), skip the value() → make_view() round-trip in do_append_val_inner and instead copy the u128 views directly. Arrow guarantees valid arrays have zero-padded inline views, so the direct copy is semantically identical and lets the compiler vectorize the loop.

Also pre-reserve views capacity in the slow path (non-inline strings) to avoid repeated Vec reallocation.

Closes apache#21568
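For illustration, a minimal sketch of the fast path this commit describes (names are illustrative, not DataFusion's; `views` stands in for the builder's view buffer):

```rust
/// Copy inline u128 views verbatim for the given rows.
/// Assumes the source array has no data buffers, i.e. every value is
/// <= 12 bytes and stored inline in its view; Arrow's validity
/// invariant guarantees those views are zero-padded, so the copy is
/// equivalent to `value(row)` followed by `make_view(...)`.
fn append_inline_views(views: &mut Vec<u128>, src_views: &[u128], rows: &[usize]) {
    // A branch-free copy loop the compiler can auto-vectorize.
    views.extend(rows.iter().map(|&row| src_views[row]));
}
```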
```rust
// Pre-reserve views capacity to avoid repeated reallocation.
self.views.reserve(rows.len());
for &row in rows {
    self.do_append_val_inner(arr, row);
}
```
Dandandan (reviewer):
In principle we can make this faster as well: extend + reuse the input view (instead of `make_view`) + avoid `array.value(row)`.
Up to you - can be part of a follow-up PR.
EeshanBembi (author):
Thanks for the review! I can include that in this PR.
run benchmark aggregate_vectorized

run benchmarks

🤖 Criterion benchmark running (GKE): comparing perf/bytes-view-v2 (9621004) to 95157ef (merge-base).

Benchmark for this request failed. (Last 20 lines of runner output collapsed and not captured here.)

🤖 Benchmark running (GKE): comparing perf/bytes-view-v2 (9621004) to 95157ef (merge-base) using tpcds.

🤖 Benchmark running (GKE): comparing perf/bytes-view-v2 (9621004) to 95157ef (merge-base) using tpch.

🤖 Benchmark running (GKE): comparing perf/bytes-view-v2 (9621004) to 95157ef (merge-base) using clickbench_partitioned.

🤖 Benchmark completed (GKE): tpch, base (merge-base) vs. branch. (Result tables not captured here.)

🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch. (Result tables not captured here.)

🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch. (Result tables not captured here.)
Three follow-up changes addressing PR review and CI feedback on apache#21794:

1. Remove `required-features = ["test_utils"]` from the `aggregate_vectorized` bench target. The gate caused the bench to be silently skipped when CI ran `cargo bench --features=parquet --bench aggregate_vectorized`. The bench compiles fine in a workspace build where `arrow/test_utils` is already enabled via feature unification through `datafusion-physical-expr`.
2. Replace `arrow::util::test_util::seedable_rng()` with `StdRng::seed_from_u64(42)` from the already-present `rand` crate. This was the only import from `test_util` (as opposed to `bench_util`) and removes the last explicit dependency on the `test_utils` feature in the benchmark.
3. Optimize the slow path in `vectorized_append_inner` (non-inline strings, `!data_buffers().is_empty()`), addressing Dandandan's review comment. Instead of calling `do_append_val_inner`, which goes through `array.value(row)` (buffer lookup + slice construction) and then `make_view` (which re-reads the first 4 bytes to build the prefix), the new path reads `arr.views()[row]` directly and copies `src.prefix` from the source `ByteView`. Benchmarks show 41–53% improvement for 64-byte strings and 6–16% improvement for random-length strings (null_density=0.0).
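For illustration, a sketch of item 3's idea using Arrow's documented view layout (length and 4-byte prefix in the low 64 bits of the u128, buffer index and offset above them); the helper name and bit manipulation are hypothetical stand-ins for DataFusion's actual `ByteView`-based code:

```rust
/// Build the destination view for a non-inline value by reusing the
/// source view's length and prefix lanes, instead of re-reading the
/// payload's first 4 bytes via `make_view`.
fn repack_long_view(src: u128, dst_buffer_index: u32, dst_offset: u32) -> u128 {
    // Low 64 bits hold the length (bits 0-31) and prefix (bits 32-63);
    // both are unchanged when the value is copied into a new buffer.
    let length_and_prefix = src & 0xFFFF_FFFF_FFFF_FFFF;
    length_and_prefix
        | ((dst_buffer_index as u128) << 64)
        | ((dst_offset as u128) << 96)
}
```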
Which issue does this PR close?
Closes #21568.
Rationale for this change
`ByteViewGroupValueBuilder::vectorized_append` was doing unnecessary work for short strings (≤12 bytes): for each row it called `array.value(row)` to decode the u128 view into a `&[u8]`, then called `make_view` to re-encode it back into a u128. The input `GenericByteViewArray` already stores inline values in exactly that u128 format, so the round-trip is redundant.

This mirrors the existing `HAS_BUFFERS` specialisation in `vectorized_equal_to_inner`, which uses the same `data_buffers().is_empty()` guard to take a direct-view-compare fast path for inline strings.
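For illustration, a sketch of why that guard permits a direct view comparison (a simplification; the actual specialisation lives in `vectorized_equal_to_inner`):

```rust
/// When neither array has data buffers, every value is stored inline
/// and its view is zero-padded, so value equality reduces to bitwise
/// equality of the u128 views -- no payload dereference needed.
fn inline_views_equal(lhs_view: u128, rhs_view: u128) -> bool {
    lhs_view == rhs_view
}
```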
What changes are included in this PR?

In `vectorized_append_inner`, the `Nulls::None` branch now dispatches on `arr.data_buffers().is_empty()` (sketched below):

- Fast path (no data buffers, all values inline): `self.views.extend(rows.iter().map(|&row| arr.views()[row]))`. Arrow's validity invariant guarantees inline views are zero-padded, so the direct copy is semantically identical to `value() → make_view()`.
- Slow path (data buffers present): `self.views.reserve(rows.len())` before the existing loop, to avoid repeated reallocation.
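A sketch of that dispatch, simplified to a free function over a plain `Vec<u128>` rather than the real builder method:

```rust
use arrow::array::StringViewArray;

/// Simplified version of the `Nulls::None` dispatch: `views` stands in
/// for the builder's view buffer; `do_append_val_inner` is elided.
fn vectorized_append_no_nulls(views: &mut Vec<u128>, arr: &StringViewArray, rows: &[usize]) {
    if arr.data_buffers().is_empty() {
        // Fast path: all values are inline; copy the u128 views directly.
        views.extend(rows.iter().map(|&row| arr.views()[row]));
    } else {
        // Slow path: reserve once, then append row by row.
        views.reserve(rows.len());
        for &row in rows {
            // self.do_append_val_inner(arr, row) in the real builder;
            // omitted here because it depends on builder internals.
            let _ = row;
        }
    }
}
```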
Are these changes tested?

Covered by the existing 6 unit tests in `bytes_view::tests`, all passing unchanged. `test_byte_view_vectorized_operation_special_case` exercises the fast path directly (11-byte strings, no data buffers).

Are there any user-facing changes?
No. Internal performance improvement only.
Benchmark
`inline_null_0.0_size_1000/vectorized_append` (8-byte strings, no nulls, 1,000 rows):
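The result table itself was not captured above. For context, a hypothetical generator for that benchmark case's input (all names assumed; the follow-up commit notes the bench seeds `StdRng` with 42):

```rust
use arrow::array::StringViewArray;
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Build 1,000 8-byte ASCII strings with no nulls, matching the case's
/// name (`inline_null_0.0_size_1000`). Every value is <= 12 bytes, so
/// the resulting array should have no data buffers and exercise the
/// inline fast path.
fn make_inline_input() -> StringViewArray {
    let mut rng = StdRng::seed_from_u64(42);
    let values: Vec<String> = (0..1_000)
        .map(|_| (0..8).map(|_| rng.gen_range(b'a'..=b'z') as char).collect())
        .collect();
    StringViewArray::from_iter_values(values)
}
```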