feat: support GroupsAccumulator for first_value and last_value with string/binary types #21090
Conversation
```rust
/// to correctly implement `RESPECT NULLS` behavior.
///
pub(crate) struct BytesValueState {
    vals: Vec<Option<Vec<u8>>>,
```
I think this can be much more efficiently stored as values Vec<u8> and offsets Vec<OffsetType>
I plan to implement it using the following approach and then run some benchmarks:
Data Structures
- `vals: Vec<u8>`: a single, contiguous flat buffer for all raw bytes.
- `offsets: Vec<usize>`: the starting position of each group's data in the buffer.
- `lengths: Vec<usize>`: the logical length of the current value for each group.
- `capacities: Vec<usize>`: the physical space allocated for each group (enables in-place overwrites if `new_len <= capacity`).
- `active_bytes: usize`: a running counter of the sum of all current `lengths` (used to track fragmentation and trigger GC).
Update Logic
- In-place overwrite: if `new_len <= capacity`, we overwrite the existing slot at the current `offset`. We update the logical `length`, while `capacity` and `offset` remain unchanged.
- Append: if `new_len > capacity`, we append the value to the end of `vals` and update the `offset`, `length`, and `capacity` to point to the new location.
GC (Compaction) Logic
- Trigger: when the buffer grows too large (e.g., `vals.len() > active_bytes * 2`).
- Action: re-allocate a new buffer and copy only the latest valid data for each group, to clear "dead" bytes left behind by the append path.
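The structures and update/GC rules above could be sketched as follows. This is a minimal illustration under the stated design, with hypothetical names (`FlatBytesState`, `set_value`, `compact`) rather than the PR's actual code:

```rust
// Illustrative sketch of the flat-buffer layout described above.
// All names are hypothetical, not DataFusion APIs.
struct FlatBytesState {
    vals: Vec<u8>,          // contiguous buffer for all raw bytes
    offsets: Vec<usize>,    // start of each group's slot in `vals`
    lengths: Vec<usize>,    // logical length of each group's current value
    capacities: Vec<usize>, // physical slot size (allows in-place overwrite)
    active_bytes: usize,    // sum of `lengths`; tracks fragmentation for GC
}

impl FlatBytesState {
    fn new(num_groups: usize) -> Self {
        Self {
            vals: Vec::new(),
            offsets: vec![0; num_groups],
            lengths: vec![0; num_groups],
            capacities: vec![0; num_groups],
            active_bytes: 0,
        }
    }

    fn set_value(&mut self, group: usize, value: &[u8]) {
        self.active_bytes = self.active_bytes - self.lengths[group] + value.len();
        if value.len() <= self.capacities[group] {
            // In-place overwrite: reuse the existing slot; offset/capacity unchanged.
            let start = self.offsets[group];
            self.vals[start..start + value.len()].copy_from_slice(value);
            self.lengths[group] = value.len();
        } else {
            // Append path: the old slot becomes dead bytes.
            self.offsets[group] = self.vals.len();
            self.lengths[group] = value.len();
            self.capacities[group] = value.len();
            self.vals.extend_from_slice(value);
        }
        // GC trigger: more than half the buffer is dead bytes.
        if self.vals.len() > self.active_bytes * 2 {
            self.compact();
        }
    }

    // Copy only the latest valid data for each group into a fresh buffer.
    fn compact(&mut self) {
        let mut new_vals = Vec::with_capacity(self.active_bytes);
        for g in 0..self.offsets.len() {
            let (start, len) = (self.offsets[g], self.lengths[g]);
            self.offsets[g] = new_vals.len();
            self.capacities[g] = len;
            new_vals.extend_from_slice(&self.vals[start..start + len]);
        }
        self.vals = new_vals;
    }

    fn get(&self, group: usize) -> &[u8] {
        &self.vals[self.offsets[group]..self.offsets[group] + self.lengths[group]]
    }
}
```

The in-place path keeps hot updates allocation-free at the cost of dead bytes, which the compaction pass reclaims once they dominate the buffer.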
This is only needed for `last_value`, no?

Or wait, never mind. I see the queries have an explicit `ORDER BY`.
After implementing the flattened approach with Vec<u8>, the performance actually regressed in several scenarios. Below are the benchmark results:
| ID | SQL | first_val_str Time(s) | first_last_flatten_acc Time(s) | Performance Change | Note |
|---|---|---|---|---|---|
| 1 | `select t.id1, first_value(t.id3 order by t.id2, t.id4) as r2 from 'benchmarks/data/h2o/G1_1e8_1e8_100_0.parquet' as t group by t.id1, t.v1;` | 0.690 | 0.666 | +1.04x faster 🚀 | Length of `t.id3` is constant (12) per group |
| 2 | `select l_shipmode, first_value(l_partkey order by l_orderkey, l_linenumber, l_comment, l_suppkey, l_tax) from 'benchmarks/data/tpch_sf10/lineitem' group by l_shipmode;` | 0.724 | 0.776 | 1.07x slower 🐌 | |
| 3 | `select t.id2, t.id4, first_value(t.v1 order by t.id2, t.id4) as r2 from 'benchmarks/data/h2o/G1_1e8_1e8_100_0.parquet' as t group by t.id2, t.id4;` | 7.136 | 7.226 | 1.01x slower 🐌 | |
| 4 | `SELECT l_suppkey, FIRST_VALUE(l_comment ORDER BY l_orderkey DESC) as fv FROM 'benchmarks/data/tpch_sf10/lineitem' GROUP BY l_suppkey;` | 2.914 | 3.206 | 1.10x slower 🐌 | `l_comment` length varies (10-43) per group |
| 5 | `select t.id1, last_value(t.id3 order by t.id2, t.id4) as r2 from 'benchmarks/data/h2o/G1_1e8_1e8_100_0.parquet' as t group by t.id1, t.v1;` | 0.745 | 0.747 | 1.00x slower 🐌 | |
| 6 | `select l_shipmode, last_value(l_partkey order by l_orderkey, l_linenumber, l_comment, l_suppkey, l_tax) from 'benchmarks/data/tpch_sf10/lineitem' group by l_shipmode;` | 0.802 | 0.780 | +1.03x faster 🚀 | |
| 7 | `select t.id2, t.id4, last_value(t.v1 order by t.id2, t.id4) as r2 from 'benchmarks/data/h2o/G1_1e8_1e8_100_0.parquet' as t group by t.id2, t.id4;` | 7.256 | 7.277 | 1.00x slower 🐌 | |
Flame graphs indicate that the overhead from resize operations is the primary bottleneck.
For `first_value` / `last_value` queries without an `ORDER BY` clause, each group's value is only set once, since there is no explicit ordering requirement. In this scenario, it might be more efficient to implement a dedicated `ValueState` using `Vec<Vec<u8>>` to store strings/bytes, where each inner `Vec<u8>` has a fixed maximum size. We can pre-allocate this capacity during creation to avoid frequent resizing. Note that for these unordered queries, `last_value` can be implemented with the same logic as `first_value`.
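A minimal sketch of that write-once idea, assuming a hypothetical `WriteOnceBytesState` (not DataFusion's actual API), where `reserve` is the pre-allocated per-group capacity:

```rust
// Illustrative sketch for the no-ORDER-BY fast path described above.
// All names are hypothetical, not DataFusion APIs.
struct WriteOnceBytesState {
    vals: Vec<Option<Vec<u8>>>,
    reserve: usize, // per-group capacity pre-allocated on first write
}

impl WriteOnceBytesState {
    fn new(reserve: usize) -> Self {
        Self { vals: Vec::new(), reserve }
    }

    /// Grow the state to hold `total` groups.
    fn resize_groups(&mut self, total: usize) {
        self.vals.resize_with(total, || None);
    }

    /// With no ORDER BY, each group's value is set at most once, so
    /// first-write-wins suffices; `last_value` can share this logic.
    fn set_if_unset(&mut self, group: usize, value: &[u8]) {
        if self.vals[group].is_none() {
            // Pre-allocate up to `reserve` to avoid later resizing.
            let mut buf = Vec::with_capacity(self.reserve.max(value.len()));
            buf.extend_from_slice(value);
            self.vals[group] = Some(buf);
        }
    }
}
```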
Hmm yes, this is similar to the block-based aggregation storage @Rachelint was working on (resizes are heavy, especially for write-only data).
```rust
fn groups_accumulator_supported(&self, args: AccumulatorArgs) -> bool {
    use DataType::*;
    !args.order_bys.is_empty()
```
We might want to consider adding a new `GroupsAccumulator` to handle cases without `ORDER BY`, or perhaps implement a fast path in `FirstLastGroupsAccumulator` for when no ordering is required.
neilconway left a comment:
Would it be useful to add a self-contained benchmark to measure the performance change here?
```rust
fn update(&mut self, group_idx: usize, array: &ArrayRef, idx: usize) -> Result<()> {
    if array.is_null(idx) {
        self.vals[group_idx] = None;
```
Decrement `total_capacity` here? A unit test for the update-with-null case might be useful as well.
Thanks for the catch! I've updated the code to decrement `total_capacity` here and added a unit test for the update-with-null case, as suggested.
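For illustration, the decrement-on-null fix could look roughly like this free-standing sketch (the `set_null` helper and its signature are hypothetical, not the PR's actual code):

```rust
// Illustrative sketch: when a group's value is replaced by NULL, the
// accumulator's memory accounting must release the old buffer's capacity.
// `set_null` is a hypothetical helper, not a DataFusion API.
fn set_null(
    vals: &mut Vec<Option<Vec<u8>>>,
    total_capacity: &mut usize,
    group_idx: usize,
) {
    // `take()` clears the slot and yields the old value, if any.
    if let Some(old) = vals[group_idx].take() {
        *total_capacity -= old.capacity();
    }
}
```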
```rust
/// Note: While this is not a batch interface, it is not a performance bottleneck.
/// In heavy aggregation benchmarks, the overhead of this method is typically less than 1%.
///
/// Benchmarked queries with < 1% `update` overhead:
```
This seems like more detail than necessary?
Done. I've simplified this and removed the extra details to keep it concise.
```text
#################
# first_value on strings/binary with groups and ordering
```
Test `last_value` as well?
Good point. I've added a test case for `last_value` in `aggregate.slt` as well.
```diff
@@ -342,8 +422,7 @@ where
     // buffer for `get_filtered_min_of_each_group`
```
These comments should be updated.

Fixed. I've updated the comments to correctly reflect the current implementation.
```rust
        .unwrap()
}

fn create_groups_primitive_accumulator<T: ArrowPrimitiveType + Send>(
```
`create_groups_primitive_accumulator` and `create_groups_bytes_accumulator` are identical except for the `ValueState`; we could use a single function and pass in the value state as an argument?
Agreed. I've refactored these into a single function and passed the state in as an argument.
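A sketch of that refactor, with illustrative trait and struct shapes (not DataFusion's actual definitions): one generic constructor parameterized over the `ValueState`, replacing the two nearly identical functions.

```rust
// Illustrative sketch of a single generic constructor over ValueState.
// Trait methods, fields, and the i64 stand-in are assumptions, not
// DataFusion's actual definitions.
trait ValueState {
    /// Ensure the state has room for `total_num_groups` groups.
    fn resize(&mut self, total_num_groups: usize);
}

#[derive(Default)]
struct PrimitiveValueState {
    vals: Vec<Option<i64>>, // stand-in for generic primitive storage
}

impl ValueState for PrimitiveValueState {
    fn resize(&mut self, total_num_groups: usize) {
        self.vals.resize(total_num_groups, None);
    }
}

#[derive(Default)]
struct BytesValueState {
    vals: Vec<Option<Vec<u8>>>,
}

impl ValueState for BytesValueState {
    fn resize(&mut self, total_num_groups: usize) {
        self.vals.resize_with(total_num_groups, || None);
    }
}

struct FirstLastGroupsAccumulator<S: ValueState> {
    state: S,
    is_first: bool, // true for first_value, false for last_value
}

/// One constructor for both state types; the caller passes the state in.
fn create_groups_accumulator<S: ValueState>(
    state: S,
    is_first: bool,
) -> FirstLastGroupsAccumulator<S> {
    FirstLastGroupsAccumulator { state, is_first }
}
```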
```sql
-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "RegionID", "UserAgent", "OS", AVG(to_timestamp("ResponseEndTiming")-to_timestamp("ResponseStartTiming")) as avg_response_time, AVG(to_timestamp("ResponseEndTiming")-to_timestamp("ConnectTiming")) as avg_latency FROM hits GROUP BY "RegionID", "UserAgent", "OS" ORDER BY avg_latency DESC limit 10;
```
I added

Hmm, I think it is more typical to either add a new criterion-based benchmark or extend an existing one to include a benchmark for

I believe

What do you think?
run benchmark clickbench_extended |
🤖 Benchmark running (GKE). Comparing first_val_group_acc_string (a29ef7d) to ccaf802 (merge-base) using: clickbench_extended.
🤖 Benchmark completed (GKE). Resource usage reported for clickbench_extended: base (merge-base) vs. branch.
It looks like the benchmark SQL from the main branch is being used, instead of the benchmark SQL from this branch.
| Query | `FIRST_VALUE` Column | Column Type | Group By Column | Group By Type | Number of Groups |
|-------|----------------------|-------------|-----------------|---------------|------------------|
| Q9 | `URL` | `Utf8` | `UserID` | `Int64` | 17,630,976 |
| Q10 | `URL` | `Utf8` | `OS` | `Int16` | 91 |
This table formatting looks a bit off
```rust
self.vals[group_idx]
    .as_ref()
    .inspect(|x| self.total_capacity += x.capacity());
```
I think this would be more idiomatic:

```rust
if let Some(v) = &self.vals[group_idx] {
    self.total_capacity += v.capacity();
}
```
Maybe we can clean this up in a follow-on PR?

I'll clean this up when I address #21090 (comment).
Yes, if you want to compare this branch with our scripts, you need to make a PR with just the benchmarks that we can merge to main first.
Thanks @UBarney @Dandandan and @neilconway -- this is great.
I added a function-level microbenchmark (not a SQL benchmark, but it could still be useful) and a few more improvements in #21383.
…tring/binary types (apache#21090)

## Which issue does this PR close?

- Closes apache#17899.

## Rationale for this change

```text
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃ main        ┃ first_val_group_acc ┃ Change        ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │ 872.01 ms   │ 904.35 ms           │ no change     │
│ QQuery 1  │ 156.37 ms   │ 164.83 ms           │ 1.05x slower  │
│ QQuery 2  │ 448.58 ms   │ 497.05 ms           │ 1.11x slower  │
│ QQuery 3  │ 233.99 ms   │ 274.10 ms           │ 1.17x slower  │
│ QQuery 4  │ 1448.99 ms  │ 1556.95 ms          │ 1.07x slower  │
│ QQuery 5  │ 10816.83 ms │ 11315.69 ms         │ no change     │
│ QQuery 6  │ 2053.16 ms  │ 2030.02 ms          │ no change     │
│ QQuery 7  │ 2154.74 ms  │ 2274.63 ms          │ 1.06x slower  │
│ QQuery 8  │ 405.62 ms   │ 405.72 ms           │ no change     │
│ QQuery 9  │ 17160.65 ms │ 4167.34 ms          │ +4.12x faster │
│ QQuery 10 │ 1206.03 ms  │ 1090.69 ms          │ +1.11x faster │
│ QQuery 11 │ 2437.12 ms  │ 2446.51 ms          │ no change     │
│ QQuery 12 │ 331.27 ms   │ 317.73 ms           │ no change     │
└───────────┴─────────────┴─────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                  │ 39725.35ms │
│ Total Time (first_val_group_acc)   │ 27445.62ms │
│ Average Time (main)                │ 3055.80ms  │
│ Average Time (first_val_group_acc) │ 2111.20ms  │
│ Queries Faster                     │ 2          │
│ Queries Slower                     │ 5          │
│ Queries with No Change             │ 6          │
│ Queries with Failure               │ 0          │
└────────────────────────────────────┴────────────┘
```

Previously, the `first_value` and `last_value` aggregate functions only supported GroupsAccumulator for primitive types. For string or binary types (Utf8, LargeUtf8, Binary, etc.), they fell back to the slower row-based Accumulator path. This change implements specialized state management for byte-based types, enabling high-performance grouped aggregation for strings and binary data, especially when used with `ORDER BY`.

## What changes are included in this PR?

- New `ValueState` trait: abstracts the state management for `first_value` and `last_value` to support different storage backends.
- `PrimitiveValueState`: re-implements the existing primitive handling using the new trait.
- `BytesValueState`: adds a new state implementation for Utf8, LargeUtf8, Utf8View, Binary, LargeBinary, and BinaryView. It optimizes memory by reusing `Vec<u8>` buffers for group updates.
- Refactored `FirstLastGroupsAccumulator`: migrates the accumulator to use the generic `ValueState` trait, allowing it to handle both primitive and byte types uniformly.

## Are these changes tested?

Yes.

## Are there any user-facing changes?

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>