feat: support GroupsAccumulator for first_value and last_value with string/binary types (#21090)

UBarney · alamb · web-flow · commit 587f4c0c7435 · 2026-04-04T10:00:34.000Z
## Which issue does this PR close?

- Closes #17899.

## Rationale for this change

```
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        main ┃ first_val_group_acc ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │   872.01 ms │           904.35 ms │     no change │
│ QQuery 1  │   156.37 ms │           164.83 ms │  1.05x slower │
│ QQuery 2  │   448.58 ms │           497.05 ms │  1.11x slower │
│ QQuery 3  │   233.99 ms │           274.10 ms │  1.17x slower │
│ QQuery 4  │  1448.99 ms │          1556.95 ms │  1.07x slower │
│ QQuery 5  │ 10816.83 ms │         11315.69 ms │     no change │
│ QQuery 6  │  2053.16 ms │          2030.02 ms │     no change │
│ QQuery 7  │  2154.74 ms │          2274.63 ms │  1.06x slower │
│ QQuery 8  │   405.62 ms │           405.72 ms │     no change │
│ QQuery 9  │ 17160.65 ms │          4167.34 ms │ +4.12x faster │
│ QQuery 10 │  1206.03 ms │          1090.69 ms │ +1.11x faster │
│ QQuery 11 │  2437.12 ms │          2446.51 ms │     no change │
│ QQuery 12 │   331.27 ms │           317.73 ms │     no change │
└───────────┴─────────────┴─────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                  │ 39725.35ms │
│ Total Time (first_val_group_acc)   │ 27445.62ms │
│ Average Time (main)                │  3055.80ms │
│ Average Time (first_val_group_acc) │  2111.20ms │
│ Queries Faster                     │          2 │
│ Queries Slower                     │          5 │
│ Queries with No Change             │          6 │
│ Queries with Failure               │          0 │
└────────────────────────────────────┴────────────┘
```

Previously, the `first_value` and `last_value` aggregate functions supported `GroupsAccumulator` only for primitive types. For string and binary types (`Utf8`, `LargeUtf8`, `Binary`, etc.), they fell back to the slower row-based `Accumulator` path. This change implements specialized state management for byte-based types, enabling high-performance grouped aggregation over string and binary data, especially when used with `ORDER BY`.

## What changes are included in this PR?

- New `ValueState` trait: abstracts the state management for `first_value` and `last_value` so that different storage backends can be supported.
- `PrimitiveValueState`: re-implements the existing primitive handling on top of the new trait.
- `BytesValueState`: a new state implementation for `Utf8`, `LargeUtf8`, `Utf8View`, `Binary`, `LargeBinary`, and `BinaryView`. It optimizes memory use by reusing `Vec<u8>` buffers for group updates.
- Refactored `FirstLastGroupsAccumulator`: migrated the accumulator to the generic `ValueState` trait, allowing it to handle primitive and byte types uniformly.

## Are these changes tested?

Yes

## Are there any user-facing changes?

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
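
The trait-based design described in this message can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from the PR: the method names (`resize`, `set`, `get`), the `PrimitiveState`, `BytesState`, and `FirstValue` types, and the plain `i64` ordering key are all hypothetical simplifications, and no Arrow arrays are involved. It only shows how one accumulator can drive either a fixed-width or a byte-buffer backend, and how a per-group `Vec<u8>` can be overwritten in place.

```rust
/// Hypothetical, simplified stand-in for the PR's `ValueState` trait:
/// one storage slot per group behind a common interface.
trait ValueState {
    type Value: ?Sized;
    /// Ensure slots exist for groups `0..num_groups`.
    fn resize(&mut self, num_groups: usize);
    /// Overwrite the slot for `group` (`None` models SQL NULL).
    fn set(&mut self, group: usize, value: Option<&Self::Value>);
    /// Read the slot for `group`; `None` if NULL or never set.
    fn get(&self, group: usize) -> Option<&Self::Value>;
}

/// Fixed-width backend: one `Option<T>` per group.
#[derive(Default)]
struct PrimitiveState<T> {
    vals: Vec<Option<T>>,
}

impl<T: Copy> ValueState for PrimitiveState<T> {
    type Value = T;
    fn resize(&mut self, n: usize) {
        if n > self.vals.len() {
            self.vals.resize(n, None);
        }
    }
    fn set(&mut self, group: usize, value: Option<&T>) {
        self.vals[group] = value.copied();
    }
    fn get(&self, group: usize) -> Option<&T> {
        self.vals.get(group).and_then(|v| v.as_ref())
    }
}

/// Byte backend: one reusable `Vec<u8>` per group plus a validity flag.
#[derive(Default)]
struct BytesState {
    bufs: Vec<Vec<u8>>,
    valid: Vec<bool>,
}

impl ValueState for BytesState {
    type Value = [u8];
    fn resize(&mut self, n: usize) {
        if n > self.bufs.len() {
            self.bufs.resize(n, Vec::new());
            self.valid.resize(n, false);
        }
    }
    fn set(&mut self, group: usize, value: Option<&[u8]>) {
        match value {
            Some(bytes) => {
                // `clear` keeps the allocation, so a later value that fits
                // in the existing capacity is copied without reallocating.
                let buf = &mut self.bufs[group];
                buf.clear();
                buf.extend_from_slice(bytes);
                self.valid[group] = true;
            }
            None => self.valid[group] = false,
        }
    }
    fn get(&self, group: usize) -> Option<&[u8]> {
        if self.valid.get(group).copied().unwrap_or(false) {
            Some(self.bufs[group].as_slice())
        } else {
            None
        }
    }
}

/// Hypothetical grouped `first_value(... ORDER BY key)`: keeps, per group,
/// the value whose ordering key is smallest so far.
struct FirstValue<S: ValueState> {
    state: S,
    best_key: Vec<Option<i64>>,
}

impl<S: ValueState> FirstValue<S> {
    fn new(state: S) -> Self {
        Self { state, best_key: Vec::new() }
    }
    fn update(&mut self, group: usize, key: i64, value: Option<&S::Value>) {
        if group >= self.best_key.len() {
            self.best_key.resize(group + 1, None);
            self.state.resize(group + 1);
        }
        let replace = match self.best_key[group] {
            None => true,
            Some(best) => key < best,
        };
        if replace {
            self.best_key[group] = Some(key);
            self.state.set(group, value);
        }
    }
    fn get(&self, group: usize) -> Option<&S::Value> {
        self.state.get(group)
    }
}

/// Feeds the same rows through both backends and returns group 0's results.
fn demo() -> (Option<String>, Option<i64>) {
    // Rows of (group, order key, value); group 0 sees its later row first.
    let rows = [(0, 30, "https://example.com/late"), (1, 5, "b"), (0, 10, "a")];
    let mut urls = FirstValue::new(BytesState::default());
    let mut ids = FirstValue::new(PrimitiveState::<i64>::default());
    for (group, key, url) in rows {
        urls.update(group, key, Some(url.as_bytes()));
        ids.update(group, key, Some(&(key * 100)));
    }
    let url = urls.get(0).map(|b| String::from_utf8_lossy(b).into_owned());
    (url, ids.get(0).copied())
}

fn main() {
    let (url, id) = demo();
    println!("group 0: url={:?} id={:?}", url, id);
}
```

In this sketch the accumulator logic is written once against the trait, and only the storage strategy differs between backends, which mirrors the uniform handling of primitive and byte types that the refactoring describes.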
diff --git a/benchmarks/queries/clickbench/README.md b/benchmarks/queries/clickbench/README.md
@@ -228,6 +228,22 @@ Results look like
 Elapsed 30.195 seconds.
 ```
 
+
+### Q9-Q12: FIRST_VALUE Aggregation Performance
+
+These queries test the performance of the `FIRST_VALUE` aggregation function with different data types and grouping cardinalities.
+
+| Query | `FIRST_VALUE` Column | Column Type | Group By Column | Group By Type | Number of Groups |
+|-------|----------------------|-------------|-----------------|---------------|------------------|
+| Q9    | `URL`                | `Utf8`      | `UserID`        | `Int64`       | 17,630,976       |
+| Q10   | `URL`                | `Utf8`      | `OS`            | `Int16`       | 91               |
+| Q11   | `WatchID`            | `Int64`     | `UserID`        | `Int64`       | 17,630,976       |
+| Q12   | `WatchID`            | `Int64`     | `OS`            | `Int16`       | 91               |
+
+
+
+
+
 ## Data Notes
 
 Here are some interesting statistics about the data used in the queries
diff --git a/benchmarks/queries/clickbench/extended/q10.sql b/benchmarks/queries/clickbench/extended/q10.sql
@@ -0,0 +1,8 @@
+-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
+-- set datafusion.execution.parquet.binary_as_string = true
+
+SELECT MAX(len) FROM (
+    SELECT LENGTH(FIRST_VALUE("URL" ORDER BY "EventTime")) as len
+    FROM hits
+    GROUP BY "OS"
+);
diff --git a/benchmarks/queries/clickbench/extended/q11.sql b/benchmarks/queries/clickbench/extended/q11.sql
@@ -0,0 +1,8 @@
+-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
+-- set datafusion.execution.parquet.binary_as_string = true
+
+SELECT MAX(fv) FROM (
+    SELECT FIRST_VALUE("WatchID" ORDER BY "EventTime") as fv
+    FROM hits
+    GROUP BY "UserID"
+);
diff --git a/benchmarks/queries/clickbench/extended/q12.sql b/benchmarks/queries/clickbench/extended/q12.sql
@@ -0,0 +1,8 @@
+-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
+-- set datafusion.execution.parquet.binary_as_string = true
+
+SELECT MAX(fv) FROM (
+    SELECT FIRST_VALUE("WatchID" ORDER BY "EventTime") as fv
+    FROM hits
+    GROUP BY "OS"
+);
diff --git a/benchmarks/queries/clickbench/extended/q8.sql b/benchmarks/queries/clickbench/extended/q8.sql
@@ -0,0 +1,4 @@
+-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
+-- set datafusion.execution.parquet.binary_as_string = true
+
+SELECT "RegionID", "UserAgent", "OS", AVG(to_timestamp("ResponseEndTiming")-to_timestamp("ResponseStartTiming")) as avg_response_time, AVG(to_timestamp("ResponseEndTiming")-to_timestamp("ConnectTiming")) as avg_latency FROM hits GROUP BY "RegionID", "UserAgent", "OS" ORDER BY avg_latency DESC limit 10;
diff --git a/benchmarks/queries/clickbench/extended/q9.sql b/benchmarks/queries/clickbench/extended/q9.sql
@@ -0,0 +1,8 @@
+-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
+-- set datafusion.execution.parquet.binary_as_string = true
+
+SELECT MAX(len) FROM (
+    SELECT LENGTH(FIRST_VALUE("URL" ORDER BY "EventTime")) as len
+    FROM hits
+    GROUP BY "UserID"
+);
diff --git a/datafusion/functions-aggregate/src/first_last.rs b/datafusion/functions-aggregate/src/first_last.rs
diff --git a/datafusion/functions-aggregate/src/first_last/state.rs b/datafusion/functions-aggregate/src/first_last/state.rs
diff --git a/datafusion/sqllogictest/test_files/aggregate.slt b/datafusion/sqllogictest/test_files/aggregate.slt