perf: Optimize `substr` for Utf8, LargeUtf8 by neilconway · Pull Request #21366 · apache/datafusion

neilconway · 2026-04-04T17:42:33Z

Which issue does this PR close?

Closes #21364 (Optimize substr() to avoid copying for Utf8, LargeUtf8).

Rationale for this change

For `Utf8` and `LargeUtf8` inputs, we can optimize `substr` to avoid copying the output strings; instead, we can return a `StringViewArray` that points into the input value buffer.
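To make the idea concrete, here is a minimal, standard-library-only sketch of how a substring can be described by a 16-byte view into an existing value buffer rather than by a copied string. It is not the code from this PR: `substr_byte_range` and `make_view` are hypothetical helpers, the view encoding follows my reading of Arrow's binary view layout (4-byte length, then either up to 12 inline bytes or a 4-byte prefix, buffer index, and offset), and 32-bit unsigned fields are an assumption.

```rust
/// Byte range `(offset, len)` within `s` for SQL-style `substr(s, start, count)`.
/// `start` is 1-based and counted in characters; a negative `count` is
/// treated as 0 here for simplicity (an assumption of this sketch).
fn substr_byte_range(s: &str, start: i64, count: Option<i64>) -> (usize, usize) {
    // 0-based index of the first character that is kept.
    let first = start.max(1) - 1;
    // 0-based, exclusive index of the last character that is kept.
    let end = match count {
        Some(c) => start.saturating_add(c.max(0)).max(1) - 1,
        None => i64::MAX,
    };
    // Map a character index to a byte offset, clamping to the end of `s`.
    let byte_at = |char_idx: i64| -> usize {
        let n = usize::try_from(char_idx).unwrap_or(usize::MAX);
        s.char_indices().nth(n).map_or(s.len(), |(b, _)| b)
    };
    let begin = byte_at(first);
    let finish = byte_at(end.max(first));
    (begin, finish - begin)
}

/// Largest string that is stored inline in a view (assumed layout).
const MAX_INLINE: usize = 12;

/// Encode a view of `values[offset..offset + len]` as a little-endian u128.
/// Returns `None` if the range is out of bounds or does not fit in 32 bits.
fn make_view(values: &[u8], buffer_index: u32, offset: usize, len: usize) -> Option<u128> {
    let bytes = values.get(offset..offset.checked_add(len)?)?;
    let len32 = u32::try_from(len).ok()?;
    let mut raw = [0u8; 16];
    raw[0..4].copy_from_slice(&len32.to_le_bytes());
    if len <= MAX_INLINE {
        // Short strings live entirely inside the view.
        raw[4..4 + len].copy_from_slice(bytes);
    } else {
        // Long strings store a 4-byte prefix plus a pointer into the buffer.
        let offset32 = u32::try_from(offset).ok()?;
        raw[4..8].copy_from_slice(&bytes[..4]);
        raw[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        raw[12..16].copy_from_slice(&offset32.to_le_bytes());
    }
    Some(u128::from_le_bytes(raw))
}
```

In this sketch, a result longer than 12 bytes costs only 16 bytes of view data regardless of its length, because the view refers back to the input buffer. Shorter results are inlined into the view itself.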

Benchmarks (M4 Max):

no count, short strings (size=1024):
- string_view: 5.97 µs -> 5.96 µs (-0.2%)
- string: 7.80 µs -> 4.99 µs (-36.1%)
- large_string: 8.47 µs -> 4.90 µs (-42.2%)

no count, short strings (size=4096):
- string_view: 23.10 µs -> 22.90 µs (-0.9%)
- string: 31.24 µs -> 18.31 µs (-41.4%)
- large_string: 34.10 µs -> 17.70 µs (-48.1%)

with count, long strings (size=1024, count=64, strlen=128):
- string_view: 10.16 µs -> 10.79 µs (+6.2%)
- string: 11.90 µs -> 8.38 µs (-29.6%)
- large_string: 11.93 µs -> 8.30 µs (-30.5%)

with count, long strings (size=4096, count=64, strlen=128):
- string_view: 39.37 µs -> 38.79 µs (-1.5%)
- string: 46.22 µs -> 30.25 µs (-34.6%)
- large_string: 46.57 µs -> 30.49 µs (-34.5%)

short count, long strings (size=1024, count=6, strlen=128):
- string_view: 11.65 µs -> 11.57 µs (-0.7%)
- string: 14.97 µs -> 11.37 µs (-24.1%)
- large_string: 14.92 µs -> 11.37 µs (-23.8%)

short count, long strings (size=4096, count=6, strlen=128):
- string_view: 45.88 µs -> 43.82 µs (-4.5%)
- string: 58.38 µs -> 43.55 µs (-25.4%)
- large_string: 58.59 µs -> 43.58 µs (-25.6%)

scalar start, no count, short strings (size=1024, strlen=12):
- string_view: 6.07 µs -> 6.10 µs (+0.5%)
- string: 7.81 µs -> 5.06 µs (-35.2%)

scalar start, no count, short strings (size=4096, strlen=12):
- string_view: 23.08 µs -> 22.62 µs (-2.0%)
- string: 31.07 µs -> 18.86 µs (-39.3%)

scalar start, no count, long strings (size=1024, strlen=128):
- string_view: 9.99 µs -> 10.65 µs (+6.6%)
- string: 12.01 µs -> 8.17 µs (-32.0%)

scalar start, no count, long strings (size=4096, strlen=128):
- string_view: 38.57 µs -> 39.79 µs (+3.2%)
- string: 46.83 µs -> 31.67 µs (-32.4%)

scalar start=1, no count, long strings (size=1024, strlen=128):
- string_view: 9.78 µs -> 10.48 µs (+7.2%)
- string: 12.02 µs -> 8.16 µs (-32.1%)

scalar start=1, no count, long strings (size=4096, strlen=128):
- string_view: 38.54 µs -> 40.18 µs (+4.3%)
- string: 46.36 µs -> 31.73 µs (-31.6%)

scalar args, short strings (size=1024, count=6, strlen=12):
- string_view: 11.30 µs -> 11.23 µs (-0.7%)
- string: 15.04 µs -> 11.52 µs (-23.4%)

scalar args, short strings (size=4096, count=6, strlen=12):
- string_view: 44.34 µs -> 43.98 µs (-0.8%)
- string: 59.63 µs -> 45.02 µs (-24.5%)

scalar args, long strings (size=1024, count=64, strlen=128):
- string_view: 10.51 µs -> 12.05 µs (+14.6%)
- string: 12.21 µs -> 8.67 µs (-28.9%)
- large_string: 12.20 µs -> 8.66 µs (-29.0%)

scalar args, long strings (size=4096, count=64, strlen=128):
- string_view: 40.13 µs -> 41.89 µs (+4.4%)
- string: 46.96 µs -> 32.44 µs (-30.9%)
- large_string: 47.24 µs -> 32.49 µs (-31.2%)

This PR doesn't modify the `string_view` code path; I've included the benchmark results above for completeness, but any changes should just be benchmarking noise.

What changes are included in this PR?

* Implement optimization
* Other minor code cleanup
* Add a benchmark (only somewhat related to this optimization but related to future optimization work)

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

coderfender

Left minor comments. Thank you @neilconway

neilconway · 2026-04-08T14:50:31Z

@coderfender Thanks for the review! Please let me know if you have any more feedback.

coderfender

Left some comments (in the previous review cycle), but those could be good follow-ups.

neilconway · 2026-04-16T15:40:56Z

@Jefffrey Thanks for the review! I addressed your comments.

## Which issue does this PR close?

- Closes #21441.

## Rationale for this change

This PR makes two distinct optimizations to the `left` and `right` builtin UDFs:

1. The `left` and `right` built-in UDFs have a zero-copy path for `Utf8View` input, but they always copy for `Utf8` and `LargeUtf8` inputs. If we make these functions always return `Utf8View`, we can add a zero-copy path for `Utf8` and `LargeUtf8` inputs as well. We can't take this path when the largest offset in the input string array is > 4GB, but that is rare. This follows the recent optimization for `substr` (#21366).
2. In the code path that handles `Utf8View` input, we were constructing the return value via `StringViewArray::try_new`, which does some fairly expensive validation. We know the return value is correct by construction, so we can use `StringViewArray::new_unchecked` instead.

Benchmarks (ARM64):

```
- left/string short_result: 179.6µs → 127.1µs (-29.2%)
- left/string long_result: 324.3µs → 262.2µs (-19.1%)
- left/string_view short_result: 220.9µs → 122.5µs (-44.5%)
- left/string_view long_result: 383.1µs → 212.0µs (-44.7%)
- right/string short_result: 180.4µs → 126.0µs (-30.2%)
- right/string long_result: 392.0µs → 343.9µs (-12.3%)
- right/string_view short_result: 228.7µs → 125.3µs (-45.2%)
- right/string_view long_result: 393.6µs → 238.0µs (-39.5%)
```

## What changes are included in this PR?

* Update benchmarks to measure both inline and out-of-line string results
* Change `left` and `right` return types to be `Utf8View`
* Optimize `left` and `right` string array path to do zero-copy when possible
* Optimize `left` and `right` string view path, and refactor it to be more similar to the array path
* Add more SLT tests to cover modified code paths
* Update various test expectations to reflect the new return type

## Are these changes tested?

Yes; benchmarked and new tests added.

## Are there any user-facing changes?

The return type of these functions has changed. This shouldn't typically break any user logic, although it might result in the planner inserting or removing casts for downstream operators, and the performance of downstream operators might either be better or worse, depending on whether the downstream code is better suited for `Utf8` or `Utf8View` string representations.
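The 4GB restriction mentioned in the rationale can be illustrated with a small sketch. This is not the PR's code: `offsets_fit_in_view` is a hypothetical helper, and it assumes that a view stores its buffer offset as an unsigned 32-bit integer, which is what the "> 4GB" limit suggests. Because the offsets of a `Utf8` or `LargeUtf8` array are non-decreasing, checking the last offset is enough to bound all of them.

```rust
/// Returns true if every offset in a (non-decreasing) offsets buffer can be
/// stored in a 32-bit unsigned view offset, so a zero-copy path is possible.
/// `i64` offsets correspond to `LargeUtf8`; `Utf8` uses `i32` offsets, which
/// always fit.
fn offsets_fit_in_view(offsets: &[i64]) -> bool {
    match offsets.last() {
        // Offsets are sorted, so the last one is the largest.
        Some(&last) => u32::try_from(last).is_ok(),
        // An empty offsets buffer has nothing that could overflow.
        None => true,
    }
}
```

A caller would presumably run this check once per input array and fall back to the copying path when it returns `false`.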

Jefffrey · 2026-04-17T07:01:01Z

Thanks @neilconway & @coderfender


neilconway added 2 commits April 4, 2026 13:33

.

b4e794d

.

b5fbbdd

github-actions bot added the functions (Changes to functions implementation) label Apr 4, 2026

Tweak test case

a44aac8

coderfender reviewed Apr 7, 2026

Comment thread datafusion/functions/src/unicode/substr.rs

coderfender reviewed Apr 7, 2026

Comment thread datafusion/functions/src/unicode/substr.rs Outdated

Comment thread datafusion/functions/src/unicode/substr.rs

Comment thread datafusion/functions/src/unicode/substr.rs Outdated

neilconway mentioned this pull request Apr 7, 2026

perf: Optimize left, right to reduce copying #21442

Merged

coderfender approved these changes Apr 8, 2026

neilconway added 3 commits April 9, 2026 20:34

Merge remote-tracking branch 'origin/main' into neilc/optimize-substr-zerocopy

7b58aa1

Minor fixes

e04f394

cargo fmt

89cf079

Jefffrey reviewed Apr 16, 2026

Comment thread datafusion/functions/src/unicode/substr.rs Outdated

Comment thread datafusion/functions/src/unicode/substr.rs Outdated

Address review comments

218605e

Jefffrey approved these changes Apr 17, 2026

Jefffrey added this pull request to the merge queue Apr 17, 2026

Merged via the queue into apache:main with commit 5a427cb Apr 17, 2026
53 of 54 checks passed

neilconway deleted the neilc/optimize-substr-zerocopy branch April 18, 2026 12:19

Jefffrey merged 7 commits into apache:main from neilconway:neilc/optimize-substr-zerocopy
