perf: Optimize substr for Utf8, LargeUtf8 #21366
Merged
Jefffrey merged 7 commits into apache:main on Apr 17, 2026
coderfender
reviewed
Apr 7, 2026
Contributor
coderfender
left a comment
Left minor comments. Thank you, @neilconway.
Contributor
Author
@coderfender Thanks for the review! Please let me know if you have any more feedback.
coderfender
approved these changes
Apr 8, 2026
Contributor
coderfender
left a comment
Left some comments (in the previous cycle), but those could be good follow-ups.
Jefffrey
reviewed
Apr 16, 2026
Contributor
Author
@Jefffrey Thanks for the review! I addressed your comments.
Jefffrey
approved these changes
Apr 17, 2026
github-merge-queue Bot
pushed a commit
that referenced
this pull request
Apr 17, 2026
## Which issue does this PR close?

- Closes #21441.

## Rationale for this change

This PR makes two distinct optimizations to the `left` and `right` built-in UDFs:

1. The `left` and `right` built-in UDFs have a zero-copy path for `Utf8View` input, but they always copy for `Utf8` and `LargeUtf8` inputs. If we make these functions always return `Utf8View`, we can add a zero-copy path for `Utf8` and `LargeUtf8` inputs as well. We can't take this path when the largest offset in the input string array is > 4GB, but that is rare. This follows the recent optimization for `substr` (#21366).
2. In the code path that handles `Utf8View` input, we were constructing the return value via `StringViewArray::try_new`, which does some fairly expensive validation. We know the return value is correct by construction, so we can use `StringViewArray::new_unchecked` instead.

Benchmarks (ARM64):

```
- left/string short_result: 179.6µs → 127.1µs (-29.2%)
- left/string long_result: 324.3µs → 262.2µs (-19.1%)
- left/string_view short_result: 220.9µs → 122.5µs (-44.5%)
- left/string_view long_result: 383.1µs → 212.0µs (-44.7%)
- right/string short_result: 180.4µs → 126.0µs (-30.2%)
- right/string long_result: 392.0µs → 343.9µs (-12.3%)
- right/string_view short_result: 228.7µs → 125.3µs (-45.2%)
- right/string_view long_result: 393.6µs → 238.0µs (-39.5%)
```

## What changes are included in this PR?

* Update benchmarks to measure both inline and out-of-line string results
* Change `left` and `right` return types to be `Utf8View`
* Optimize `left` and `right` string array path to do zero-copy when possible
* Optimize `left` and `right` string view path, and refactor it to be more similar to the array path
* Add more SLT tests to cover modified code paths
* Update various test expectations to reflect the new return type

## Are these changes tested?

Yes; benchmarked and new tests added.

## Are there any user-facing changes?

The return type of these functions has changed. This shouldn't typically break any user logic, although it might result in the planner inserting or removing casts for downstream operators, and the performance of downstream operators might be better or worse, depending on whether the downstream code is better suited to the `Utf8` or `Utf8View` string representation.
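The inline versus out-of-line distinction and the 4GB restriction mentioned above can be illustrated with a small std-only sketch. It follows the 16-byte view layout described in the Arrow columnar format (a 4-byte length, then either up to 12 inline bytes or a 4-byte prefix, a buffer index, and a 32-bit offset). The names `encode_view`, `left_byte_len`, and `MAX_INLINE` are hypothetical and chosen for illustration; this is not the code from this PR or from arrow-rs, and it only handles a non-negative character count.

```rust
/// Longest string (in bytes) that fits inline in a 16-byte view.
const MAX_INLINE: usize = 12;

/// Encode a view for `buffer[offset..offset + len]` (hypothetical helper).
/// Returns `None` when the slice is out of bounds or when the length or
/// offset does not fit in 32 bits, which is the case where a caller would
/// have to fall back to copying.
fn encode_view(buffer: &[u8], offset: usize, len: usize, buffer_index: u32) -> Option<[u8; 16]> {
    let len32 = u32::try_from(len).ok()?;
    let data = buffer.get(offset..offset.checked_add(len)?)?;
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&len32.to_le_bytes());
    if len <= MAX_INLINE {
        // Short result: the bytes live inside the view itself.
        view[4..4 + len].copy_from_slice(data);
    } else {
        // Long result: store a prefix plus a pointer into the shared buffer.
        let offset32 = u32::try_from(offset).ok()?;
        view[4..8].copy_from_slice(&data[..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset32.to_le_bytes());
    }
    Some(view)
}

/// Byte length of the first `n` characters of `s` (the whole string if it is shorter).
fn left_byte_len(s: &str, n: usize) -> usize {
    s.char_indices().nth(n).map_or(s.len(), |(i, _)| i)
}
```

In this model, a result of 12 bytes or fewer never references the input buffer, while a longer result only records where its bytes already live. That is one reason short and long results are worth benchmarking separately.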
Contributor
Thanks @neilconway & @coderfender
Rich-T-kid
pushed a commit
to Rich-T-kid/datafusion
that referenced
this pull request
Apr 21, 2026
Rich-T-kid
pushed a commit
to Rich-T-kid/datafusion
that referenced
this pull request
Apr 21, 2026
## Which issue does this PR close?

- Closes apache#21364.

## Rationale for this change

For `Utf8` and `LargeUtf8` inputs, we can optimize `substr` to avoid copying the output strings; instead, we can return a `StringViewArray` that points into the input value buffer.

Benchmarks (M4 Max):

no count, short strings (size=1024):
- string_view: 5.97 µs -> 5.96 µs (-0.2%)
- string: 7.80 µs -> 4.99 µs (-36.1%)
- large_string: 8.47 µs -> 4.90 µs (-42.2%)

no count, short strings (size=4096):
- string_view: 23.10 µs -> 22.90 µs (-0.9%)
- string: 31.24 µs -> 18.31 µs (-41.4%)
- large_string: 34.10 µs -> 17.70 µs (-48.1%)

with count, long strings (size=1024, count=64, strlen=128):
- string_view: 10.16 µs -> 10.79 µs (+6.2%)
- string: 11.90 µs -> 8.38 µs (-29.6%)
- large_string: 11.93 µs -> 8.30 µs (-30.5%)

with count, long strings (size=4096, count=64, strlen=128):
- string_view: 39.37 µs -> 38.79 µs (-1.5%)
- string: 46.22 µs -> 30.25 µs (-34.6%)
- large_string: 46.57 µs -> 30.49 µs (-34.5%)

short count, long strings (size=1024, count=6, strlen=128):
- string_view: 11.65 µs -> 11.57 µs (-0.7%)
- string: 14.97 µs -> 11.37 µs (-24.1%)
- large_string: 14.92 µs -> 11.37 µs (-23.8%)

short count, long strings (size=4096, count=6, strlen=128):
- string_view: 45.88 µs -> 43.82 µs (-4.5%)
- string: 58.38 µs -> 43.55 µs (-25.4%)
- large_string: 58.59 µs -> 43.58 µs (-25.6%)

scalar start, no count, short strings (size=1024, strlen=12):
- string_view: 6.07 µs -> 6.10 µs (+0.5%)
- string: 7.81 µs -> 5.06 µs (-35.2%)

scalar start, no count, short strings (size=4096, strlen=12):
- string_view: 23.08 µs -> 22.62 µs (-2.0%)
- string: 31.07 µs -> 18.86 µs (-39.3%)

scalar start, no count, long strings (size=1024, strlen=128):
- string_view: 9.99 µs -> 10.65 µs (+6.6%)
- string: 12.01 µs -> 8.17 µs (-32.0%)

scalar start, no count, long strings (size=4096, strlen=128):
- string_view: 38.57 µs -> 39.79 µs (+3.2%)
- string: 46.83 µs -> 31.67 µs (-32.4%)

scalar start=1, no count, long strings (size=1024, strlen=128):
- string_view: 9.78 µs -> 10.48 µs (+7.2%)
- string: 12.02 µs -> 8.16 µs (-32.1%)

scalar start=1, no count, long strings (size=4096, strlen=128):
- string_view: 38.54 µs -> 40.18 µs (+4.3%)
- string: 46.36 µs -> 31.73 µs (-31.6%)

scalar args, short strings (size=1024, count=6, strlen=12):
- string_view: 11.30 µs -> 11.23 µs (-0.7%)
- string: 15.04 µs -> 11.52 µs (-23.4%)

scalar args, short strings (size=4096, count=6, strlen=12):
- string_view: 44.34 µs -> 43.98 µs (-0.8%)
- string: 59.63 µs -> 45.02 µs (-24.5%)

scalar args, long strings (size=1024, count=64, strlen=128):
- string_view: 10.51 µs -> 12.05 µs (+14.6%)
- string: 12.21 µs -> 8.67 µs (-28.9%)
- large_string: 12.20 µs -> 8.66 µs (-29.0%)

scalar args, long strings (size=4096, count=64, strlen=128):
- string_view: 40.13 µs -> 41.89 µs (+4.4%)
- string: 46.96 µs -> 32.44 µs (-30.9%)
- large_string: 47.24 µs -> 32.49 µs (-31.2%)

This PR doesn't modify the `string_view` code path; I've included the benchmark results above for completeness, but any changes should just be benchmarking noise.

## What changes are included in this PR?

* Implement optimization
* Other minor code cleanup
* Add a benchmark (only somewhat related to this optimization but related to future optimization work)

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.
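The idea of returning views that point into the input value buffer can be sketched without any Arrow dependency. The types below (`Utf8Column`, `View`, `ViewColumn`) and the function `substr_zero_copy` are hypothetical, simplified stand-ins rather than the arrow-rs or DataFusion types. The sketch takes a zero-based number of characters to skip instead of SQL's one-based `start`, so it does not claim to reproduce the exact `substr` semantics.

```rust
/// Simplified stand-in for a `Utf8` column: one contiguous value buffer
/// plus row offsets (`offsets.len()` is the row count plus one).
struct Utf8Column {
    values: String,
    offsets: Vec<usize>,
}

impl Utf8Column {
    fn from_strs(rows: &[&str]) -> Self {
        let mut values = String::new();
        let mut offsets = vec![0];
        for r in rows {
            values.push_str(r);
            offsets.push(values.len());
        }
        Self { values, offsets }
    }
}

/// Simplified view: a 32-bit offset and length into a shared buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: u32,
    len: u32,
}

/// Output column that borrows the input buffer instead of copying it.
struct ViewColumn<'a> {
    buffer: &'a str,
    views: Vec<View>,
}

impl<'a> ViewColumn<'a> {
    fn get(&self, row: usize) -> &'a str {
        let v = self.views[row];
        let start = v.offset as usize;
        &self.buffer[start..start + v.len as usize]
    }
}

/// Skip `skip` characters, then keep `take` characters (or the rest of the row).
/// Returns `None` if the buffer is too large for 32-bit offsets, in which case
/// a caller would need a copying fallback.
fn substr_zero_copy(col: &Utf8Column, skip: usize, take: Option<usize>) -> Option<ViewColumn<'_>> {
    if col.values.len() > u32::MAX as usize {
        return None;
    }
    let mut views = Vec::with_capacity(col.offsets.len() - 1);
    for w in col.offsets.windows(2) {
        let row = &col.values[w[0]..w[1]];
        // Byte position of the first kept character (row end if too short).
        let begin = row.char_indices().nth(skip).map_or(row.len(), |(i, _)| i);
        let rest = &row[begin..];
        // Byte length of the kept characters.
        let len = match take {
            Some(n) => rest.char_indices().nth(n).map_or(rest.len(), |(i, _)| i),
            None => rest.len(),
        };
        views.push(View { offset: (w[0] + begin) as u32, len: len as u32 });
    }
    Some(ViewColumn { buffer: &col.values, views })
}
```

Only the `views` vector is allocated per call; no string bytes are copied. That matches the direction of the `string` and `large_string` improvements reported above, although the sketch itself was not benchmarked.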