perf: Optimize substr for Utf8, LargeUtf8 #21366

Merged
Jefffrey merged 7 commits into apache:main from neilconway:neilc/optimize-substr-zerocopy
Apr 17, 2026

Conversation

Contributor

@neilconway neilconway commented Apr 4, 2026

Which issue does this PR close?

Closes #21364.

Rationale for this change

For Utf8 and LargeUtf8 inputs, we can optimize substr to avoid copying the output strings; instead, we can return a StringViewArray that points into the input value buffer.
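
To illustrate the idea, here is a minimal std-only sketch, not the code from this PR. It assumes SQL-style semantics (1-based `start`, optional character `count`) and uses a hypothetical `View` struct in place of Arrow's packed 16-byte views. The point is that `substr` only needs to compute a byte range; the substring itself is never copied.

```rust
/// A view into a shared value buffer. No string data is owned or copied.
/// (Hypothetical type for illustration; Arrow packs views into 16 bytes.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct View {
    offset: u32, // byte offset into the shared value buffer
    len: u32,    // byte length of the substring
}

/// Compute the view for `substr(s, start, count)`.
/// `base` is the byte offset of `s` inside the shared value buffer.
fn substr_view(s: &str, base: u32, start: i64, count: Option<u32>) -> View {
    // Character window [first, end) in 0-based positions, clipped at 0.
    let first = start.saturating_sub(1).max(0) as usize;
    let end = count
        .map(|c| start.saturating_sub(1).saturating_add(i64::from(c)).max(0) as usize);

    // Translate a character position into a byte position (UTF-8 aware).
    let byte_at = |char_pos: usize| {
        s.char_indices().nth(char_pos).map_or(s.len(), |(i, _)| i)
    };

    let begin_byte = byte_at(first);
    let end_byte = end.map_or(s.len(), byte_at);
    View {
        offset: base + begin_byte as u32,
        len: (end_byte - begin_byte) as u32,
    }
}

/// Read a view back out of the shared value buffer.
fn resolve(values: &str, view: View) -> &str {
    let start = view.offset as usize;
    &values[start..start + view.len as usize]
}
```

With a value buffer `"héllowörld"` holding `"héllo"` at byte offset 0, `substr_view("héllo", 0, 2, Some(3))` yields a view that `resolve` reads back as `"éll"`, without allocating a new string.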

Benchmarks (M4 Max):

no count, short strings (size=1024):
- string_view: 5.97 µs -> 5.96 µs (-0.2%)
- string: 7.80 µs -> 4.99 µs (-36.1%)
- large_string: 8.47 µs -> 4.90 µs (-42.2%)

no count, short strings (size=4096):
- string_view: 23.10 µs -> 22.90 µs (-0.9%)
- string: 31.24 µs -> 18.31 µs (-41.4%)
- large_string: 34.10 µs -> 17.70 µs (-48.1%)

with count, long strings (size=1024, count=64, strlen=128):
- string_view: 10.16 µs -> 10.79 µs (+6.2%)
- string: 11.90 µs -> 8.38 µs (-29.6%)
- large_string: 11.93 µs -> 8.30 µs (-30.5%)

with count, long strings (size=4096, count=64, strlen=128):
- string_view: 39.37 µs -> 38.79 µs (-1.5%)
- string: 46.22 µs -> 30.25 µs (-34.6%)
- large_string: 46.57 µs -> 30.49 µs (-34.5%)

short count, long strings (size=1024, count=6, strlen=128):
- string_view: 11.65 µs -> 11.57 µs (-0.7%)
- string: 14.97 µs -> 11.37 µs (-24.1%)
- large_string: 14.92 µs -> 11.37 µs (-23.8%)

short count, long strings (size=4096, count=6, strlen=128):
- string_view: 45.88 µs -> 43.82 µs (-4.5%)
- string: 58.38 µs -> 43.55 µs (-25.4%)
- large_string: 58.59 µs -> 43.58 µs (-25.6%)

scalar start, no count, short strings (size=1024, strlen=12):
- string_view: 6.07 µs -> 6.10 µs (+0.5%)
- string: 7.81 µs -> 5.06 µs (-35.2%)

scalar start, no count, short strings (size=4096, strlen=12):
- string_view: 23.08 µs -> 22.62 µs (-2.0%)
- string: 31.07 µs -> 18.86 µs (-39.3%)

scalar start, no count, long strings (size=1024, strlen=128):
- string_view: 9.99 µs -> 10.65 µs (+6.6%)
- string: 12.01 µs -> 8.17 µs (-32.0%)

scalar start, no count, long strings (size=4096, strlen=128):
- string_view: 38.57 µs -> 39.79 µs (+3.2%)
- string: 46.83 µs -> 31.67 µs (-32.4%)

scalar start=1, no count, long strings (size=1024, strlen=128):
- string_view: 9.78 µs -> 10.48 µs (+7.2%)
- string: 12.02 µs -> 8.16 µs (-32.1%)

scalar start=1, no count, long strings (size=4096, strlen=128):
- string_view: 38.54 µs -> 40.18 µs (+4.3%)
- string: 46.36 µs -> 31.73 µs (-31.6%)

scalar args, short strings (size=1024, count=6, strlen=12):
- string_view: 11.30 µs -> 11.23 µs (-0.7%)
- string: 15.04 µs -> 11.52 µs (-23.4%)

scalar args, short strings (size=4096, count=6, strlen=12):
- string_view: 44.34 µs -> 43.98 µs (-0.8%)
- string: 59.63 µs -> 45.02 µs (-24.5%)

scalar args, long strings (size=1024, count=64, strlen=128):
- string_view: 10.51 µs -> 12.05 µs (+14.6%)
- string: 12.21 µs -> 8.67 µs (-28.9%)
- large_string: 12.20 µs -> 8.66 µs (-29.0%)

scalar args, long strings (size=4096, count=64, strlen=128):
- string_view: 40.13 µs -> 41.89 µs (+4.4%)
- string: 46.96 µs -> 32.44 µs (-30.9%)
- large_string: 47.24 µs -> 32.49 µs (-31.2%)

This PR doesn't modify the string_view code path; I've included the benchmark results above for completeness, but any changes should just be benchmarking noise.

What changes are included in this PR?

  • Implement optimization
  • Other minor code cleanup
  • Add a benchmark (only somewhat related to this optimization but related to future optimization work)

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

github-actions bot added the functions (Changes to functions implementation) label Apr 4, 2026
Comment thread datafusion/functions/src/unicode/substr.rs
Contributor

@coderfender coderfender left a comment

Left minor comments. Thank you @neilconway

Comment thread datafusion/functions/src/unicode/substr.rs Outdated
Comment thread datafusion/functions/src/unicode/substr.rs
Comment thread datafusion/functions/src/unicode/substr.rs Outdated
@neilconway
Contributor Author

@coderfender Thanks for the review! Please let me know if you have any more feedback.

Contributor

@coderfender coderfender left a comment

Left some comments (in the previous cycle), but those could be good follow-ups.

Comment thread datafusion/functions/src/unicode/substr.rs Outdated
Comment thread datafusion/functions/src/unicode/substr.rs Outdated
@neilconway
Contributor Author

@Jefffrey Thanks for the review! I addressed your comments.

github-merge-queue Bot pushed a commit that referenced this pull request Apr 17, 2026
## Which issue does this PR close?

- Closes #21441.

## Rationale for this change

This PR makes two distinct optimizations to the `left` and `right`
builtin UDFs:

1. The `left` and `right` built-in UDFs have a zero-copy path for
`Utf8View` input, but they always copy for `Utf8` and `LargeUtf8`
inputs. If we make these functions always return `Utf8View`, we can add
a zero-copy path for `Utf8` and `LargeUtf8` inputs as well. We can't take
this path when the largest offset in the input string array
is > 4GB, but that is rare. This follows the recent optimization for
`substr` (#21366).
2. In the code path that handles `Utf8View` input, we were constructing
the return value via `StringViewArray::try_new`, which does some fairly
expensive validation. We know the return value is correct by
construction, so we can use `StringViewArray::new_unchecked` instead.
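
The 4GB restriction in item 1 follows from the view layout: a view stores its buffer offset in 32 bits. A minimal sketch of such a guard, assuming Arrow-style offsets (`n + 1` non-decreasing entries, so the last one is the largest) and a hypothetical helper name; the exact check in the PR may differ:

```rust
/// Returns true if every byte offset fits in a view's 32-bit offset field.
/// `offsets` is assumed to be non-decreasing, so only the last entry
/// needs to be inspected. (Hypothetical helper for illustration.)
fn offsets_fit_in_view(offsets: &[i64]) -> bool {
    match offsets.last() {
        // `u32::try_from` fails for values above u32::MAX (and for negatives).
        Some(&max) => u32::try_from(max).is_ok(),
        // An empty offsets slice has nothing to overflow.
        None => true,
    }
}
```

When the check fails, the function has to fall back to a path that copies the string data.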

Benchmarks (ARM64):
```
  - left/string short_result: 179.6µs → 127.1µs (-29.2%)
  - left/string long_result: 324.3µs → 262.2µs (-19.1%)
  - left/string_view short_result: 220.9µs → 122.5µs (-44.5%)
  - left/string_view long_result: 383.1µs → 212.0µs (-44.7%)
  - right/string short_result: 180.4µs → 126.0µs (-30.2%)
  - right/string long_result: 392.0µs → 343.9µs (-12.3%)
  - right/string_view short_result: 228.7µs → 125.3µs (-45.2%)
  - right/string_view long_result: 393.6µs → 238.0µs (-39.5%)
```

## What changes are included in this PR?

* Update benchmarks to measure both inline and out-of-line string
results
* Change `left` and `right` return types to be `Utf8View`
* Optimize `left` and `right` string array path to do zero-copy when
possible
* Optimize `left` and `right` string view path, and refactor it to be
more similar to the array path
* Add more SLT tests to cover modified code paths
* Update various test expectations to reflect the new return type
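
For context on "inline" versus "out-of-line" results: in the Arrow view layout, each view is 16 bytes, and strings of at most 12 bytes are stored directly inside the view, while longer strings store a 4-byte prefix plus a buffer index and offset. The sketch below encodes that layout with std only; `encode_view` and `is_inline` are hypothetical names, not arrow-rs APIs.

```rust
/// Strings up to this many bytes are stored inside the view itself.
const MAX_INLINE_LEN: usize = 12;

/// Encode a 16-byte view (little-endian fields).
/// Short strings are inlined; long strings record a 4-byte prefix,
/// the index of the data buffer, and the byte offset within it.
fn encode_view(data: &[u8], buffer_index: u32, offset: u32) -> [u8; 16] {
    let mut view = [0u8; 16];
    // Bytes 0..4: length of the string in bytes.
    view[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= MAX_INLINE_LEN {
        // Bytes 4..16: the string itself, zero-padded.
        view[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Bytes 4..8: prefix; 8..12: buffer index; 12..16: offset.
        view[4..8].copy_from_slice(&data[0..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// A view is inline when its length field is at most 12.
fn is_inline(view: &[u8; 16]) -> bool {
    let len = u32::from_le_bytes([view[0], view[1], view[2], view[3]]);
    len as usize <= MAX_INLINE_LEN
}
```

Because short results are written into the view itself while long results only reference the existing buffer, the two cases exercise different work, which is presumably why the benchmarks measure both.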

## Are these changes tested?

Yes; benchmarked and new tests added.

## Are there any user-facing changes?

The return type of these functions has changed. This shouldn't
typically break any user logic, although it might result in the planner
inserting or removing casts for downstream operators, and the
performance of downstream operators might either be better or worse,
depending on whether the downstream code is better suited for `Utf8` or
`Utf8View` string representations.
@Jefffrey Jefffrey added this pull request to the merge queue Apr 17, 2026
Merged via the queue into apache:main with commit 5a427cb Apr 17, 2026
53 of 54 checks passed
@Jefffrey
Contributor

Thanks @neilconway & @coderfender

@neilconway neilconway deleted the neilc/optimize-substr-zerocopy branch April 18, 2026 12:19
Rich-T-kid pushed a commit to Rich-T-kid/datafusion that referenced this pull request Apr 21, 2026
Rich-T-kid pushed a commit to Rich-T-kid/datafusion that referenced this pull request Apr 21, 2026
## Which issue does this PR close?

- Closes apache#21364.

Development

Successfully merging this pull request may close these issues.

Optimize substr() to avoid copying for Utf8, LargeUtf8
