Commit 97172e2
authored
## Which issue does this PR close?
- Closes #21441.
## Rationale for this change
This PR makes two distinct optimizations to the `left` and `right`
builtin UDFs:
1. The `left` and `right` built-in UDFs have a zero-copy path for
`Utf8View` input, but they always copy for `Utf8` and `LargeUtf8`
inputs. If we make these functions always return `Utf8View`, we can add
a zero-copy path for `Utf8` and `LargeUtf8` paths as well. We can't take
this path in the case when the largest offset in the input string array
is > 4GB, but that is rare. This follows the recent optimization for
`substr` (#21366)
2. In the code path that handles `Utf8View` input, we were constructing
the return value via `StringViewArray::try_new`, which does some fairly
expensive validation. We know the return value is correct by
construction, so we can use `StringViewArray::new_unchecked` instead.
Benchmarks (ARM64):
```
- left/string short_result: 179.6µs → 127.1µs (-29.2%)
- left/string long_result: 324.3µs → 262.2µs (-19.1%)
- left/string_view short_result: 220.9µs → 122.5µs (-44.5%)
- left/string_view long_result: 383.1µs → 212.0µs (-44.7%)
- right/string short_result: 180.4µs → 126.0µs (-30.2%)
- right/string long_result: 392.0µs → 343.9µs (-12.3%)
- right/string_view short_result: 228.7µs → 125.3µs (-45.2%)
- right/string_view long_result: 393.6µs → 238.0µs (-39.5%)
```
## What changes are included in this PR?
* Update benchmarks to measure both inline and out-of-line string
results
* Change `left` and `right` return types to be `Utf8View`
* Optimize `left` and `right` string array path to do zero-copy when
possible
* Optimize `left` and `right` string view path, and refactor it to be
more similar to the array path
* Add more SLT tests to cover modified code paths
* Update various test expectations to reflect the new return type
## Are these changes tested?
Yes; benchmarked and new tests added.
## Are there any user-facing changes?
The return value of these functions have changed. This shouldn't
typically break any user logic, although it might result in the planner
inserting or removing casts for downstream operators, and the
performance of downstream operators might either be better or worse,
depending on whether the downstream code is better suited for `Utf8` or
`Utf8View` string representations.
1 parent 769e214 commit 97172e2
5 files changed
Lines changed: 344 additions & 192 deletions
File tree
- datafusion
- functions
- benches
- src/unicode
- sqllogictest/test_files/string
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
16 | 16 | | |
17 | 17 | | |
18 | 18 | | |
| 19 | + | |
19 | 20 | | |
20 | 21 | | |
21 | 22 | | |
22 | 23 | | |
23 | 24 | | |
24 | 25 | | |
25 | 26 | | |
26 | | - | |
| 27 | + | |
27 | 28 | | |
28 | 29 | | |
29 | 30 | | |
30 | 31 | | |
| 32 | + | |
| 33 | + | |
31 | 34 | | |
32 | | - | |
33 | 35 | | |
34 | | - | |
| 36 | + | |
35 | 37 | | |
36 | 38 | | |
37 | 39 | | |
38 | 40 | | |
39 | | - | |
| 41 | + | |
40 | 42 | | |
41 | 43 | | |
42 | 44 | | |
43 | | - | |
| 45 | + | |
44 | 46 | | |
45 | 47 | | |
46 | 48 | | |
47 | | - | |
48 | | - | |
49 | | - | |
50 | | - | |
51 | | - | |
52 | | - | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
53 | 53 | | |
54 | 54 | | |
55 | 55 | | |
| |||
59 | 59 | | |
60 | 60 | | |
61 | 61 | | |
62 | | - | |
63 | | - | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
64 | 68 | | |
65 | | - | |
66 | | - | |
67 | | - | |
68 | | - | |
69 | | - | |
70 | | - | |
71 | | - | |
| 69 | + | |
| 70 | + | |
72 | 71 | | |
73 | | - | |
74 | | - | |
75 | | - | |
76 | | - | |
77 | | - | |
78 | | - | |
79 | | - | |
80 | | - | |
81 | | - | |
82 | | - | |
83 | | - | |
84 | | - | |
85 | | - | |
86 | | - | |
87 | | - | |
88 | | - | |
89 | | - | |
90 | | - | |
91 | | - | |
92 | | - | |
93 | | - | |
94 | | - | |
95 | | - | |
96 | | - | |
97 | | - | |
98 | | - | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
99 | 78 | | |
100 | | - | |
101 | | - | |
102 | | - | |
103 | | - | |
104 | | - | |
105 | | - | |
106 | | - | |
107 | | - | |
108 | | - | |
109 | | - | |
110 | | - | |
111 | | - | |
112 | | - | |
113 | | - | |
114 | | - | |
115 | | - | |
116 | | - | |
117 | | - | |
118 | | - | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
119 | 91 | | |
120 | | - | |
121 | | - | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
122 | 107 | | |
123 | 108 | | |
| 109 | + | |
| 110 | + | |
124 | 111 | | |
125 | 112 | | |
126 | 113 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
18 | 18 | | |
19 | 19 | | |
20 | 20 | | |
21 | | - | |
22 | | - | |
| 21 | + | |
| 22 | + | |
23 | 23 | | |
24 | 24 | | |
25 | 25 | | |
| 26 | + | |
26 | 27 | | |
27 | 28 | | |
28 | 29 | | |
| |||
130 | 131 | | |
131 | 132 | | |
132 | 133 | | |
133 | | - | |
| 134 | + | |
134 | 135 | | |
135 | 136 | | |
136 | 137 | | |
137 | 138 | | |
138 | 139 | | |
139 | | - | |
| 140 | + | |
140 | 141 | | |
141 | 142 | | |
142 | 143 | | |
143 | | - | |
| 144 | + | |
144 | 145 | | |
145 | 146 | | |
146 | 147 | | |
| |||
150 | 151 | | |
151 | 152 | | |
152 | 153 | | |
153 | | - | |
154 | | - | |
155 | | - | |
156 | | - | |
157 | | - | |
158 | | - | |
159 | | - | |
160 | | - | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
| 166 | + | |
| 167 | + | |
| 168 | + | |
| 169 | + | |
| 170 | + | |
161 | 171 | | |
162 | | - | |
163 | | - | |
164 | | - | |
165 | | - | |
166 | | - | |
167 | | - | |
168 | | - | |
169 | | - | |
170 | | - | |
171 | | - | |
172 | | - | |
173 | | - | |
174 | | - | |
| 172 | + | |
| 173 | + | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
| 180 | + | |
| 181 | + | |
| 182 | + | |
| 183 | + | |
| 184 | + | |
| 185 | + | |
| 186 | + | |
| 187 | + | |
| 188 | + | |
| 189 | + | |
| 190 | + | |
| 191 | + | |
| 192 | + | |
| 193 | + | |
| 194 | + | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
| 204 | + | |
| 205 | + | |
| 206 | + | |
| 207 | + | |
| 208 | + | |
| 209 | + | |
175 | 210 | | |
176 | | - | |
| 211 | + | |
| 212 | + | |
| 213 | + | |
| 214 | + | |
| 215 | + | |
| 216 | + | |
| 217 | + | |
| 218 | + | |
| 219 | + | |
| 220 | + | |
| 221 | + | |
| 222 | + | |
| 223 | + | |
| 224 | + | |
| 225 | + | |
| 226 | + | |
177 | 227 | | |
178 | 228 | | |
179 | | - | |
| 229 | + | |
180 | 230 | | |
181 | 231 | | |
182 | 232 | | |
183 | | - | |
184 | | - | |
185 | | - | |
| 233 | + | |
186 | 234 | | |
187 | | - | |
188 | | - | |
189 | | - | |
190 | | - | |
191 | | - | |
192 | | - | |
193 | | - | |
| 235 | + | |
| 236 | + | |
| 237 | + | |
194 | 238 | | |
195 | 239 | | |
196 | 240 | | |
197 | | - | |
198 | | - | |
199 | | - | |
200 | | - | |
201 | | - | |
202 | | - | |
203 | | - | |
204 | | - | |
205 | | - | |
206 | | - | |
207 | | - | |
208 | | - | |
209 | | - | |
210 | | - | |
211 | | - | |
212 | | - | |
213 | | - | |
214 | | - | |
215 | | - | |
216 | | - | |
217 | | - | |
218 | | - | |
219 | | - | |
220 | | - | |
| 241 | + | |
| 242 | + | |
221 | 243 | | |
| 244 | + | |
| 245 | + | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
| 256 | + | |
| 257 | + | |
222 | 258 | | |
223 | 259 | | |
224 | 260 | | |
225 | | - | |
226 | | - | |
227 | | - | |
228 | | - | |
229 | | - | |
230 | | - | |
231 | | - | |
| 261 | + | |
| 262 | + | |
| 263 | + | |
| 264 | + | |
| 265 | + | |
| 266 | + | |
| 267 | + | |
| 268 | + | |
| 269 | + | |
| 270 | + | |
| 271 | + | |
| 272 | + | |
| 273 | + | |
| 274 | + | |
232 | 275 | | |
0 commit comments