perf(physical-plan): optimize byte view append #21586
kumarUjjawal wants to merge 3 commits into apache:main from feat/vectorized_append
Conversation
```rust
let start_idx = self.views.len();
self.views.extend(rows.iter().map(|&row| source_views[row]));
```

This could be moved to line 202, which would re-use the iteration over `rows`.
```rust
let start_idx = self.views.len();
self.views.extend(rows.iter().map(|&row| source_views[row]));

let mut pending = Vec::with_capacity(rows.len());
```

This will most probably over-allocate, because only the rows with more than 12 bytes are appended. Maybe use half of the rows' length as the initial capacity and let it grow if needed?
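The sizing trade-off can be sketched in isolation. Everything below is a hypothetical stand-in, not the PR's code: `PendingCopy` and `collect_long_rows` are illustrative names for the idea of reserving roughly half of `rows.len()` up front and letting the vector grow when a batch is string-heavy.

```rust
// Hypothetical stand-in for the PR's pending-copy record.
#[derive(Clone, Copy, Debug, PartialEq)]
struct PendingCopy {
    row: usize,
    len: u32,
}

/// Collect the rows whose values exceed the 12-byte inline limit,
/// reserving only about half of `rows.len()` up front as suggested.
fn collect_long_rows(view_lens: &[u32], rows: &[usize]) -> Vec<PendingCopy> {
    // Short strings dominate many workloads, so a full-length
    // reservation usually wastes memory; the vector can still grow.
    let mut pending = Vec::with_capacity(rows.len().saturating_add(1) / 2);
    for &row in rows {
        let len = view_lens[row];
        if len > 12 {
            pending.push(PendingCopy { row, len });
        }
    }
    pending
}

fn main() {
    // Per-row value lengths: only rows 1 and 3 exceed the inline limit.
    let lens = [4, 20, 12, 64];
    let pending = collect_long_rows(&lens, &[0, 1, 2, 3]);
    assert_eq!(pending.len(), 2);
    assert_eq!(pending[0], PendingCopy { row: 1, len: 20 });
}
```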
```rust
rows: &[usize],
) {
    let source_views = array.views();
    let mut pending = Vec::with_capacity(rows.len());
```

Same here - consider using a smaller initial capacity.

> I think half of `rows.len()` would be good, as you said.
run benchmarks
🤖 Benchmark running (GKE): comparing feat/vectorized_append (bd0a959) to 98d280f (merge-base) using clickbench_partitioned

🤖 Benchmark running (GKE): comparing feat/vectorized_append (bd0a959) to 98d280f (merge-base) using tpch

🤖 Benchmark running (GKE): comparing feat/vectorized_append (bd0a959) to 98d280f (merge-base) using tpcds
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs branch

🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs branch
alamb left a comment:

Thanks for this @kumarUjjawal and @martin-g

Do we have benchmarks that show it making the operation faster? The bot benchmark runner results seem mixed.

I am also surprised that this would be faster given how much extra work it is doing (allocating another buffer, and a bunch of calculations in `batch_copy_long_views`) 🤔
```rust
for (idx, &row) in rows.iter().enumerate() {
    let view = source_views[row];
    self.views.push(view);
    if (view as u32) > 12 {
```

We could maybe replace 12 with `MAX_INLINE_VIEW_LEN` to make it clearer what is going on here.
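For context on the `(view as u32) > 12` check: in the Arrow byte-view layout, the low 32 bits of the 128-bit view hold the value's length, and values of 12 bytes or fewer are stored inline in the view itself, so only longer values need a copy from the data buffers. A small sketch; the constant name follows the reviewer's suggestion and is defined locally here, not imported from upstream:

```rust
/// Maximum length that fits inline in an Arrow byte view
/// (reviewer-suggested name, defined locally for illustration).
const MAX_INLINE_VIEW_LEN: u32 = 12;

/// The low 32 bits of a 128-bit byte view encode the value length.
fn view_len(view: u128) -> u32 {
    view as u32
}

/// A view needs an out-of-line buffer copy only when its value is
/// longer than the inline limit.
fn needs_buffer_copy(view: u128) -> bool {
    view_len(view) > MAX_INLINE_VIEW_LEN
}

fn main() {
    // A view for a 5-byte value: the bytes live inline in the view.
    let short: u128 = 5;
    // A view for a 40-byte value: the bytes live in a separate buffer.
    let long: u128 = 40;
    assert!(!needs_buffer_copy(short));
    assert!(needs_buffer_copy(long));
}
```

Using the named constant in the condition documents intent the bare `12` does not.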
```rust
let start_idx = self.views.len();
let mut pending = Vec::with_capacity(rows.len().saturating_add(1) / 2);
```

You could probably save this allocation by using some sort of temp scratch space on `ByteViewGroupValueBuilder`. Something like:

```rust
pub struct ByteViewGroupValueBuilder<B: ByteViewType> {
    ...
    scratch: Vec<PendingByteViewCopy>,
}
```

So then it wouldn't reallocate on each batch.

> I don't think we need to have this scratch space.
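The reuse pattern being suggested can be sketched like this (the `Builder` struct and its methods are illustrative, not the PR's types): keep the `Vec` as a field, `clear()` it at the start of each batch, and let its capacity carry over between batches.

```rust
/// Illustrative stand-in for a pending long-value copy.
#[derive(Clone, Copy)]
struct PendingByteViewCopy {
    source_row: usize,
}

/// Sketch of a builder that owns a reusable scratch buffer.
struct Builder {
    scratch: Vec<PendingByteViewCopy>,
}

impl Builder {
    fn new() -> Self {
        Self { scratch: Vec::new() }
    }

    /// `clear()` keeps the allocation, so after the first batch no
    /// reallocation happens unless a later batch has more long
    /// values than any batch before it. Returns the capacity so the
    /// reuse is observable.
    fn append_batch(&mut self, long_rows: &[usize]) -> usize {
        self.scratch.clear();
        self.scratch.extend(
            long_rows
                .iter()
                .map(|&source_row| PendingByteViewCopy { source_row }),
        );
        self.scratch.capacity()
    }
}

fn main() {
    let mut b = Builder::new();
    let cap_big = b.append_batch(&[0, 1, 2, 3]);
    // A smaller batch reuses the capacity grown by the first one.
    let cap_small = b.append_batch(&[7]);
    assert_eq!(cap_big, cap_small);
}
```

The trade-off, as the thread notes, is a larger builder struct and buffer memory that stays resident between batches.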
```rust
const BYTE_VIEW_MAX_BLOCK_SIZE: usize = 2 * 1024 * 1024;

#[derive(Clone, Copy)]
```

I think it would help to add some comments explaining what this is and what it is used for, so a future reader doesn't have to read the rest of the code to figure it out.
```rust
self.batch_copy_long_views(array.data_buffers(), &pending);
}

fn batch_copy_long_views(
```

It would help me understand this if it had some more comments explaining the rationale for postponing the copy (to detect contiguous views?).
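If the guess in the question is right, the payoff of postponing the copies would look something like this sketch: ranges that are adjacent in the same source buffer can be merged so each run becomes one bulk copy instead of one copy per value. This is a speculative illustration of that rationale, not the PR's implementation; all names are made up.

```rust
/// (buffer index, offset, length) of one long value in the source buffers.
type CopyRange = (usize, usize, usize);

/// Merge runs of ranges that are contiguous in the same source buffer,
/// so each merged run can be copied with a single `extend_from_slice`.
fn coalesce(ranges: &[CopyRange]) -> Vec<CopyRange> {
    let mut merged: Vec<CopyRange> = Vec::new();
    for &(buf, off, len) in ranges {
        match merged.last_mut() {
            // Same buffer and this range starts where the last one
            // ended: grow the pending copy instead of starting a new one.
            Some((b, o, l)) if *b == buf && *o + *l == off => *l += len,
            _ => merged.push((buf, off, len)),
        }
    }
    merged
}

fn main() {
    // Two values laid out back-to-back, then a gap, then another buffer.
    let ranges = [(0, 0, 16), (0, 16, 32), (0, 56, 8), (1, 0, 20)];
    let merged = coalesce(&ranges);
    // The first two merge into one 48-byte copy; the gap before
    // (0, 56, 8) and the buffer change each break the run.
    assert_eq!(merged, vec![(0, 0, 48), (0, 56, 8), (1, 0, 20)]);
}
```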
```rust
let mut pending = Vec::with_capacity(rows.len().saturating_add(1) / 2);
self.views.reserve(rows.len());

for &row in rows {
```

We should use `extend` more in this code path, which reduces capacity checks and improves code generation. It also reduces reallocations.
Dandandan left a comment:

I think we probably need some good benchmarks first, or to show an improvement on the aggregation benchmarks.
@Dandandan Can you run the benchmark?
run benchmarks
🤖 Benchmark running (GKE): comparing feat/vectorized_append (1fed6b5) to 98d280f (merge-base) using tpcds

🤖 Benchmark running (GKE): comparing feat/vectorized_append (1fed6b5) to 98d280f (merge-base) using tpch

🤖 Benchmark running (GKE): comparing feat/vectorized_append (1fed6b5) to 98d280f (merge-base) using clickbench_partitioned
Benchmark for this request failed.
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs branch

🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs branch
Which issue does this PR close?
Rationale for this change
`ByteViewGroupValueBuilder::vectorized_append` did extra per-value work. It rebuilt views and copied long values one by one even when the input already had reusable byte views.

What changes are included in this PR?

- `vectorized_append` uses `extend` on the no-null fast path
- `take_n` after vectorized append

Are these changes tested?
Yes
Are there any user-facing changes?
No