perf: selective column concat in hash join build side by SubhamSinghal · Pull Request #21735 · apache/datafusion

SubhamSinghal · 2026-04-19T17:58:05Z

Which issue does this PR close?

related to: #18942

Rationale for this change

In CollectLeft hash joins, concat_batches copies all columns from the build side into a single RecordBatch, even when only a subset is needed for the join output, filter evaluation, and key computation. For wide tables (20+ columns), this wastes significant memory and CPU. Savings = (total_columns - needed_columns) / total_columns
For ex:
For a 20-column table needing 3 columns: skips 85% of the copy
For a 10-column table needing 8 columns: skips 20% of the copy
For a 5-column table needing 5 columns: skips 0% (short-circuit)
This PR projects build-side batches to only the needed columns before concat_batches, reducing both peak memory and copy time.

What changes are included in this PR?

datafusion/physical-plan/src/joins/hash_join/exec.rs:

compute_build_side_projection() — determines which build-side columns are actually needed (union of output columns, filter columns, and join key expression columns)
remap_column_indices() — translates original column indices to projected positions
evaluate_and_concat_per_batch() — evaluates join key expressions per-batch before projection, then concatenates result arrays (only used when projection is active)
Modified collect_left_input() and try_create_array_map() to project batches before concat_batches when a column subset suffices
Added build_column_remap field to JoinLeftData to carry the remap table downstream

datafusion/physical-plan/src/joins/hash_join/stream.rs:

In collect_build_side(), remaps column_indices and filter column_indices when build-side projection is active

Are these changes tested?

Yes — covered by existing tests

No new tests added since this is an internal optimization that doesn't change observable behavior. The existing test suite covers all join types, partition modes, filter combinations, empty build sides, and outer join unmatched-row handling.

Are there any user-facing changes?

No.

Subham Singhal and others added 2 commits April 19, 2026 23:26

perf: selective column concat in hash join build side

e1b348a

Merge branch 'main' into selective-column-concat-hash-join

2fd3f61

github-actions bot added the physical-plan Changes to the physical-plan crate label Apr 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: selective column concat in hash join build side#21735

perf: selective column concat in hash join build side#21735
SubhamSinghal wants to merge 2 commits intoapache:mainfrom
SubhamSinghal:selective-column-concat-hash-join

SubhamSinghal commented Apr 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SubhamSinghal commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SubhamSinghal commented Apr 19, 2026 •

edited

Loading