Skip to content

perf: selective column concat in hash join build side#21735

Open
SubhamSinghal wants to merge 2 commits intoapache:mainfrom
SubhamSinghal:selective-column-concat-hash-join
Open

perf: selective column concat in hash join build side#21735
SubhamSinghal wants to merge 2 commits intoapache:mainfrom
SubhamSinghal:selective-column-concat-hash-join

Conversation

@SubhamSinghal
Copy link
Copy Markdown
Contributor

@SubhamSinghal SubhamSinghal commented Apr 19, 2026

Which issue does this PR close?

related to: #18942

Rationale for this change

In CollectLeft hash joins, concat_batches copies all columns from the build side into a single RecordBatch, even when only a subset is needed for the join output, filter evaluation, and key computation. For wide tables (20+ columns), this wastes significant memory and CPU. Savings = (total_columns - needed_columns) / total_columns
For ex:
For a 20-column table needing 3 columns: skips 85% of the copy
For a 10-column table needing 8 columns: skips 20% of the copy
For a 5-column table needing 5 columns: skips 0% (short-circuit)
This PR projects build-side batches to only the needed columns before concat_batches, reducing both peak memory and copy time.

What changes are included in this PR?

datafusion/physical-plan/src/joins/hash_join/exec.rs:

  • compute_build_side_projection() — determines which build-side columns are actually needed (union of output columns, filter columns, and join key expression columns)
  • remap_column_indices() — translates original column indices to projected positions
  • evaluate_and_concat_per_batch() — evaluates join key expressions per-batch before projection, then concatenates result arrays (only used when projection is active)
  • Modified collect_left_input() and try_create_array_map() to project batches before concat_batches when a column subset suffices
  • Added build_column_remap field to JoinLeftData to carry the remap table downstream

datafusion/physical-plan/src/joins/hash_join/stream.rs:

  • In collect_build_side(), remaps column_indices and filter column_indices when build-side projection is active

Are these changes tested?

Yes — covered by existing tests

No new tests added since this is an internal optimization that doesn't change observable behavior. The existing test suite covers all join types, partition modes, filter combinations, empty build sides, and outer join unmatched-row handling.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Apr 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant