fix: FilterPushdown incorrectly remaps filters through ProjectionExec with duplicate column names #21247

Open
yashrb24 wants to merge 1 commit into apache:main from yashrb24:fix/filter-pushdown-duplicate-columns

yashrb24 commented Mar 30, 2026

Which issue does this PR close?

Rationale for this change

ProjectionExec::gather_filters_for_pushdown silently rewrites filter predicates to the wrong source column when the output schema contains duplicate column names — a structure that arises above joins where both sides share a column name. Two functions use name-only schema lookups (column_with_name and index_of) that always return the first match, which is incorrect when duplicate names exist:

  1. collect_reverse_alias: a HashMap key collision causes the second duplicate to overwrite the first.
  2. FilterRemapper::try_remap: index_of silently rewrites column indices from non-first duplicates to position 0.
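The two failure modes above can be reproduced without DataFusion. The following is a minimal sketch, assuming a schema modeled as a plain slice of column names; `first_index_of` and `buggy_alias_map` are hypothetical stand-ins written for illustration, not the Arrow or DataFusion functions themselves.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a name-only schema lookup:
/// returns the position of the FIRST field with the given name.
fn first_index_of(schema: &[&str], name: &str) -> Option<usize> {
    schema.iter().position(|field| *field == name)
}

/// Mimics the buggy pattern: the map key is (name, index found by name),
/// so every duplicate name collapses onto the index of the first match.
fn buggy_alias_map(
    output_names: &[&str],
    source_names: &[&str],
) -> HashMap<(String, usize), String> {
    let mut map = HashMap::new();
    for (&name, &source) in output_names.iter().zip(source_names.iter()) {
        // Name-only lookup: the second "id" also resolves to index 0.
        let idx = first_index_of(output_names, name).expect("name comes from the schema");
        // Same key as the previous duplicate, so this insert overwrites it.
        map.insert((name.to_string(), idx), source.to_string());
    }
    map
}
```

With output names `["id", "id"]` mapped from the hypothetical sources `["left.id", "right.id"]`, the map ends up with a single entry, keyed `("id", 0)` and pointing at `right.id`.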

This code path is not exercised through normal SQL because the logical optimizer's PushDownFilter resolves qualified column references and pushes filters below projections before the physical plan is created. However, it affects any direct construction of physical plans.

What changes are included in this PR?

  1. collect_reverse_alias: Use enumerate() index instead of column_with_name(). Projection expressions are positionally aligned with the output schema, so idx is the correct output column index.

  2. gather_filters_for_pushdown: Replace FilterRemapper::try_remap (which uses index_of) with direct validation against the alias map's exact (name, index) keys. The PhysicalColumnRewriter already does an exact-key lookup, so try_remap was redundant for this case.
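The shape of both changes can be sketched in plain Rust. This is an illustration under stated assumptions, not the PR's code: the schema is modeled as a slice of names, and `reverse_alias_by_position` and `remap_column` are hypothetical helper names.

```rust
use std::collections::HashMap;

/// Builds the reverse alias map keyed by the exact (output name, output index)
/// pair. The index comes from `enumerate()`, relying on projection expressions
/// being positionally aligned with the output schema.
fn reverse_alias_by_position(
    output_names: &[&str],
    source_names: &[&str],
) -> HashMap<(String, usize), String> {
    output_names
        .iter()
        .zip(source_names.iter())
        .enumerate()
        .map(|(idx, (&name, &source))| ((name.to_string(), idx), source.to_string()))
        .collect()
}

/// Validates a filter column against the exact (name, index) key.
/// `None` means the column is not in the alias map, so the filter
/// must not be pushed through the projection.
fn remap_column(
    alias_map: &HashMap<(String, usize), String>,
    name: &str,
    index: usize,
) -> Option<String> {
    alias_map.get(&(name.to_string(), index)).cloned()
}
```

Because the index is part of the key, two output columns named `id` keep separate entries, and a filter on the second one resolves to its own source rather than to the first match.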

Are these changes tested?

Yes. A regression test is added that constructs the exact physical plan structure triggering the bug (FilterExec → ProjectionExec with duplicate column names → HashJoinExec), runs the FilterPushdown optimizer, and verifies the optimized plan returns correct results (3 rows instead of the previous 0).
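To see why a misdirected filter changes the row count, here is a toy model of the symptom. The rows and the predicate are invented for illustration and are not taken from the PR's regression test; `sample_rows` and `count_matches` are hypothetical helpers.

```rust
/// Invented join output with two columns that share a name:
/// column 0 plays the left-side `id`, column 1 the right-side `id`.
fn sample_rows() -> Vec<[i32; 2]> {
    vec![[1, 10], [2, 10], [3, 10]]
}

/// Counts the rows whose value in column `col` equals `wanted`.
fn count_matches(rows: &[[i32; 2]], col: usize, wanted: i32) -> usize {
    rows.iter().filter(|row| row[col] == wanted).count()
}
```

Filtering this data on column 1 with `wanted = 10` keeps every row, while the same predicate remapped to column 0 keeps none. A result set that collapses to empty is the kind of symptom the regression test checks for.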

Are there any user-facing changes?

No API changes. Fixes incorrect query results for physical plans with duplicate column names in projections.

github-actions bot added the core (Core DataFusion crate) and physical-plan (Changes to the physical-plan crate) labels Mar 30, 2026
yashrb24 force-pushed the fix/filter-pushdown-duplicate-columns branch from 9bca0ef to 5a28beb April 23, 2026 18:34

Development

Successfully merging this pull request may close these issues.

FilterPushdown physical optimizer incorrectly remaps filters through ProjectionExec with duplicate column names