Skip to content

Fix fully matched row groups with null counts#21907

Merged
alamb merged 3 commits intoapache:mainfrom
xudong963:fix-fully-matched-null-counts
May 4, 2026
Merged

Fix fully matched row groups with null counts#21907
alamb merged 3 commits intoapache:mainfrom
xudong963:fix-fully-matched-null-counts

Conversation

@xudong963
Copy link
Copy Markdown
Member

@xudong963 xudong963 commented Apr 29, 2026

Which issue does this PR close?

Rationale for this change

This is split out from review feedback on #21637. Row groups can only be marked fully matched when all rows are guaranteed to pass the filter. For nullable predicate columns, proving NOT(predicate) is not enough because rows where the predicate evaluates to NULL do not pass the filter.

What changes are included in this PR?

This PR makes the fully matched row-group proof conservative for nulls by adding IS NULL checks for nullable columns referenced by the predicate before evaluating the inverted pruning predicate.

It also threads with_missing_null_counts_as_zero through RowGroupPruningStatistics so normal row-group pruning keeps the existing default behavior, while fully matched proofs treat missing null counts as unknown. This reuses the existing statistics conversion path instead of adding a separate null-count conversion pass.

Are these changes tested?

Added a regression test covering row groups with known nulls, known zero nulls, and missing null counts.

Are there any user-facing changes?

No API changes. This only prevents false positives in the row-group fully matched optimization.

@github-actions github-actions Bot added the datasource Changes to the datasource crate label Apr 29, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xudong963 -- this seems better than what is on main, but I am not quite sure about the decision to pick

            missing_null_counts_as_zero: false,

vs

            missing_null_counts_as_zero: true,

parquet_schema,
row_group_metadatas,
arrow_schema,
missing_null_counts_as_zero: true,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this value also be passed down rather than hard coded?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also threads with_missing_null_counts_as_zero through RowGroupPruningStatistics so normal row-group pruning keeps the existing default behavior, while fully matched proofs treat missing null counts as unknown. This reuses the existing statistics conversion path instead of adding a separate null-count conversion pass.

I think this is a setting on the reader that controls how missing statistics are interpreted (as older versions of arrow-rs didn't write null counts when there were 0 nulls)

I am not sure why this code path is changing its value

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the current DataFusion parquet reader options and I don’t think this is currently exposed as a reader-level DataFusion option (We can open another PR for it!). The existing behavior was implicitly the StatisticsConverter default, which treats missing null counts as zero.

So the regular pruning path keeps that existing behavior.

The fully matched path is intentionally stricter because it is proving every row passes the predicate. For that proof, missing null counts must remain unknown; otherwise nullable columns with missing null_count can be incorrectly marked fully matched.

I’ll add comments/named constructors to make this distinction clearer.

.map(|&i| &groups[i])
.collect::<Vec<_>>(),
arrow_schema,
missing_null_counts_as_zero: false,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this always false? Maybe it is worth some comments explaining what is going on here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comments here ea255a6

@alamb alamb added this pull request to the merge queue May 4, 2026
@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 4, 2026

Thanks @xudong963

Merged via the queue into apache:main with commit b5c481f May 4, 2026
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants