Skip to content

Commit 206ac6b

Browse files
adriangbclaude
andauthored
refactor(pruning): remove column param from PruningStatistics::row_counts (#21369)
## Which issue does this PR close? N/A — standalone API improvement, prerequisite for #21157. ## Rationale for this change `PruningStatistics::row_counts(&self, column: &Column)` takes a column parameter, but row counts are container-level (same for all columns). 8 of 11 implementations ignore the parameter with `_column`. The Parquet impl (`RowGroupPruningStatistics`) unnecessarily constructs a `StatisticsConverter` from the column just to call `row_group_row_counts()`, which doesn't use the column at all. The existing code even has a comment acknowledging this: > "row counts are the same for all columns in a row group" And a test comment: > "This is debatable, personally I think `row_count` should not take a `Column` as an argument at all since all columns should have the same number of rows." ## What changes are included in this PR? **Breaking change**: `fn row_counts(&self, column: &Column) -> Option<ArrayRef>` becomes `fn row_counts(&self) -> Option<ArrayRef>`. - Remove `column` parameter from trait definition and all 11 implementations - `RowGroupPruningStatistics`: read `num_rows()` directly from row group metadata instead of routing through `StatisticsConverter` - `PrunableStatistics`: remove column-exists validation (row count is container-level) - Update all call sites and tests ## Are these changes tested? Yes — all existing tests updated and passing. The behavior change is: - `row_counts()` on `PrunableStatistics` now returns data even for non-existent columns (correct, since row count is container-level) - `RowGroupPruningStatistics::row_counts()` always returns row counts (previously could fail if column wasn't in Parquet schema) ## Are there any user-facing changes? Yes — breaking change to `PruningStatistics` trait. Downstream implementations need to remove the `column` parameter from their `row_counts` method. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 7fa7fe0 commit 206ac6b

7 files changed

Lines changed: 86 additions & 55 deletions

File tree

datafusion-examples/examples/data_io/parquet_index.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -462,7 +462,7 @@ impl PruningStatistics for ParquetMetadataIndex {
462462
}
463463

464464
/// return the row counts for each file
465-
fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
465+
fn row_counts(&self) -> Option<ArrayRef> {
466466
Some(self.row_counts_ref().clone())
467467
}
468468

datafusion-examples/examples/query_planning/pruning.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -174,7 +174,7 @@ impl PruningStatistics for MyCatalog {
174174
None
175175
}
176176

177-
fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
177+
fn row_counts(&self) -> Option<ArrayRef> {
178178
// In this example, we know nothing about the number of rows in each file
179179
None
180180
}

datafusion/common/src/pruning.rs

Lines changed: 34 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -95,15 +95,17 @@ pub trait PruningStatistics {
9595
/// [`UInt64Array`]: arrow::array::UInt64Array
9696
fn null_counts(&self, column: &Column) -> Option<ArrayRef>;
9797

98-
/// Return the number of rows for the named column in each container
99-
/// as an [`UInt64Array`].
98+
/// Return the number of rows in each container as an [`UInt64Array`].
99+
///
100+
/// Row counts are container-level (not column-specific) — the value
101+
/// is the same regardless of which column is being considered.
100102
///
101103
/// See [`Self::min_values`] for when to return `None` and null values.
102104
///
103105
/// Note: the returned array must contain [`Self::num_containers`] rows
104106
///
105107
/// [`UInt64Array`]: arrow::array::UInt64Array
106-
fn row_counts(&self, column: &Column) -> Option<ArrayRef>;
108+
fn row_counts(&self) -> Option<ArrayRef>;
107109

108110
/// Returns [`BooleanArray`] where each row represents information known
109111
/// about specific literal `values` in a column.
@@ -265,7 +267,7 @@ impl PruningStatistics for PartitionPruningStatistics {
265267
None
266268
}
267269

268-
fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
270+
fn row_counts(&self) -> Option<ArrayRef> {
269271
None
270272
}
271273

@@ -398,11 +400,7 @@ impl PruningStatistics for PrunableStatistics {
398400
}
399401
}
400402

401-
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
402-
// If the column does not exist in the schema, return None
403-
if self.schema.index_of(column.name()).is_err() {
404-
return None;
405-
}
403+
fn row_counts(&self) -> Option<ArrayRef> {
406404
if self
407405
.statistics
408406
.iter()
@@ -502,9 +500,9 @@ impl PruningStatistics for CompositePruningStatistics {
502500
None
503501
}
504502

505-
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
503+
fn row_counts(&self) -> Option<ArrayRef> {
506504
for stats in &self.statistics {
507-
if let Some(array) = stats.row_counts(column) {
505+
if let Some(array) = stats.row_counts() {
508506
return Some(array);
509507
}
510508
}
@@ -566,9 +564,9 @@ mod tests {
566564

567565
// Partition values don't know anything about nulls or row counts
568566
assert!(partition_stats.null_counts(&column_a).is_none());
569-
assert!(partition_stats.row_counts(&column_a).is_none());
567+
assert!(partition_stats.row_counts().is_none());
570568
assert!(partition_stats.null_counts(&column_b).is_none());
571-
assert!(partition_stats.row_counts(&column_b).is_none());
569+
assert!(partition_stats.row_counts().is_none());
572570

573571
// Min/max values are the same as the partition values
574572
let min_values_a =
@@ -709,9 +707,9 @@ mod tests {
709707

710708
// Partition values don't know anything about nulls or row counts
711709
assert!(partition_stats.null_counts(&column_a).is_none());
712-
assert!(partition_stats.row_counts(&column_a).is_none());
710+
assert!(partition_stats.row_counts().is_none());
713711
assert!(partition_stats.null_counts(&column_b).is_none());
714-
assert!(partition_stats.row_counts(&column_b).is_none());
712+
assert!(partition_stats.row_counts().is_none());
715713

716714
// Min/max values are all missing
717715
assert!(partition_stats.min_values(&column_a).is_none());
@@ -814,13 +812,13 @@ mod tests {
814812
assert_eq!(null_counts_b, expected_null_counts_b);
815813

816814
// Row counts are the same as the statistics
817-
let row_counts_a = as_uint64_array(&pruning_stats.row_counts(&column_a).unwrap())
815+
let row_counts_a = as_uint64_array(&pruning_stats.row_counts().unwrap())
818816
.unwrap()
819817
.into_iter()
820818
.collect::<Vec<_>>();
821819
let expected_row_counts_a = vec![Some(100), Some(200)];
822820
assert_eq!(row_counts_a, expected_row_counts_a);
823-
let row_counts_b = as_uint64_array(&pruning_stats.row_counts(&column_b).unwrap())
821+
let row_counts_b = as_uint64_array(&pruning_stats.row_counts().unwrap())
824822
.unwrap()
825823
.into_iter()
826824
.collect::<Vec<_>>();
@@ -845,20 +843,21 @@ mod tests {
845843
// This is debatable, personally I think `row_count` should not take a `Column` as an argument
846844
// at all since all columns should have the same number of rows.
847845
// But for now we just document the current behavior in this test.
848-
let row_counts_c = as_uint64_array(&pruning_stats.row_counts(&column_c).unwrap())
846+
let row_counts_c = as_uint64_array(&pruning_stats.row_counts().unwrap())
849847
.unwrap()
850848
.into_iter()
851849
.collect::<Vec<_>>();
852850
let expected_row_counts_c = vec![Some(100), Some(200)];
853851
assert_eq!(row_counts_c, expected_row_counts_c);
854852
assert!(pruning_stats.contained(&column_c, &values).is_none());
855853

856-
// Test with a column that doesn't exist
854+
// Test with a column that doesn't exist — column-specific stats
855+
// return None, but row_counts is container-level and still available
857856
let column_d = Column::new_unqualified("d");
858857
assert!(pruning_stats.min_values(&column_d).is_none());
859858
assert!(pruning_stats.max_values(&column_d).is_none());
860859
assert!(pruning_stats.null_counts(&column_d).is_none());
861-
assert!(pruning_stats.row_counts(&column_d).is_none());
860+
assert!(pruning_stats.row_counts().is_some());
862861
assert!(pruning_stats.contained(&column_d, &values).is_none());
863862
}
864863

@@ -886,8 +885,8 @@ mod tests {
886885
assert!(pruning_stats.null_counts(&column_b).is_none());
887886

888887
// Row counts are all missing
889-
assert!(pruning_stats.row_counts(&column_a).is_none());
890-
assert!(pruning_stats.row_counts(&column_b).is_none());
888+
assert!(pruning_stats.row_counts().is_none());
889+
assert!(pruning_stats.row_counts().is_none());
891890

892891
// Contained values are all empty
893892
let values = HashSet::from([ScalarValue::from(1i32)]);
@@ -1027,13 +1026,11 @@ mod tests {
10271026
let expected_null_counts_col_x = vec![Some(0), Some(10)];
10281027
assert_eq!(null_counts_col_x, expected_null_counts_col_x);
10291028

1030-
// Test row counts - only available from file statistics
1031-
assert!(composite_stats.row_counts(&part_a).is_none());
1032-
let row_counts_col_x =
1033-
as_uint64_array(&composite_stats.row_counts(&col_x).unwrap())
1034-
.unwrap()
1035-
.into_iter()
1036-
.collect::<Vec<_>>();
1029+
// Test row counts — container-level, available from file statistics
1030+
let row_counts_col_x = as_uint64_array(&composite_stats.row_counts().unwrap())
1031+
.unwrap()
1032+
.into_iter()
1033+
.collect::<Vec<_>>();
10371034
let expected_row_counts = vec![Some(100), Some(200)];
10381035
assert_eq!(row_counts_col_x, expected_row_counts);
10391036

@@ -1046,12 +1043,13 @@ mod tests {
10461043
// File statistics don't implement contained
10471044
assert!(composite_stats.contained(&col_x, &values).is_none());
10481045

1049-
// Non-existent column should return None for everything
1046+
// Non-existent column should return None for column-specific stats,
1047+
// but row_counts is container-level and still available
10501048
let non_existent = Column::new_unqualified("non_existent");
10511049
assert!(composite_stats.min_values(&non_existent).is_none());
10521050
assert!(composite_stats.max_values(&non_existent).is_none());
10531051
assert!(composite_stats.null_counts(&non_existent).is_none());
1054-
assert!(composite_stats.row_counts(&non_existent).is_none());
1052+
assert!(composite_stats.row_counts().is_some());
10551053
assert!(composite_stats.contained(&non_existent, &values).is_none());
10561054

10571055
// Verify num_containers matches
@@ -1155,7 +1153,7 @@ mod tests {
11551153
let expected_null_counts = vec![Some(0), Some(5)];
11561154
assert_eq!(null_counts, expected_null_counts);
11571155

1158-
let row_counts = as_uint64_array(&composite_stats.row_counts(&col_a).unwrap())
1156+
let row_counts = as_uint64_array(&composite_stats.row_counts().unwrap())
11591157
.unwrap()
11601158
.into_iter()
11611159
.collect::<Vec<_>>();
@@ -1195,11 +1193,10 @@ mod tests {
11951193
let expected_null_counts = vec![Some(10), Some(20)];
11961194
assert_eq!(null_counts, expected_null_counts);
11971195

1198-
let row_counts =
1199-
as_uint64_array(&composite_stats_reversed.row_counts(&col_a).unwrap())
1200-
.unwrap()
1201-
.into_iter()
1202-
.collect::<Vec<_>>();
1196+
let row_counts = as_uint64_array(&composite_stats_reversed.row_counts().unwrap())
1197+
.unwrap()
1198+
.into_iter()
1199+
.collect::<Vec<_>>();
12031200
let expected_row_counts = vec![Some(1000), Some(2000)];
12041201
assert_eq!(row_counts, expected_row_counts);
12051202
}

datafusion/datasource-parquet/src/page_filter.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -509,7 +509,7 @@ impl PruningStatistics for PagesPruningStatistics<'_> {
509509
}
510510
}
511511

512-
fn row_counts(&self, _column: &datafusion_common::Column) -> Option<ArrayRef> {
512+
fn row_counts(&self) -> Option<ArrayRef> {
513513
match self.converter.data_page_row_counts(
514514
self.offset_index,
515515
self.row_group_metadatas,

datafusion/datasource-parquet/src/row_group_filter.rs

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@ use std::collections::{HashMap, HashSet};
1919
use std::sync::Arc;
2020

2121
use super::{ParquetAccessPlan, ParquetFileMetrics};
22-
use arrow::array::{ArrayRef, BooleanArray};
22+
use arrow::array::{ArrayRef, BooleanArray, UInt64Array};
2323
use arrow::datatypes::Schema;
2424
use datafusion_common::pruning::PruningStatistics;
2525
use datafusion_common::{Column, Result, ScalarValue};
@@ -536,7 +536,7 @@ impl PruningStatistics for BloomFilterStatistics {
536536
None
537537
}
538538

539-
fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
539+
fn row_counts(&self) -> Option<ArrayRef> {
540540
None
541541
}
542542

@@ -626,13 +626,13 @@ impl PruningStatistics for RowGroupPruningStatistics<'_> {
626626
.map(|counts| Arc::new(counts) as ArrayRef)
627627
}
628628

629-
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
630-
// row counts are the same for all columns in a row group
631-
self.statistics_converter(column)
632-
.and_then(|c| Ok(c.row_group_row_counts(self.metadata_iter())?))
633-
.ok()
634-
.flatten()
635-
.map(|counts| Arc::new(counts) as ArrayRef)
629+
fn row_counts(&self) -> Option<ArrayRef> {
630+
// Row counts are container-level — read directly from row group metadata.
631+
let counts: UInt64Array = self
632+
.metadata_iter()
633+
.map(|rg| Some(rg.num_rows() as u64))
634+
.collect();
635+
Some(Arc::new(counts) as ArrayRef)
636636
}
637637

638638
fn contained(

datafusion/pruning/src/pruning_predicate.rs

Lines changed: 5 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -929,7 +929,7 @@ fn build_statistics_record_batch<S: PruningStatistics + ?Sized>(
929929
StatisticsType::Min => statistics.min_values(&column),
930930
StatisticsType::Max => statistics.max_values(&column),
931931
StatisticsType::NullCount => statistics.null_counts(&column),
932-
StatisticsType::RowCount => statistics.row_counts(&column),
932+
StatisticsType::RowCount => statistics.row_counts(),
933933
};
934934
let array = array.unwrap_or_else(|| new_null_array(data_type, num_containers));
935935

@@ -2300,11 +2300,10 @@ mod tests {
23002300
.unwrap_or(None)
23012301
}
23022302

2303-
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
2303+
fn row_counts(&self) -> Option<ArrayRef> {
23042304
self.stats
2305-
.get(column)
2306-
.map(|container_stats| container_stats.row_counts())
2307-
.unwrap_or(None)
2305+
.values()
2306+
.find_map(|container_stats| container_stats.row_counts())
23082307
}
23092308

23102309
fn contained(
@@ -2342,7 +2341,7 @@ mod tests {
23422341
None
23432342
}
23442343

2345-
fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
2344+
fn row_counts(&self) -> Option<ArrayRef> {
23462345
None
23472346
}
23482347

docs/source/library-user-guide/upgrading/54.0.0.md

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -289,6 +289,41 @@ auto-derefs through the `Arc`).
289289
> always return `None`. Use the `downcast_ref` method above instead, or
290290
> dereference through the `Arc` first with `plan.as_ref() as &dyn Any`.
291291
292+
### `PruningStatistics::row_counts` no longer takes a `column` parameter
293+
294+
The `row_counts` method on the `PruningStatistics` trait no longer takes a
295+
`&Column` argument, since row counts are a container-level property (the same
296+
for every column).
297+
298+
**Before:**
299+
300+
```rust,ignore
301+
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
302+
// ...
303+
}
304+
```
305+
306+
**After:**
307+
308+
```rust,ignore
309+
fn row_counts(&self) -> Option<ArrayRef> {
310+
// ...
311+
}
312+
```
313+
314+
**Who is affected:**
315+
316+
- Users who implement the `PruningStatistics` trait
317+
318+
**Migration guide:**
319+
320+
Remove the `column: &Column` parameter from your `row_counts` implementation
321+
and any corresponding call sites. If your implementation was using the column
322+
argument, note that row counts are identical for all columns in a container, so
323+
the parameter was unnecessary.
324+
325+
See [PR #21369](https://github.com/apache/datafusion/pull/21369) for details.
326+
292327
### Avro API and timestamp decoding changes
293328

294329
DataFusion has switched to use `arrow-avro` (see [#17861]) when reading avro files

0 commit comments

Comments
 (0)