
Query-aware statistics requests via ScanArgs / ScanResult (RFC for #21624) #21996

Draft

adriangb wants to merge 2 commits into apache:main from pydantic:worktree-stats-request-api

Conversation

@adriangb
Contributor

@adriangb adriangb commented May 3, 2026

Which issue does this PR close?

RFC for #21624.

Rationale for this change

Today, statistics flow through DataFusion as an all-or-nothing dense Statistics struct: collect_statistics=true reads parquet thrift footers for every column of every file, allocates a Vec<ColumnStatistics> of length num_columns per file, and stores it whether the query references those columns or not. On wide tables that's a lot of memory and IO that the query may not need.

This is even worse for 3rd-party TableProvider implementations that may store statistics in an external catalog or in Parquet files in object store (think Delta/Iceberg, a Postgres-backed table, etc.). These implementations could efficiently pull statistics for 1-2 columns but are forced to pull all statistics since they don't know which ones DataFusion wants, leading to a lot of wasted work.

This PR proposes a small handshake on the existing scan_with_args API that lets a caller ask a TableProvider for specific stats by name, and lets the provider answer only what it can deliver cheaply. The shape is intentionally minimal — enough to unblock memory and third-party wins on its own, with room to grow toward sketches, histograms, and selectivity stats later.

What changes are included in this PR?

The change is split into two stacked commits.

My goal was to implement the APIs and statistics collection, with FilePruner as the first consumer. Porting all other consumers is a larger piece of work that I think should be done across multiple PRs.

Commit 1: catalog: query-aware statistics requests via ScanArgs / ScanResult

The API surface itself.

New types in datafusion-expr-common::statistics

pub enum StatisticsRequest {
    Min(Column), Max(Column),
    NullCount(Column), DistinctCount(Column),
    Sum(Column), ByteSize(Column),
    RowCount, TotalByteSize,
}

pub enum StatisticsValue {
    Scalar(Precision<ScalarValue>),
    Distribution(Box<Distribution>),  // typed; reuses existing enum
    Absent,
}

pub type SatisfiedStatistics =
    HashMap<StatisticsRequest, StatisticsValue>;

The variants of StatisticsRequest mirror the fields of Statistics / ColumnStatistics, so a provider that already populates one can answer the other trivially. Whether a value is exact or estimated travels in the Precision wrapper, not in the request kind itself.

ScanArgs / ScanResult extension

ScanArgs::with_statistics_requests(requests: &[StatisticsRequest])
ScanArgs::statistics_requests() -> &[StatisticsRequest]
ScanResult::with_statistics(statistics: Vec<StatisticsValue>)
ScanResult::statistics() -> &[StatisticsValue]

The contract: "answer what's free, leave the rest as Absent."
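
As a self-contained illustration of that contract (simplified stand-ins only: String for Column, a bare i64 for the Precision<ScalarValue> wrapper; the shape and names mirror the PR):

use std::collections::HashMap;

// Simplified stand-ins for the proposed types; the real ones carry
// `Column` and `Precision<ScalarValue>` rather than `String` and `i64`.
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
enum StatisticsRequest {
    Min(String),
    Max(String),
    NullCount(String),
    DistinctCount(String),
    RowCount,
}

#[derive(Debug)]
enum StatisticsValue {
    Scalar(i64),
    Absent,
}

type SatisfiedStatistics = HashMap<StatisticsRequest, StatisticsValue>;

// A provider answers only what it already has on hand ("what's free")
// and maps everything else to `Absent`.
fn answer(requests: &[StatisticsRequest]) -> SatisfiedStatistics {
    requests
        .iter()
        .map(|req| {
            let value = match req {
                StatisticsRequest::RowCount => StatisticsValue::Scalar(1_000),
                StatisticsRequest::Min(col) if col == "ts" => StatisticsValue::Scalar(0),
                _ => StatisticsValue::Absent,
            };
            (req.clone(), value)
        })
        .collect()
}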

Per-file sparse stats

PartitionedFile.satisfied_stats: Option<Arc<SatisfiedStatistics>>
PartitionedFile::with_satisfied_stats(...)

Memory scales with what was asked for, not with table width.
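
Continuing the simplified sketch above, the per-file field is just an optional shared map (the struct here is a stand-in; the real type is PartitionedFile, and the builder name comes from this PR):

use std::sync::Arc;

// Stand-in mirroring the new field on `PartitionedFile`: two requested
// stats mean two map entries, however wide the table is.
struct FileSketch {
    path: String,
    satisfied_stats: Option<Arc<SatisfiedStatistics>>,
}

impl FileSketch {
    // Mirrors the proposed `PartitionedFile::with_satisfied_stats`.
    fn with_satisfied_stats(mut self, stats: Arc<SatisfiedStatistics>) -> Self {
        self.satisfied_stats = Some(stats);
        self
    }
}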

FilePruner consumes the sparse map

FilePruner learns to dispatch between the existing PrunableStatistics (dense) and a new SparseFilePruningStats adapter that does direct request-keyed lookups against the sparse map. No densify-then-throw-away.
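
In simplified terms (types from the sketch above), the sparse adapter's lookups are plain map probes:

// Direct request-keyed probe: no dense `Statistics` is rebuilt just to
// read one column's minimum.
fn min_of(stats: &SatisfiedStatistics, col: &str) -> Option<i64> {
    match stats.get(&StatisticsRequest::Min(col.to_string())) {
        Some(StatisticsValue::Scalar(v)) => Some(*v),
        _ => None,
    }
}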

ListingTable answers from footer metadata

When a caller passes with_statistics_requests(...), scan_with_args populates ScanResult.statistics from the merged dense Statistics it already touched. Min/Max/NullCount/RowCount/TotalByteSize come back as the Precision value the format produced. DistinctCount/Sum/ByteSize come back as Absent for parquet, since those aren't in thrift footers.

Commit 2: optimizer: derive StatisticsRequests from logical plan

The producer side: now the planner actually asks.

  • TableScan gains statistics_requests: Vec<StatisticsRequest> and a
    with_statistics_requests builder.

  • New RequestStatistics OptimizerRule runs last in the default
    pipeline. Walks the optimized plan once, derives:

    | node   | requests                                                 |
    |--------|----------------------------------------------------------|
    | Sort   | Min / Max / NullCount on each sort key                   |
    | Filter | Min / Max / NullCount / DistinctCount on referenced cols |
    | Join   | DistinctCount / NullCount on join keys (both sides)      |
    | always | RowCount per scan                                        |

    Idempotent; never reshapes the plan, only annotates TableScans.

  • DefaultPhysicalPlanner reads scan.statistics_requests and threads
    them into ScanArgs::with_statistics_requests when calling
    provider.scan_with_args.
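
For intuition, applying the rule table above to SELECT * FROM t WHERE a > 5 ORDER BY b would annotate the scan of t roughly as follows (simplified request type from the earlier sketch, inside some function body; this is my reading of the rules, not output produced by the PR):

use StatisticsRequest::*;

let requests = vec![
    // Filter on `a` -> Min / Max / NullCount / DistinctCount
    Min("a".into()),
    Max("a".into()),
    NullCount("a".into()),
    DistinctCount("a".into()),
    // Sort key `b` -> Min / Max / NullCount
    Min("b".into()),
    Max("b".into()),
    NullCount("b".into()),
    // Always requested, once per scan
    RowCount,
];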

End to end: SQL → optimizer → TableScan.statistics_requests →
physical planner → provider.scan_with_args(ScanArgs { statistics_requests, .. }) →
provider returns whatever it can on ScanResult.statistics →
caller (today: nothing built-in; tomorrow: a layered stats helper or
optimizer rule) consumes.

Future work

Advanced stats not available from Parquet footers

In designing this change I wanted to keep in mind some of the other goals discussed in #21624.
In particular, I think we should develop a story for being able to collect more advanced planning statistics ad-hoc when they are not available.

I was inspired by Floe SQL's CMU talk.
The idea would be:

  1. From the logical plan we derive statistics that we think we'll need.
  2. We pass them into the TableProvider implementations (this PR) and they answer what they can. E.g. if your metadata store keeps track of most common values for a column, congrats! Populate it from there. But if you have a bunch of random Parquet files just leave it blank.
  3. Now we have a physical plan + a bunch of missing stats we want. Maybe we augment this by asking the physical optimizers what stats they want, etc. But now we do something similar to what Floe does: we run a mini query to sample data and collect statistics. Some things that make this tractable, which I've prototyped:
    a) File -> Row group -> Row chunk sampling. Basically a semi-efficient way to say "get me a random sample of 1% of the data". I started POC work on this in feat: TABLESAMPLE SYSTEM end-to-end + row-group / row sampling on ParquetSource #22000, which also stands on its own.
    b) Time- and row-bounded scanning. This sampled data is a stream that flows into aggregators that compute stats. We put a time (say 150ms) and row count (say 100k) limit on this stream. This guarantees that our worst-case cost is constant. If we bail out early we can emit our best estimate up to that point or leave the stats unpopulated (a minimal sketch follows this list).
    We can also play all sorts of games with timing, e.g. we can start the query and the stats collection and race them. If the query finishes before the stats collection (possible if e.g. there's a single file and opening the file is the dominating cost) then we just cancel stats collection. If stats collection finishes first we can re-optimize the plan, and if the new plan is "better" enough that we think it's worth the price, we cancel and re-run the query. I imagine this might be a good config knob.
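
A minimal sketch of the time- and row-bounded aggregation described in (b). This is pure illustration with no DataFusion APIs; the budget values and the min/max fold are stand-ins:

use std::time::{Duration, Instant};

/// Fold a row stream into a running (min, max) under a time and row
/// budget. Returns the best estimate so far plus a flag saying whether
/// the input was fully consumed (exact) or we bailed out (partial).
fn bounded_min_max(
    rows: impl Iterator<Item = i64>,
    budget: Duration,
    row_cap: usize,
) -> (Option<(i64, i64)>, bool) {
    let deadline = Instant::now() + budget;
    let mut acc: Option<(i64, i64)> = None;
    for (seen, v) in rows.enumerate() {
        acc = Some(match acc {
            None => (v, v),
            Some((lo, hi)) => (lo.min(v), hi.max(v)),
        });
        // Either budget exhausted: return the partial estimate (the
        // caller may instead choose to leave the stat unpopulated).
        if seen + 1 >= row_cap || Instant::now() >= deadline {
            return (acc, false);
        }
    }
    (acc, true)
}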

I also imagine there might be some hooks in the implementation of how to collect those stats. A default one we bundle might do the sampling from parquet files, but I can imagine users may want to hook in some catalog of stats they have if they deem it too expensive to run as part of TableProvider::scan_with_args, e.g. if they store a pre-sampled set of rows or advanced statistics like sketches (which can't be represented by Parquet's simplistic stats model) as an embedded index.

Expression based statistics

Discussed in #20871. We represent struct field access in DataFusion as an expression, get_field(col, 'field'); it is not a column. Our existing statistics system cannot represent statistics for a struct field (only for the top-level struct, which is meaningless most of the time) even though Parquet carries stats for leaf columns of nested types.

This proposal would make it much easier to implement this: we would need to expand StatisticsRequest with Min(Expr)/Max(Expr), etc. We could fold Column into there or keep it as a special case. This mirrors what engines like Postgres can do (you can tell them to collect and store stats on expressions). Then a PartitionedFile could answer file.stats.get(Min(Expr(get_field(col, 'field')))) -> StatisticsValue::Scalar(...) (pseudocode, glossing over UDF invocation ceremony).
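
A hypothetical shape for that extension, assuming some Expr type that is Hash + Eq (illustrative only; this PR keeps Column):

// Not part of this PR: a possible expression-keyed request. Any `Expr`
// that is `Hash + Eq` could key the same kind of sparse map.
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
enum ExprStatisticsRequest<Expr> {
    Min(Expr),
    Max(Expr),
    NullCount(Expr),
    RowCount,
}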

Relationship to other PRs

There is a lot of work being done around statistics, I will mention just the ones that stand out in my mind.

This PR deals with 2 things:

  1. How do we decide and represent what statistics to collect?
  2. How do we get those statistics from metadata stores or our data (for Parquet)?

It does not deal with:

  1. Using those stats for pruning.
  2. Propagating stats through expressions or execution plans.

#19609

This PR deals with statistics propagation and statistics representation.
I don't think there is much conflict in propagation (the current PR does not deal with that), but the representation piece is interesting. I can see a world where the HashMap<StatisticsRequest, StatisticsValue> is instead a RecordBatch plus a wrapper that offers an API like stats.get([Min("column"), Max("other column")]) but knows how to look that up in the RecordBatch and produce a new RecordBatch as a result, usable for vectorized operations on statistics.

Right now I am more concerned with the IO / memory cost of stats and with having APIs to understand what stats are needed, request them, and represent advanced stats, than I am with vectorized execution on stats. In my timings from #21968 (comment), stats computation (e.g. PruningPredicate) is not a major issue; it's stats collection that costs us a lot. I've seen similar things in our system, where we store stats in Parquet files. I also have some trouble grasping how things like sketches or histograms would fit into arrow arrays, and what the advantages of storing them there would be.

#21122

This work also deals with statistics propagation. It is orthogonal, but there is some overlap worth discussing. In particular, in the section above about struct field stats I proposed we allow statistics requests to contain expressions. It then becomes unclear, in the case of JOIN t2 ON coalesce(t1.col, '') = t2.col, whether we should use the APIs in this PR to request StatisticsRequest::Min(Expr(coalesce(t1.col, ''))) or request stats on t1.col and use something like #21122 to propagate them.

I think this won't be that hard to resolve: the get_field functions are already special-cased via

/// Returns placement information for this function.
///
/// This is used by optimizers to make decisions about expression placement,
/// such as whether to push expressions down through projections.
///
/// The default implementation returns [`ExpressionPlacement::KeepInPlace`],
/// meaning the expression should be kept where it is in the plan.
///
/// Override this method to indicate that the function can be pushed down
/// closer to the data source.
fn placement(&self, _args: &[ExpressionPlacement]) -> ExpressionPlacement {
    ExpressionPlacement::KeepInPlace
}

Maybe we can use that existing API? Or we can come up with something new. But ultimately it seems resolvable to split expressions into ones we ask the data source for vs. ones we derive.

Are there any user-facing changes?

API additions only, all opt-in:

  • ScanArgs / ScanResult gain new fields with Default-friendly initializers; existing callers that don't use the new builders see no change.
  • FilePruner's field-type change is internal (a private field).

The only source-level break: PartitionedFile gains a new pub satisfied_stats: Option<...> field. Callers using PartitionedFile::new / From<ObjectMeta> / the existing builders are unaffected. Direct struct literals (uncommon, none in-tree) need to add satisfied_stats: None or migrate to the new with_satisfied_stats builder.

@github-actions github-actions Bot added catalog Related to the catalog crate common Related to common crate datasource Changes to the datasource crate labels May 3, 2026
@github-actions

github-actions Bot commented May 3, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning origin/main
    Building datafusion v53.1.0 (current)
       Built [  82.621s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.033s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  80.185s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.034s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.636s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 165.513s] datafusion
    Building datafusion-catalog v53.1.0 (current)
       Built [  35.955s] (current)
     Parsing datafusion-catalog v53.1.0 (current)
      Parsed [   0.032s] (current)
    Building datafusion-catalog v53.1.0 (baseline)
       Built [  36.756s] (baseline)
     Parsing datafusion-catalog v53.1.0 (baseline)
      Parsed [   0.027s] (baseline)
    Checking datafusion-catalog v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.131s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  74.257s] datafusion-catalog
    Building datafusion-catalog-listing v53.1.0 (current)
       Built [  41.187s] (current)
     Parsing datafusion-catalog-listing v53.1.0 (current)
      Parsed [   0.011s] (current)
    Building datafusion-catalog-listing v53.1.0 (baseline)
       Built [  40.929s] (baseline)
     Parsing datafusion-catalog-listing v53.1.0 (baseline)
      Parsed [   0.011s] (baseline)
    Checking datafusion-catalog-listing v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.087s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  83.735s] datafusion-catalog-listing
    Building datafusion-datasource v53.1.0 (current)
       Built [  35.308s] (current)
     Parsing datafusion-datasource v53.1.0 (current)
      Parsed [   0.028s] (current)
    Building datafusion-datasource v53.1.0 (baseline)
       Built [  35.283s] (baseline)
     Parsing datafusion-datasource v53.1.0 (baseline)
      Parsed [   0.029s] (baseline)
    Checking datafusion-datasource v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.252s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field PartitionedFile.satisfied_stats in /home/runner/work/datafusion/datafusion/datafusion/datasource/src/mod.rs:153

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  72.278s] datafusion-datasource
    Building datafusion-expr v53.1.0 (current)
       Built [  25.168s] (current)
     Parsing datafusion-expr v53.1.0 (current)
      Parsed [   0.069s] (current)
    Building datafusion-expr v53.1.0 (baseline)
       Built [  25.208s] (baseline)
     Parsing datafusion-expr v53.1.0 (baseline)
      Parsed [   0.070s] (baseline)
    Checking datafusion-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.107s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field TableScan.statistics_requests in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2777

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  52.882s] datafusion-expr
    Building datafusion-expr-common v53.1.0 (current)
       Built [  18.056s] (current)
     Parsing datafusion-expr-common v53.1.0 (current)
      Parsed [   0.016s] (current)
    Building datafusion-expr-common v53.1.0 (baseline)
       Built [  18.294s] (baseline)
     Parsing datafusion-expr-common v53.1.0 (baseline)
      Parsed [   0.017s] (baseline)
    Checking datafusion-expr-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.194s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  37.534s] datafusion-expr-common
    Building datafusion-optimizer v53.1.0 (current)
       Built [  25.705s] (current)
     Parsing datafusion-optimizer v53.1.0 (current)
      Parsed [   0.027s] (current)
    Building datafusion-optimizer v53.1.0 (baseline)
       Built [  25.773s] (baseline)
     Parsing datafusion-optimizer v53.1.0 (baseline)
      Parsed [   0.027s] (baseline)
    Checking datafusion-optimizer v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.168s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  52.874s] datafusion-optimizer
    Building datafusion-proto v53.1.0 (current)
       Built [  52.215s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.137s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  52.927s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.137s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.918s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 109.646s] datafusion-proto
    Building datafusion-pruning v53.1.0 (current)
       Built [  36.009s] (current)
     Parsing datafusion-pruning v53.1.0 (current)
      Parsed [   0.011s] (current)
    Building datafusion-pruning v53.1.0 (baseline)
       Built [  36.550s] (baseline)
     Parsing datafusion-pruning v53.1.0 (baseline)
      Parsed [   0.012s] (baseline)
    Checking datafusion-pruning v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.073s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  74.136s] datafusion-pruning
    Building datafusion-sqllogictest v53.1.0 (current)
       Built [ 135.477s] (current)
     Parsing datafusion-sqllogictest v53.1.0 (current)
      Parsed [   0.022s] (current)
    Building datafusion-sqllogictest v53.1.0 (baseline)
       Built [ 136.168s] (baseline)
     Parsing datafusion-sqllogictest v53.1.0 (baseline)
      Parsed [   0.021s] (baseline)
    Checking datafusion-sqllogictest v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.087s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 274.887s] datafusion-sqllogictest

@adriangb adriangb force-pushed the worktree-stats-request-api branch from ea101ed to d56a20f on May 3, 2026 12:52
@github-actions github-actions Bot added the logical-expr Logical plan and expressions label May 3, 2026
Comment thread datafusion/catalog-listing/src/table.rs Outdated
Comment on lines +597 to +603
// Answer any requested stats from the table-level metadata we
// already touched. Anything not derivable from the dense
// `Statistics` we computed comes back as `Absent`. Skipped
// entirely when the caller didn't ask. We also skip when
// `collect_statistics=false` — the contract is "answer what's
// free", and computing stats here just to populate this map
// would violate that.
Contributor Author

Long term I think it'd be good to get rid of the dense statistics (or as a first step only create them ephemerally when we read them from the footer) but that kind of has to happen after there are no more consumers. It seemed easier to implement the sparse stats deriving from the dense stats for now.

Comment thread datafusion/datasource/src/mod.rs Outdated
/// manifests, Hive Metastore, custom catalogs) can populate this
/// directly without rebuilding a full dense `Statistics`.
pub satisfied_stats:
    Option<Arc<std::collections::HashMap<StatisticsRequest, StatisticsValue>>>,
Contributor Author

A type alias for this type might be nice.

let file_stats = partitioned_file.statistics.as_ref()?;
let file_stats_pruning =
    PrunableStatistics::new(vec![file_stats.clone()], Arc::clone(file_schema));
let file_stats_pruning: Box<dyn PruningStatistics + Send + Sync> =
Contributor Author

PruningPredicate also uses this trait, so it might not be that hard to port there too!

@adriangb adriangb force-pushed the worktree-stats-request-api branch from d56a20f to fce861f on May 3, 2026 13:20
@github-actions github-actions Bot removed the common Related to common crate label May 3, 2026
@github-actions github-actions Bot added optimizer Optimizer rules core Core DataFusion crate proto Related to proto crate labels May 3, 2026
@adriangb adriangb force-pushed the worktree-stats-request-api branch 3 times, most recently from 68cab2b to 8b78b77 on May 3, 2026 15:18
@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label May 3, 2026
Adds an opt-in handshake that lets callers ask a `TableProvider` for
specific stats by name and receive only what the provider can answer
cheaply, instead of the all-or-nothing dense `Statistics` we have today.

## What's new

* `datafusion-common::stats::StatisticsRequest` — enum of stat kinds
  that mirror `Statistics` / `ColumnStatistics` (Min, Max, NullCount,
  DistinctCount, Sum, ByteSize, RowCount, TotalByteSize). `Hash + Eq`
  so it can key a `HashMap`.

* `datafusion-common::stats::StatisticsValue` — `Scalar(Precision<...>)
  | Distribution(Arc<dyn Any>) | Sketch(Arc<dyn Any>) | Absent`. Whether
  a value is exact or estimated travels in the `Precision` wrapper, not
  the variant.

* `ScanArgs::with_statistics_requests` / `statistics_requests()` — the
  caller's question.

* `ScanResult::with_statistics` / `statistics()` / `into_parts()` — the
  provider's answer, paired 1:1 with the requests slice.

* `PartitionedFile::satisfied_stats` — sparse,
  `Arc<HashMap<StatisticsRequest, StatisticsValue>>` for per-file
  answers. Memory scales with what was asked, not with table width.
  Providers that store stats out-of-band (Delta/Iceberg/Hudi manifests,
  Hive Metastore, custom catalogs) can populate this directly without
  rebuilding a full dense `Statistics`.

* `FilePruner` learns to consume the sparse map. Internally,
  `file_stats_pruning` is now `Box<dyn PruningStatistics + Send + Sync>`
  so we can dispatch between the existing `PrunableStatistics` (dense)
  and a new `SparseFilePruningStats` adapter (sparse). The sparse
  adapter looks up each `StatisticsRequest` directly in the map and
  materializes single-row arrays only for the columns the pruning
  predicate touches — no densify-then-throw-away.

* `ListingTable::scan_with_args` populates `ScanResult.statistics` from
  the merged dense `Statistics` it already computed when
  `args.statistics_requests()` is set and `collect_statistics=true`.
  When `collect_statistics=false` it returns `Absent` for everything
  (the contract is "answer what's free"). `DistinctCount`/`Sum`/
  `ByteSize` are likewise `Absent` for parquet — those aren't in
  thrift footers; layered helpers (or richer providers) can fill the
  gaps.

## Backwards compat

All additions are opt-in:

* `ScanArgs` / `ScanResult` gain new fields with `Default`-friendly
  initializers; existing callers that don't use the new builders see
  no change.
* `FilePruner`'s field-type change is internal (private field).
* The only minor source-level break is a new pub field on
  `PartitionedFile` (`satisfied_stats`). Callers using
  `PartitionedFile::new` / `From<ObjectMeta>` / the existing builders
  are unaffected. Direct struct literals — uncommon, none in-tree —
  need to add `satisfied_stats: None` (or use the new
  `with_satisfied_stats` builder).

## Tests

* `datafusion-common::stats::tests::statistics_request_is_hashable_keyable`
  — round-trip a `StatisticsRequest` through a `HashMap`.
* `datafusion-pruning::file_pruner::tests` — three tests demonstrating
  end-to-end pruning against a sparse-only `PartitionedFile` (`x > 100`
  prunes a `[10, 20]` file, `x > 15` doesn't, no stats at all → no
  pruner).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the worktree-stats-request-api branch from 8b78b77 to 7c25d5a on May 3, 2026 15:50
Comment on lines +124 to +125
// Map column-name -> originating TableReference (last writer wins
// when names collide; we accept that imprecision as a POC).
Contributor Author

This needs to be sorted out

@adriangb adriangb force-pushed the worktree-stats-request-api branch 3 times, most recently from cf15e66 to 9fdced1 on May 3, 2026 16:27
Stacked on top of the API-only commit. Adds the missing piece: a
small optimizer rule that walks the optimized logical plan and
populates `TableScan.statistics_requests` based on the surrounding
plan shape, plus a physical-planner hook that threads those into
`ScanArgs::with_statistics_requests`.

* `TableScan` gains `statistics_requests: Vec<StatisticsRequest>`
  (default empty) and a `with_statistics_requests` builder.

* New `RequestStatistics` `OptimizerRule` (registered last in the
  default pipeline). Walks the plan once, derives:
    Sort   → Min / Max / NullCount on each sort key
    Filter → Min / Max / NullCount / DistinctCount on referenced cols
    Join   → DistinctCount / NullCount on join keys (both sides)
    always → RowCount per scan
  Stable, deterministic ordering. Idempotent. Never reshapes the
  plan — only annotates `TableScan` nodes.

* `DefaultPhysicalPlanner` reads `scan.statistics_requests` and
  threads them into `ScanArgs::with_statistics_requests` when calling
  `provider.scan_with_args`.

* `ScanArgs::statistics_requests` field switched from
  `Option<&[StatisticsRequest]>` to `&[StatisticsRequest]` (empty
  slice = no requests; collapses two ways of saying the same thing).

* `request_statistics::tests` (3 unit tests) — confirm RowCount per
  scan, filter-column requests, join-key DistinctCount.
* `user_defined::statistics_requests` (2 e2e tests) — register a
  `RecordingTable` provider, run SQL through the full pipeline, assert
  the requests that reached `scan_with_args` match what the plan
  shape implies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the worktree-stats-request-api branch from 9fdced1 to 6514f89 on May 3, 2026 16:46
@asolimando
Member

Really interesting work @adriangb. This addresses an important gap and has the potential to unlock a lot of further development on the statistics front.

I've been working on several related pieces (ExpressionAnalyzer #21122, StatisticsRegistry #21483, StatisticsContext #21815) and wanted to offer some perspective on how they might connect.

On the ExpressionAnalyzer (#21122) relationship

Your framing of "orthogonal with overlap" is very convincing to me, but I wonder if the split might be even cleaner than presented. Keeping StatisticsRequest column-based (or extending it to nested paths like get_field where the data source genuinely has the stat) seems like the natural boundary. For the JOIN ON coalesce(t1.col, '') = t2.col example you cite, one option would be to request stats on t1.col and let ExpressionAnalyzer derive how coalesce transforms them, rather than pushing expression semantics into the provider, which might have its own very different notion (a classic impedance mismatch in DB systems). Your RequestStatistics rule already seems to be doing this naturally: since add_column_refs extracts base columns and ignores expression structure, this feels like the right balance.

Sketches, advanced stats, and the broader statistics infrastructure

You mention "room to grow toward sketches, histograms, and selectivity stats later", which reminded me of the ExtendedStatistics mechanism introduced in the StatisticsRegistry (#21483), with exactly that use-case in mind. The StatisticsRegistry provides a way to override/enrich statistics for any operator via customizable StatisticsProvider (with existing built-in versions as base). ExtendedStatistics is its type-erased extension map that custom providers can populate and other custom ones can consume, without modifying core DataFusion types. It would be nice to think about how the request/response API from this PR could feed custom stats into ExtendedStatistics (e.g., a table provider that answers a StatisticsRequest with a sketch or histogram could surface that data where StatisticsProvider implementations can pick it up during propagation).

If StatisticsContext (#21815, in review) lands, this could find its natural home there. The longer-term idea is for StatisticsContext to also carry the ExpressionAnalyzer and StatisticsRegistry themselves, so that all statistics infrastructure (collection, expression-level derivation, operator-level propagation, and custom extensions) is accessible from a single place, opening the door for richer stats flowing end-to-end.

Happy to discuss in more detail if that would be useful!

@adriangb
Contributor Author

adriangb commented May 4, 2026

Thanks @asolimando !

Thinking about it a bit more, the axis that makes the split hard in DataFusion is that DataFusion often federates to, or is federated to by, other systems. E.g. if the TableProvider is a Postgres table, it's completely reasonable that there is overlap in functionality (Postgres also has a planner, stats system, etc.). But DataFusion also talks to dumb Parquet files...

rather than pushing expression semantics into the provider, which might have their own very different notion (classic impedance mismatch in DB systems)

I think the key / nice thing would be to allow the implementer to decide what it can and can't provide (which I've kind of attempted in this PR). The next step would be to let it provide statistics for some parts of expressions and not others (i.e. split up the expression tree).

But maybe it's okay to sidestep this for now and say "this change is compatible with these future directions but more immediately it solves some memory usage and usability issues".

@xudong963 xudong963 self-requested a review May 6, 2026 01:11