
perf: parallelize CPU-heavy parquet metadata parsing in list_files_for_scan#21692

Closed
Dandandan wants to merge 2 commits into apache:main from Dandandan:parallelize-list-files-for-scan

Conversation

@Dandandan
Contributor

Which issue does this PR close?

Part of #19971.

Rationale for this change

On cold runs, `list_files_for_scan` bottlenecks on a single thread: the heavy part is not file-listing IO but the CPU work inside parquet metadata decoding, merging, and statistics extraction. Profiling shows all of this collapsing onto one async task, even though `list_files_for_scan` already drives per-file work with `.buffer_unordered(meta_fetch_concurrency)`. `buffer_unordered` polls its futures concurrently on a single task, so CPU-bound futures effectively run one after another.

What changes are included in this PR?

Wrap the metadata fetch and statistics/ordering extraction in `ParquetFormat::infer_stats_and_ordering` in `SpawnedTask::spawn_blocking(move || handle.block_on(...))`, so each call runs on its own worker thread. Combined with the existing `buffer_unordered(meta_fetch_concurrency)` in `list_files_for_scan`, per-file work now runs in parallel for real.

This follows the same pattern as #19969 (parallelizing infer_schema).

Trait signatures are unchanged; only the parquet implementation is touched.

Are these changes tested?

Covered by existing parquet / catalog-listing tests (cargo test -p datafusion-datasource-parquet, cargo test -p datafusion-catalog-listing).

Are there any user-facing changes?

No API changes. Cold-start listing of parquet tables with many files should be noticeably faster on multi-core systems.

…or_scan`

Wraps the metadata fetch + statistics/ordering extraction inside
`ParquetFormat::infer_stats_and_ordering` in `SpawnedTask::spawn_blocking`
with `handle.block_on` so each call runs on a separate worker thread.

`list_files_for_scan` already drives this via
`.buffer_unordered(meta_fetch_concurrency)`, so concurrent per-file work
now actually runs in parallel across threads instead of serializing the
CPU-heavy parquet metadata decode/merge onto a single async task.

Follows the same pattern as apache#19969 (parallelizing `infer_schema`).

Part of apache#19971.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the datasource (Changes to the datasource crate) label Apr 17, 2026
Dandandan marked this pull request as draft April 17, 2026 10:30
Dandandan closed this Apr 17, 2026