
perf: parallelize CPU-heavy parquet metadata parsing in list_files_for_scan#21692

Closed
Dandandan wants to merge 2 commits into apache:main from Dandandan:parallelize-list-files-for-scan

Conversation

@Dandandan
Contributor

Which issue does this PR close?

Part of #19971.

Rationale for this change

On cold runs, `list_files_for_scan` bottlenecks on a single thread: the heavy part is not file-listing IO but the CPU work inside parquet metadata decoding, merging, and statistics extraction. Profiling shows all of this collapsing onto one async task, even though `list_files_for_scan` already drives per-file work with `.buffer_unordered(meta_fetch_concurrency)`. `buffer_unordered` polls its futures concurrently on a single task, so CPU-bound futures effectively run one after another.

What changes are included in this PR?

Wrap the metadata fetch and statistics/ordering extraction in `ParquetFormat::infer_stats_and_ordering` in `SpawnedTask::spawn_blocking(move || handle.block_on(...))`, so each call runs on its own worker thread. Combined with the existing `buffer_unordered(meta_fetch_concurrency)` in `list_files_for_scan`, per-file work now runs in parallel for real.

This follows the same pattern as #19969 (parallelizing infer_schema).

Trait signatures are unchanged; only the parquet implementation is touched.

Are these changes tested?

Covered by existing parquet / catalog-listing tests (cargo test -p datafusion-datasource-parquet, cargo test -p datafusion-catalog-listing).

Are there any user-facing changes?

No API changes. Cold-start listing of parquet tables with many files should be noticeably faster on multi-core systems.

…or_scan`

Wraps the metadata fetch + statistics/ordering extraction inside
`ParquetFormat::infer_stats_and_ordering` in `SpawnedTask::spawn_blocking`
with `handle.block_on` so each call runs on a separate worker thread.

`list_files_for_scan` already drives this via
`.buffer_unordered(meta_fetch_concurrency)`, so concurrent per-file work
now actually runs in parallel across threads instead of serializing the
CPU-heavy parquet metadata decode/merge onto a single async task.

Follows the same pattern as apache#19969 (parallelizing `infer_schema`).

Part of apache#19971.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the datasource (Changes to the datasource crate) label Apr 17, 2026
Dandandan marked this pull request as draft April 17, 2026 10:30
Dandandan closed this Apr 17, 2026