Parallelize `infer_schema` by Dandandan · Pull Request #19969 · apache/datafusion

Dandandan · 2026-01-24T12:41:25Z

Which issue does this PR close?

Closes #19970 (Parallelize infer_schema)

Rationale for this change

Speed up first-query / cold-start performance by doing more work in parallel.

Before (all in a single thread):

After:

(This part runs on multiple threads)

Note: I didn't run this in a quiet environment. Performance-wise, it seems to gain 10-20 ms per query.
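To illustrate the idea of moving per-file schema inference off a single thread, here is a minimal std-only sketch. It is not the code in this PR: `Schema`, `infer_file_schema`, `merge_schemas`, and both `infer_schema_*` functions are hypothetical stand-ins, and plain scoped threads are used in place of DataFusion's async machinery.

```rust
use std::collections::BTreeMap;
use std::thread;

/// Hypothetical stand-in for a schema: column name -> type name.
type Schema = BTreeMap<String, String>;

/// Hypothetical stand-in for the CPU-heavy per-file step
/// (for example, decoding file metadata into a schema).
fn infer_file_schema(file: &str) -> Schema {
    let mut schema = Schema::new();
    schema.insert("id".to_string(), "Int64".to_string());
    // Pretend each file contributes one extra column named after itself.
    schema.insert(format!("col_{file}"), "Utf8".to_string());
    schema
}

/// Merge per-file schemas into one; later files only add missing columns.
fn merge_schemas(schemas: Vec<Schema>) -> Schema {
    let mut merged = Schema::new();
    for schema in schemas {
        for (name, ty) in schema {
            merged.entry(name).or_insert(ty);
        }
    }
    merged
}

/// Single-threaded baseline: every file is processed on the calling thread.
fn infer_schema_serial(files: &[&str]) -> Schema {
    merge_schemas(files.iter().map(|f| infer_file_schema(f)).collect())
}

/// Parallel variant: one scoped thread per file, results joined in input order.
fn infer_schema_parallel(files: &[&str]) -> Schema {
    let schemas = thread::scope(|s| {
        // Spawn every worker first, then join, so the work overlaps.
        let handles: Vec<_> = files
            .iter()
            .map(|f| s.spawn(move || infer_file_schema(f)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<_>>()
    });
    merge_schemas(schemas)
}
```

Both variants should produce the same merged schema; only where the per-file work runs differs. One thread per file is the simplest choice for a sketch and would likely need a concurrency limit for large file counts.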

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Dandandan · 2026-01-24T12:41:33Z

run benchmarks

alamb-ghbot · 2026-01-24T12:41:40Z

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing infer_schema_parallel (4773b43) to b2c29ac diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb-ghbot · 2026-01-24T13:21:26Z

🤖: Benchmark completed

Details

Comparing HEAD and infer_schema_parallel
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query    ┃        HEAD ┃ infer_schema_parallel ┃    Change ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0 │  2447.71 ms │            2436.89 ms │ no change │
│ QQuery 1 │   954.86 ms │             970.89 ms │ no change │
│ QQuery 2 │  1909.56 ms │            1850.95 ms │ no change │
│ QQuery 3 │  1084.56 ms │            1048.17 ms │ no change │
│ QQuery 4 │  2239.63 ms │            2260.54 ms │ no change │
│ QQuery 5 │ 28453.34 ms │           28805.93 ms │ no change │
│ QQuery 6 │  3849.68 ms │            4010.33 ms │ no change │
│ QQuery 7 │  2839.77 ms │            2954.92 ms │ no change │
└──────────┴─────────────┴───────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 43779.11ms │
│ Total Time (infer_schema_parallel)   │ 44338.60ms │
│ Average Time (HEAD)                  │  5472.39ms │
│ Average Time (infer_schema_parallel) │  5542.33ms │
│ Queries Faster                       │          0 │
│ Queries Slower                       │          0 │
│ Queries with No Change               │          8 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ infer_schema_parallel ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │     1.91 ms │               1.97 ms │     no change │
│ QQuery 1  │    53.26 ms │              50.87 ms │     no change │
│ QQuery 2  │   131.58 ms │             139.83 ms │  1.06x slower │
│ QQuery 3  │   155.70 ms │             156.07 ms │     no change │
│ QQuery 4  │  1075.07 ms │            1102.60 ms │     no change │
│ QQuery 5  │  1368.77 ms │            1426.33 ms │     no change │
│ QQuery 6  │     1.89 ms │               1.87 ms │     no change │
│ QQuery 7  │    55.72 ms │              54.25 ms │     no change │
│ QQuery 8  │  1443.09 ms │            1469.81 ms │     no change │
│ QQuery 9  │  1742.32 ms │            1841.73 ms │  1.06x slower │
│ QQuery 10 │   355.42 ms │             346.88 ms │     no change │
│ QQuery 11 │   406.00 ms │             406.95 ms │     no change │
│ QQuery 12 │  1263.18 ms │            1310.53 ms │     no change │
│ QQuery 13 │  1991.95 ms │            1985.37 ms │     no change │
│ QQuery 14 │  1249.22 ms │            1295.65 ms │     no change │
│ QQuery 15 │  1226.40 ms │            1266.53 ms │     no change │
│ QQuery 16 │  2578.56 ms │            2603.79 ms │     no change │
│ QQuery 17 │  2551.25 ms │            2538.38 ms │     no change │
│ QQuery 18 │  5772.09 ms │            4917.28 ms │ +1.17x faster │
│ QQuery 19 │   132.68 ms │             124.88 ms │ +1.06x faster │
│ QQuery 20 │  1936.77 ms │            1963.60 ms │     no change │
│ QQuery 21 │  2204.89 ms │            2225.72 ms │     no change │
│ QQuery 22 │  3792.73 ms │            3810.56 ms │     no change │
│ QQuery 23 │ 16075.77 ms │           12159.14 ms │ +1.32x faster │
│ QQuery 24 │   222.95 ms │             227.07 ms │     no change │
│ QQuery 25 │   485.14 ms │             472.12 ms │     no change │
│ QQuery 26 │   223.89 ms │             211.98 ms │ +1.06x faster │
│ QQuery 27 │  2646.49 ms │            2669.08 ms │     no change │
│ QQuery 28 │ 23364.62 ms │           21799.94 ms │ +1.07x faster │
│ QQuery 29 │   941.95 ms │             966.24 ms │     no change │
│ QQuery 30 │  1270.28 ms │            1296.32 ms │     no change │
│ QQuery 31 │  1345.41 ms │            1335.55 ms │     no change │
│ QQuery 32 │  4489.16 ms │            4602.29 ms │     no change │
│ QQuery 33 │  5728.58 ms │            5637.27 ms │     no change │
│ QQuery 34 │  5802.08 ms │            6169.94 ms │  1.06x slower │
│ QQuery 35 │  1955.34 ms │            1943.52 ms │     no change │
│ QQuery 36 │    68.07 ms │              67.59 ms │     no change │
│ QQuery 37 │    47.64 ms │              45.55 ms │     no change │
│ QQuery 38 │    66.23 ms │              68.43 ms │     no change │
│ QQuery 39 │   104.05 ms │              99.83 ms │     no change │
│ QQuery 40 │    27.90 ms │              26.73 ms │     no change │
│ QQuery 41 │    24.16 ms │              24.21 ms │     no change │
│ QQuery 42 │    20.20 ms │              20.07 ms │     no change │
└───────────┴─────────────┴───────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 96400.39ms │
│ Total Time (infer_schema_parallel)   │ 90884.36ms │
│ Average Time (HEAD)                  │  2241.87ms │
│ Average Time (infer_schema_parallel) │  2113.59ms │
│ Queries Faster                       │          5 │
│ Queries Slower                       │          3 │
│ Queries with No Change               │         35 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      HEAD ┃ infer_schema_parallel ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 107.12 ms │             103.13 ms │     no change │
│ QQuery 2  │  34.97 ms │              33.04 ms │ +1.06x faster │
│ QQuery 3  │  40.26 ms │              36.30 ms │ +1.11x faster │
│ QQuery 4  │  33.59 ms │              31.04 ms │ +1.08x faster │
│ QQuery 5  │  91.39 ms │              91.68 ms │     no change │
│ QQuery 6  │  21.25 ms │              20.64 ms │     no change │
│ QQuery 7  │ 154.75 ms │             164.13 ms │  1.06x slower │
│ QQuery 8  │  43.79 ms │              41.55 ms │ +1.05x faster │
│ QQuery 9  │ 110.47 ms │             105.14 ms │     no change │
│ QQuery 10 │  67.99 ms │              66.86 ms │     no change │
│ QQuery 11 │  18.93 ms │              19.04 ms │     no change │
│ QQuery 12 │  52.44 ms │              51.51 ms │     no change │
│ QQuery 13 │  50.69 ms │              49.52 ms │     no change │
│ QQuery 14 │  16.05 ms │              15.06 ms │ +1.07x faster │
│ QQuery 15 │  30.44 ms │              30.17 ms │     no change │
│ QQuery 16 │  28.68 ms │              29.27 ms │     no change │
│ QQuery 17 │ 146.70 ms │             147.84 ms │     no change │
│ QQuery 18 │ 290.82 ms │             288.13 ms │     no change │
│ QQuery 19 │  39.81 ms │              38.37 ms │     no change │
│ QQuery 20 │  55.96 ms │              55.41 ms │     no change │
│ QQuery 21 │ 190.49 ms │             199.15 ms │     no change │
│ QQuery 22 │  22.46 ms │              23.05 ms │     no change │
└───────────┴───────────┴───────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 1649.05ms │
│ Total Time (infer_schema_parallel)   │ 1640.03ms │
│ Average Time (HEAD)                  │   74.96ms │
│ Average Time (infer_schema_parallel) │   74.55ms │
│ Queries Faster                       │         5 │
│ Queries Slower                       │         1 │
│ Queries with No Change               │        16 │
│ Queries with Failure                 │         0 │
└──────────────────────────────────────┴───────────┘

Dandandan · 2026-01-24T14:09:28Z

The benchmark doesn't reflect the improvement, as it runs each query 5 times and reports the lowest time (the result is cached after the first run).
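A small simulation may make this concrete. It is only a sketch: `Table`, `run_query`, and `bench` are hypothetical names, and the millisecond costs are made-up inputs rather than measurements from this PR.

```rust
/// Hypothetical table whose schema is inferred lazily and then cached.
struct Table {
    cached_schema: Option<String>,
}

impl Table {
    /// Returns the simulated cost (in ms) of running one query.
    /// `infer_ms` is only paid while the cache is empty.
    fn run_query(&mut self, infer_ms: u64, exec_ms: u64) -> u64 {
        let mut cost = exec_ms;
        if self.cached_schema.is_none() {
            self.cached_schema = Some("schema".to_string());
            cost += infer_ms;
        }
        cost
    }
}

/// Run `iterations` queries against one table and return every simulated timing.
fn bench(infer_ms: u64, exec_ms: u64, iterations: usize) -> Vec<u64> {
    let mut table = Table { cached_schema: None };
    (0..iterations)
        .map(|_| table.run_query(infer_ms, exec_ms))
        .collect()
}
```

With these assumed costs, `bench(20, 100, 5)` and `bench(5, 100, 5)` differ only in their first element, so the minimum over the five runs is the same for both. A cold-start improvement would only show up if the first iteration were reported separately.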

…or_scan`

Wraps the metadata fetch + statistics/ordering extraction inside `ParquetFormat::infer_stats_and_ordering` in `SpawnedTask::spawn_blocking` with `handle.block_on` so each call runs on a separate worker thread. `list_files_for_scan` already drives this via `.buffer_unordered(meta_fetch_concurrency)`, so concurrent per-file work now actually runs in parallel across threads instead of serializing the CPU-heavy parquet metadata decode/merge onto a single async task.

Follows the same pattern as apache#19969 (parallelizing `infer_schema`). Part of apache#19971.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
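The pattern described in this commit message (a bounded number of per-file tasks, each doing its CPU-heavy work on its own worker thread, with results arriving in completion order) can be approximated with the standard library alone. The sketch below is an analogy, not the DataFusion implementation: it uses a shared queue and scoped threads instead of `SpawnedTask::spawn_blocking` and `.buffer_unordered(...)`, and `decode_metadata` and `process_bounded` are hypothetical names.

```rust
use std::collections::VecDeque;
use std::sync::{mpsc, Mutex};
use std::thread;

/// Hypothetical stand-in for CPU-heavy per-file metadata decoding.
fn decode_metadata(file: &str) -> usize {
    file.len()
}

/// Process `files` on at most `concurrency` worker threads.
/// Results arrive in completion order, not input order.
fn process_bounded(files: &[&str], concurrency: usize) -> Vec<(String, usize)> {
    let queue: Mutex<VecDeque<&str>> = Mutex::new(files.iter().copied().collect());
    let (tx, rx) = mpsc::channel();

    thread::scope(|s| {
        for _ in 0..concurrency.max(1) {
            let tx = tx.clone();
            let queue = &queue;
            s.spawn(move || loop {
                // Take the next file; the lock is released before decoding.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(file) => tx.send((file.to_string(), decode_metadata(file))).unwrap(),
                    None => break,
                }
            });
        }
    });

    // Drop the original sender so the receiver iterator terminates.
    drop(tx);
    rx.into_iter().collect()
}
```

For example, `process_bounded(&["a.parquet", "bb.parquet"], 2)` returns one `(file, value)` pair per input file, in whatever order the workers finish. The `concurrency` parameter plays the role that `meta_fetch_concurrency` plays in the text above.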

# Conflicts: # datafusion/datasource-parquet/src/file_format.rs

Parallelize infer_schema

4773b43

github-actions bot added the `datasource` (Changes to the datasource crate) label on Jan 24, 2026

Dandandan mentioned this pull request Jan 25, 2026

Parallelize list_files_for_scan #19971

Open

Dandandan added 3 commits January 26, 2026 09:30

Improve parallelism

7596392

WIP

2888fcf

WIP

536cd99

Dandandan mentioned this pull request Apr 17, 2026

perf: parallelize CPU-heavy parquet metadata parsing in list_files_for_scan #21692

Closed

Dandandan marked this pull request as ready for review April 17, 2026 10:14

Merge remote-tracking branch 'upstream/main' into infer_schema_parallel

d329d9a

# Conflicts: # datafusion/datasource-parquet/src/file_format.rs

Dandandan wants to merge 5 commits into apache:main from Dandandan:infer_schema_parallel


2 participants
