Parallelize infer_schema #19969

Open
Dandandan wants to merge 5 commits into apache:main from Dandandan:infer_schema_parallel

@Dandandan (Contributor) commented Jan 24, 2026

Which issue does this PR close?

Rationale for this change

Speed up first-query / cold-start performance by doing more work in parallel.

Before (all in a single thread):

[image]

After (this part runs on multiple threads):

[image]

Note: I did not run this in a quiet environment. Performance-wise, it seems to gain 10-20 ms per query.
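As a rough illustration of the idea (not the code in this PR), the sketch below infers each file's schema on its own thread and then merges the results. It uses only the Rust standard library; `infer_file_schema`, the column lists, and the simulated 5 ms decode cost are hypothetical stand-ins, and DataFusion itself works with async tasks rather than raw threads.

```rust
use std::collections::BTreeSet;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for inferring one file's schema: returns the
/// column names of that file after a simulated decode cost.
fn infer_file_schema(columns: &[&str]) -> BTreeSet<String> {
    thread::sleep(Duration::from_millis(5)); // pretend metadata decode
    columns.iter().map(|c| c.to_string()).collect()
}

/// Infer every file's schema on its own scoped thread, then merge the
/// per-file results into one set of column names.
fn infer_schema_parallel(files: &[Vec<&str>]) -> BTreeSet<String> {
    thread::scope(|s| {
        // Spawn all threads first so the per-file work overlaps.
        let handles: Vec<_> = files
            .iter()
            .map(|f| s.spawn(move || infer_file_schema(f)))
            .collect();
        // Join in order and merge; BTreeSet removes duplicate columns.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("inference thread panicked"))
            .collect::<BTreeSet<String>>()
    })
}

fn main() {
    let files = vec![vec!["id", "name"], vec!["id", "ts"], vec!["name", "url"]];
    let schema = infer_schema_parallel(&files);
    println!("merged columns: {:?}", schema);
}
```

With one thread per file, wall-clock time approaches the cost of the slowest file rather than the sum over all files, which is the effect the two profiles above are meant to show.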

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@Dandandan (Contributor, Author)

run benchmarks

@github-actions bot added the datasource label (Changes to the datasource crate) on Jan 24, 2026
@alamb-ghbot

🤖 ./gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing infer_schema_parallel (4773b43) to b2c29ac diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and infer_schema_parallel
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query    ┃        HEAD ┃ infer_schema_parallel ┃    Change ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0 │  2447.71 ms │            2436.89 ms │ no change │
│ QQuery 1 │   954.86 ms │             970.89 ms │ no change │
│ QQuery 2 │  1909.56 ms │            1850.95 ms │ no change │
│ QQuery 3 │  1084.56 ms │            1048.17 ms │ no change │
│ QQuery 4 │  2239.63 ms │            2260.54 ms │ no change │
│ QQuery 5 │ 28453.34 ms │           28805.93 ms │ no change │
│ QQuery 6 │  3849.68 ms │            4010.33 ms │ no change │
│ QQuery 7 │  2839.77 ms │            2954.92 ms │ no change │
└──────────┴─────────────┴───────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 43779.11ms │
│ Total Time (infer_schema_parallel)   │ 44338.60ms │
│ Average Time (HEAD)                  │  5472.39ms │
│ Average Time (infer_schema_parallel) │  5542.33ms │
│ Queries Faster                       │          0 │
│ Queries Slower                       │          0 │
│ Queries with No Change               │          8 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ infer_schema_parallel ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │     1.91 ms │               1.97 ms │     no change │
│ QQuery 1  │    53.26 ms │              50.87 ms │     no change │
│ QQuery 2  │   131.58 ms │             139.83 ms │  1.06x slower │
│ QQuery 3  │   155.70 ms │             156.07 ms │     no change │
│ QQuery 4  │  1075.07 ms │            1102.60 ms │     no change │
│ QQuery 5  │  1368.77 ms │            1426.33 ms │     no change │
│ QQuery 6  │     1.89 ms │               1.87 ms │     no change │
│ QQuery 7  │    55.72 ms │              54.25 ms │     no change │
│ QQuery 8  │  1443.09 ms │            1469.81 ms │     no change │
│ QQuery 9  │  1742.32 ms │            1841.73 ms │  1.06x slower │
│ QQuery 10 │   355.42 ms │             346.88 ms │     no change │
│ QQuery 11 │   406.00 ms │             406.95 ms │     no change │
│ QQuery 12 │  1263.18 ms │            1310.53 ms │     no change │
│ QQuery 13 │  1991.95 ms │            1985.37 ms │     no change │
│ QQuery 14 │  1249.22 ms │            1295.65 ms │     no change │
│ QQuery 15 │  1226.40 ms │            1266.53 ms │     no change │
│ QQuery 16 │  2578.56 ms │            2603.79 ms │     no change │
│ QQuery 17 │  2551.25 ms │            2538.38 ms │     no change │
│ QQuery 18 │  5772.09 ms │            4917.28 ms │ +1.17x faster │
│ QQuery 19 │   132.68 ms │             124.88 ms │ +1.06x faster │
│ QQuery 20 │  1936.77 ms │            1963.60 ms │     no change │
│ QQuery 21 │  2204.89 ms │            2225.72 ms │     no change │
│ QQuery 22 │  3792.73 ms │            3810.56 ms │     no change │
│ QQuery 23 │ 16075.77 ms │           12159.14 ms │ +1.32x faster │
│ QQuery 24 │   222.95 ms │             227.07 ms │     no change │
│ QQuery 25 │   485.14 ms │             472.12 ms │     no change │
│ QQuery 26 │   223.89 ms │             211.98 ms │ +1.06x faster │
│ QQuery 27 │  2646.49 ms │            2669.08 ms │     no change │
│ QQuery 28 │ 23364.62 ms │           21799.94 ms │ +1.07x faster │
│ QQuery 29 │   941.95 ms │             966.24 ms │     no change │
│ QQuery 30 │  1270.28 ms │            1296.32 ms │     no change │
│ QQuery 31 │  1345.41 ms │            1335.55 ms │     no change │
│ QQuery 32 │  4489.16 ms │            4602.29 ms │     no change │
│ QQuery 33 │  5728.58 ms │            5637.27 ms │     no change │
│ QQuery 34 │  5802.08 ms │            6169.94 ms │  1.06x slower │
│ QQuery 35 │  1955.34 ms │            1943.52 ms │     no change │
│ QQuery 36 │    68.07 ms │              67.59 ms │     no change │
│ QQuery 37 │    47.64 ms │              45.55 ms │     no change │
│ QQuery 38 │    66.23 ms │              68.43 ms │     no change │
│ QQuery 39 │   104.05 ms │              99.83 ms │     no change │
│ QQuery 40 │    27.90 ms │              26.73 ms │     no change │
│ QQuery 41 │    24.16 ms │              24.21 ms │     no change │
│ QQuery 42 │    20.20 ms │              20.07 ms │     no change │
└───────────┴─────────────┴───────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 96400.39ms │
│ Total Time (infer_schema_parallel)   │ 90884.36ms │
│ Average Time (HEAD)                  │  2241.87ms │
│ Average Time (infer_schema_parallel) │  2113.59ms │
│ Queries Faster                       │          5 │
│ Queries Slower                       │          3 │
│ Queries with No Change               │         35 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      HEAD ┃ infer_schema_parallel ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 107.12 ms │             103.13 ms │     no change │
│ QQuery 2  │  34.97 ms │              33.04 ms │ +1.06x faster │
│ QQuery 3  │  40.26 ms │              36.30 ms │ +1.11x faster │
│ QQuery 4  │  33.59 ms │              31.04 ms │ +1.08x faster │
│ QQuery 5  │  91.39 ms │              91.68 ms │     no change │
│ QQuery 6  │  21.25 ms │              20.64 ms │     no change │
│ QQuery 7  │ 154.75 ms │             164.13 ms │  1.06x slower │
│ QQuery 8  │  43.79 ms │              41.55 ms │ +1.05x faster │
│ QQuery 9  │ 110.47 ms │             105.14 ms │     no change │
│ QQuery 10 │  67.99 ms │              66.86 ms │     no change │
│ QQuery 11 │  18.93 ms │              19.04 ms │     no change │
│ QQuery 12 │  52.44 ms │              51.51 ms │     no change │
│ QQuery 13 │  50.69 ms │              49.52 ms │     no change │
│ QQuery 14 │  16.05 ms │              15.06 ms │ +1.07x faster │
│ QQuery 15 │  30.44 ms │              30.17 ms │     no change │
│ QQuery 16 │  28.68 ms │              29.27 ms │     no change │
│ QQuery 17 │ 146.70 ms │             147.84 ms │     no change │
│ QQuery 18 │ 290.82 ms │             288.13 ms │     no change │
│ QQuery 19 │  39.81 ms │              38.37 ms │     no change │
│ QQuery 20 │  55.96 ms │              55.41 ms │     no change │
│ QQuery 21 │ 190.49 ms │             199.15 ms │     no change │
│ QQuery 22 │  22.46 ms │              23.05 ms │     no change │
└───────────┴───────────┴───────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 1649.05ms │
│ Total Time (infer_schema_parallel)   │ 1640.03ms │
│ Average Time (HEAD)                  │   74.96ms │
│ Average Time (infer_schema_parallel) │   74.55ms │
│ Queries Faster                       │         5 │
│ Queries Slower                       │         1 │
│ Queries with No Change               │        16 │
│ Queries with Failure                 │         0 │
└──────────────────────────────────────┴───────────┘
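For reading the Change column: the labels in the tables above are consistent with comparing the ratio of the branch time to the HEAD time against a 5% noise band. The sketch below reproduces that labelling for the printed values. The threshold and the function name `classify` are assumptions inferred from the rows, not taken from the benchmark script, which may work from unrounded timings.

```rust
/// Label a timing pair the way the "Change" column appears to.
/// ratio = candidate / baseline; anything within the assumed 5% band
/// around 1.0 is reported as "no change".
fn classify(baseline_ms: f64, candidate_ms: f64) -> String {
    const NOISE: f64 = 0.05; // assumption inferred from the tables
    let ratio = candidate_ms / baseline_ms;
    if ratio >= 1.0 - NOISE && ratio <= 1.0 + NOISE {
        "no change".to_string()
    } else if ratio < 1.0 {
        // Faster: report how many times faster the candidate is.
        format!("+{:.2}x faster", 1.0 / ratio)
    } else {
        format!("{:.2}x slower", ratio)
    }
}

fn main() {
    // Values copied from the tables above (HEAD, infer_schema_parallel).
    for (name, head, branch) in [
        ("clickbench_partitioned Q23", 16075.77, 12159.14),
        ("clickbench_partitioned Q34", 5802.08, 6169.94),
        ("tpch_mem Q8", 43.79, 41.55),
        ("tpch_mem Q9", 110.47, 105.14),
    ] {
        println!("{name}: {}", classify(head, branch));
    }
}
```

Under this reading, tpch_mem Q8 and Q9 land on opposite sides of the band even though both improved by roughly 5%, so labels near the threshold are best treated as noise.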

@Dandandan (Contributor, Author)

The benchmark doesn't reflect the improvement: it runs 5 times and reports the lowest time, and the result is cached after the first run, so the cold-start cost never shows up in the reported number.
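A small sketch of why a minimum over several runs hides a first-run improvement. The timings are made-up numbers for illustration only, not measurements from this PR, and `simulate_runs` / `reported_time` are hypothetical helpers.

```rust
/// Hypothetical per-iteration timings in milliseconds: only the first
/// run pays the one-time cold-start cost; later runs hit the cache.
fn simulate_runs(cold_start_ms: f64, steady_ms: f64, iterations: usize) -> Vec<f64> {
    (0..iterations)
        .map(|i| if i == 0 { cold_start_ms + steady_ms } else { steady_ms })
        .collect()
}

/// The statistic described above: the lowest time across all iterations.
fn reported_time(runs: &[f64]) -> f64 {
    runs.iter().copied().fold(f64::INFINITY, f64::min)
}

fn main() {
    // Assumed numbers: the cold-start cost drops from 30 ms to 15 ms,
    // while the steady-state query time stays at 100 ms.
    let before = simulate_runs(30.0, 100.0, 5);
    let after = simulate_runs(15.0, 100.0, 5);
    println!("first run: {} ms -> {} ms", before[0], after[0]);
    println!(
        "reported:  {} ms -> {} ms",
        reported_time(&before),
        reported_time(&after)
    );
}
```

The first-run times differ, but the reported minimum is the same in both cases, so measuring this change would need a benchmark that times the first iteration separately.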

Dandandan added a commit to Dandandan/arrow-datafusion that referenced this pull request Apr 17, 2026
…or_scan`

Wraps the metadata fetch + statistics/ordering extraction inside
`ParquetFormat::infer_stats_and_ordering` in `SpawnedTask::spawn_blocking`
with `handle.block_on` so each call runs on a separate worker thread.

`list_files_for_scan` already drives this via
`.buffer_unordered(meta_fetch_concurrency)`, so concurrent per-file work
now actually runs in parallel across threads instead of serializing the
CPU-heavy parquet metadata decode/merge onto a single async task.

Follows the same pattern as apache#19969 (parallelizing `infer_schema`).

Part of apache#19971.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
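The pattern in the commit message above (a bounded number of in-flight per-file jobs, each running on a worker thread) can be approximated with the standard library alone. This is a loose analogy, not the DataFusion code: scoped threads stand in for `SpawnedTask::spawn_blocking`, a shared atomic counter stands in for `.buffer_unordered(meta_fetch_concurrency)`, and `decode_metadata` and `process_bounded` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical stand-in for the CPU-heavy metadata decode of one file.
fn decode_metadata(file_size: u64) -> u64 {
    (0..file_size).fold(0u64, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

/// Process `files` with at most `concurrency` worker threads. Each worker
/// claims the next unprocessed index from a shared counter, so results
/// come back in no particular order.
fn process_bounded(files: &[u64], concurrency: usize) -> Vec<(usize, u64)> {
    let next = AtomicUsize::new(0);
    let next_ref = &next;
    thread::scope(|s| {
        let workers: Vec<_> = (0..concurrency.max(1))
            .map(|_| {
                s.spawn(move || {
                    let mut done = Vec::new();
                    loop {
                        // Claim the next file index; stop when none remain.
                        let i = next_ref.fetch_add(1, Ordering::Relaxed);
                        if i >= files.len() {
                            break;
                        }
                        done.push((i, decode_metadata(files[i])));
                    }
                    done
                })
            })
            .collect();
        workers
            .into_iter()
            .flat_map(|w| w.join().expect("worker panicked"))
            .collect::<Vec<_>>()
    })
}

fn main() {
    let files = [1_000u64, 2_000, 3_000, 4_000, 5_000];
    let mut results = process_bounded(&files, 2);
    results.sort(); // restore file order for display
    println!("{:?}", results);
}
```

The concurrency limit caps how many files are decoded at once, while the decode itself runs off the coordinating thread, which is the distinction the commit message draws between concurrent and actually parallel work.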
@Dandandan marked this pull request as ready for review April 17, 2026 10:14
# Conflicts:
#	datafusion/datasource-parquet/src/file_format.rs
Development

Successfully merging this pull request may close these issues.

Parallelize infer_schema

2 participants