Commit 8b5c9b4
feat: generate reversed-name data for sort pushdown benchmark (#21266)
## Which issue does this PR close?
- Related to #17348
- Precursor to #21182
## Rationale for this change
The sort pushdown benchmark (#21213) uses TPC-H data whose file names
happen to match sort key order, so the optimization in #21182 shows no
difference vs. main (see the comment on #21182).
This PR generates custom benchmark data with **reversed file names** so
the sort pushdown optimizer must reorder files by statistics to achieve
sort elimination.
## What changes are included in this PR?
Updated `data_sort_pushdown` in `bench.sh` to run `tpchgen-cli` with
`--parts=3` and then rename the generated files:
```
tpchgen produces 3 sorted, non-overlapping parquet files:
lineitem.1.parquet: l_orderkey 1 ~ 2M (lowest keys)
lineitem.2.parquet: l_orderkey 2M ~ 4M
lineitem.3.parquet: l_orderkey 4M ~ 6M (highest keys)
Renamed so alphabetical order is reversed vs key order:
a_part3.parquet → highest keys, sorts first alphabetically
b_part2.parquet
c_part1.parquet → lowest keys, sorts last alphabetically
```
No datafusion-cli needed — just `tpchgen-cli` + `mv`.
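The rename step can be sketched roughly as follows. This is an illustration only, not the code in `bench.sh`: the scratch directory is hypothetical, and empty placeholder files stand in for the real `tpchgen-cli` output so the snippet runs on its own.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory; the real data path used by bench.sh is not shown here.
data_dir="$(mktemp -d)"

# Placeholders for the three generated files (assumption: empty stand-ins, not real parquet).
for i in 1 2 3; do
    : > "${data_dir}/lineitem.${i}.parquet"
done

# Rename so that alphabetical order is the reverse of key order.
mv "${data_dir}/lineitem.3.parquet" "${data_dir}/a_part3.parquet"  # highest keys, sorts first
mv "${data_dir}/lineitem.2.parquet" "${data_dir}/b_part2.parquet"
mv "${data_dir}/lineitem.1.parquet" "${data_dir}/c_part1.parquet"  # lowest keys, sorts last

# Glob expansion is sorted by name, which mirrors a name-ordered file listing.
files=("${data_dir}"/*.parquet)
first="$(basename "${files[0]}")"
last="$(basename "${files[2]}")"
echo "first by name: ${first}, last by name: ${last}"
```

After the rename, a reader that lists files by name sees the highest-key file first, so the correct order can only be recovered from file statistics.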
## Benchmark Results
Main vs. the #21182 optimization (release build, 6M rows, single partition):
- **On main (no optimization)**: files read in alphabetical order
`[a_part3, b_part2, c_part1]` → wrong key order → SortExec stays
- **With optimization**: files reordered by statistics `[c_part1, b_part2,
a_part3]` → non-overlapping and in key order → SortExec eliminated
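The idea of reordering by statistics can be illustrated with a small sketch. It is not the optimizer's implementation: the min/max values below are assumed round numbers based on the approximate key ranges listed earlier, and the overlap test is a simplified criterion chosen for this example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# name, min l_orderkey, max l_orderkey, listed in alphabetical (name) order.
# Assumption: illustrative round values, not statistics read from real files.
stats='a_part3 4000001 6000000
b_part2 2000001 4000000
c_part1 1 2000000'

# Order files by their minimum key instead of by name.
sorted="$(printf '%s\n' "${stats}" | sort -k2,2n)"

# Treat the files as non-overlapping if no file starts before the previous one ends.
non_overlapping="$(printf '%s\n' "${sorted}" | awk '
    NR > 1 && $2 < prev_max { overlap = 1 }
    { prev_max = $3 }
    END { print (overlap ? "no" : "yes") }')"

# Collect the resulting scan order on one line.
order="$(printf '%s\n' "${sorted}" | awk '{ printf "%s%s", (NR > 1 ? " " : ""), $1 }')"
echo "scan order: ${order}"
echo "non-overlapping: ${non_overlapping}"
```

When the files are non-overlapping and scanned in this order, their concatenation is already sorted on the key, which is the condition under which a separate sort step becomes unnecessary.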
| Query | Description | Main (ms) | PR #21182 (ms) | Time reduction |
|-------|-------------|-----------|----------------|---------|
| Q1 | `ORDER BY ASC` (full scan) | 259 | 122 | **53%** |
| Q2 | `ORDER BY ASC LIMIT 100` | 80 | 9 | **89%** |
| Q3 | `SELECT * ORDER BY ASC` | 700 | 353 | **50%** |
| Q4 | `SELECT * LIMIT 100` | 342 | 24 | **93%** |
LIMIT queries show the biggest improvement because sort elimination +
limit pushdown means only the first ~100 rows are read before stopping.
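The percentage column in the table is consistent with the relative reduction in time, (main - PR) / main, rounded to a whole percent. A quick check of that arithmetic, using only the timings from the table:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Timings copied from the table: query, main (ms), PR #21182 (ms).
timings='Q1 259 122
Q2 80 9
Q3 700 353
Q4 342 24'

# Reduction in time relative to main, rounded to a whole percent.
reductions="$(printf '%s\n' "${timings}" \
    | awk '{ printf "%s %.0f%%\n", $1, ($2 - $3) / $2 * 100 }')"
echo "${reductions}"
```

This yields 53%, 89%, 50%, and 93% for Q1 through Q4, matching the table. Expressed as a ratio instead, Q1 would be roughly 2.1x faster (259 / 122).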
## Test plan
- [x] `cargo clippy -p datafusion-benchmarks` — 0 warnings
- [x] Local benchmark verified with reversed-name data
🤖 Generated with [Claude Code](https://claude.com/claude-code)

1 parent ba873e0, commit 8b5c9b4
1 file changed
Lines changed: 43 additions & 6 deletions