fix: improve sort pushdown benchmark data and add DESC LIMIT queries (#21711)
## Which issue does this PR close?
Related to #21580
## Rationale for this change
The sort pushdown benchmark had two problems:
1. **Broken data generation**: With the single-file `ORDER BY` approach,
the parquet writer merged rows from adjacent chunks at row group (RG)
boundaries, widening RG ranges to ~6M. The per-file split fix gave each
file only 1 RG, so `reorder_by_statistics` (an intra-file optimization)
had nothing to reorder.
2. **Missing DESC LIMIT queries**: The `sort_pushdown` benchmark only
had ASC queries (sort elimination). No queries tested the reverse scan +
TopK path (Inexact sort pushdown), which is where RG reorder, stats
init, and cumulative pruning provide a 20-58x improvement.
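
To illustrate why scrambled row groups with narrow ranges matter for the DESC LIMIT path, here is a toy model of reordering row groups by their max statistic and stopping once the TopK threshold rules out the rest. It is a sketch of the general idea only, not DataFusion's implementation: `RowGroup` and `top_k_desc` are hypothetical names, and in-memory tuples stand in for Parquet data.

```python
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class RowGroup:
    """Toy stand-in for a Parquet row group (hypothetical, for illustration)."""
    values: tuple

    @property
    def min(self):
        return min(self.values)

    @property
    def max(self):
        return max(self.values)


def top_k_desc(row_groups, k):
    """Return the k largest keys and the number of row groups actually read."""
    # Reorder by statistics: visit row groups with the largest max first.
    ordered = sorted(row_groups, key=lambda rg: rg.max, reverse=True)
    heap = []  # min-heap holding the current top-k candidates
    read = 0
    for rg in ordered:
        # Once k rows are held, heap[0] acts as the threshold. A row group
        # whose max does not beat it cannot contribute, and neither can any
        # later one, because the groups are sorted by max.
        if len(heap) == k and rg.max <= heap[0]:
            break
        read += 1
        for v in rg.values:
            if len(heap) < k:
                heapq.heappush(heap, v)
            elif v > heap[0]:
                heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True), read


# Four row groups with narrow, non-overlapping ranges in scrambled order.
scrambled = [
    RowGroup(tuple(range(200, 300))),
    RowGroup(tuple(range(0, 100))),
    RowGroup(tuple(range(300, 400))),
    RowGroup(tuple(range(100, 200))),
]
top, read = top_k_desc(scrambled, k=10)
print(top[0], top[-1], read)  # prints "399 390 1"
```

In this model only one of the four row groups is read for `LIMIT 10`. If every file held a single row group, there would be no intra-file order to exploit, which matches the problem described in item 1.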
## What changes are included in this PR?
### 1. Fix benchmark data generation
Generate **multiple files with multiple scrambled RGs each**:
- `inexact`: 3 files x ~20 RGs each
- `overlap`: 5 files x ~12 RGs each
Uses pyarrow to redistribute RGs from a sorted temp file into multiple
output files with scrambled RG order. Each RG has a narrow `l_orderkey`
range (~100K) but appears in scrambled order within its file.
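
The redistribution step could look roughly like the sketch below. It is not the script from this PR: the function names, the round-robin assignment of row groups to files, and the fixed seed are all assumptions made for illustration. The pyarrow calls used (`ParquetFile`, `read_row_group`, `ParquetWriter.write_table`) are standard, and pyarrow is imported lazily so the planning logic can be run without it.

```python
import random


def plan_scrambled_files(num_row_groups, num_files, seed=0):
    """Assign sorted row-group indices to files, then scramble each file's order.

    Hypothetical helper: round-robin assignment is an assumption, not
    necessarily what the benchmark script does.
    """
    rng = random.Random(seed)
    # Round-robin so that every file spans the whole key range.
    plan = [list(range(f, num_row_groups, num_files)) for f in range(num_files)]
    for rg_ids in plan:
        rng.shuffle(rg_ids)  # scramble row-group order within the file
    return plan


def write_scrambled(sorted_path, out_paths, seed=0):
    """Copy row groups from a sorted Parquet file into out_paths per the plan."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real I/O

    src = pq.ParquetFile(sorted_path)
    plan = plan_scrambled_files(src.num_row_groups, len(out_paths), seed)
    for path, rg_ids in zip(out_paths, plan):
        with pq.ParquetWriter(path, src.schema_arrow) as writer:
            for i in rg_ids:
                # Each write_table call emits one row group as long as the
                # table fits within the writer's row group size limit.
                writer.write_table(src.read_row_group(i))


# 60 source row groups split across 3 files gives 20 row groups per file,
# in line with the "3 files x ~20 RGs" layout described above.
plan = plan_scrambled_files(60, 3)
print([len(rg_ids) for rg_ids in plan])  # prints "[20, 20, 20]"
```

Because each row group is copied whole, its narrow `l_orderkey` range is preserved. Only the order of row groups inside each output file changes.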
### 2. Add DESC LIMIT queries to sort_pushdown benchmark
New q5-q8 for `sort_pushdown` (sorted data, `WITH ORDER`):
| Query | Description |
|-------|-------------|
| q5 | `ORDER BY l_orderkey DESC LIMIT 100` (narrow projection) |
| q6 | `ORDER BY l_orderkey DESC LIMIT 1000` (narrow projection) |
| q7 | `SELECT * ORDER BY l_orderkey DESC LIMIT 100` (wide projection) |
| q8 | `SELECT * ORDER BY l_orderkey DESC LIMIT 1000` (wide projection) |
These test the Inexact sort pushdown path: reverse scan + TopK + dynamic
filter, which benefits from the optimizations in #21580.
## Are these changes tested?
Benchmark changes only. Verified locally:
- Data generation produces correct multi-file multi-RG output
- DESC LIMIT queries return correct results
- q5-q8 show 20-58x improvement with #21580 optimizations
## Are there any user-facing changes?
No. Adds pyarrow as a dependency for generating benchmark datasets
(`pip install pyarrow`).
benchmarks/README.md (41 additions, 0 deletions)
@@ -900,3 +900,44 @@ This command will:
 ```
 
 This runs queries against the pre-sorted dataset with the `--sorted-by EventTime` flag, which informs DataFusion that the data is pre-sorted, allowing it to optimize away redundant sort operations.
+
+## Sort Pushdown
+
+Benchmarks for sort pushdown optimizations on TPC-H lineitem data (SF=1).
+
+### Variants
+
+| Benchmark | Description |
+|-----------|-------------|
+|`sort_pushdown`| Baseline — no `WITH ORDER`, tests standard sort behavior |
+|`sort_pushdown_sorted`| With `WITH ORDER` — tests sort elimination on sorted files |