Add SQL based benchmarking harness, port tpch to use framework #21707

Open
Omega359 wants to merge 30 commits into apache:main from Omega359:new_sql_benchmark

Conversation

@Omega359
Contributor

@Omega359 Omega359 commented Apr 17, 2026

Which issue does this PR close?

Rationale for this change

Add a SQL-based benchmark framework, with tpch as the initial benchmark ported to it. The README.md includes notes about other benchmarks, which will follow in individual PRs after the initial work is accepted.

What changes are included in this PR?

Benchmarking code only.

Are these changes tested?

Yes

Are there any user-facing changes?

benchmarks/bench.sh now uses the new framework for benchmarking tpch

Additional info

AI assisted with refactoring and writing tests. I have reviewed all AI produced code.

…k to use this new framework. The README.md includes notes about other benchmarks which will be pushed after the initial work is accepted.
@Omega359 Omega359 marked this pull request as ready for review April 17, 2026 19:37
@Omega359 Omega359 marked this pull request as draft April 18, 2026 16:50
@Omega359
Contributor Author

Moving back to draft as there are a number of improvements I want to make.

…r refactoring and writing the tests and I have reviewed all changes.
@Omega359 Omega359 marked this pull request as ready for review April 18, 2026 21:32
@Omega359
Contributor Author

This should now be ready for review.

The sql benchmarks are organized in sub‑directories that correspond to the benchmark suites that are commonly used
in the community:

| Benchmark Suite | Description |
Contributor Author


This readme covers all the test suites I've converted; I'll open an additional PR for each of them if this PR is merged.

@adriangb
Contributor

I will review this coming week 😄

@Omega359 Omega359 marked this pull request as draft April 20, 2026 16:20
@Omega359 Omega359 marked this pull request as ready for review April 20, 2026 16:20
Comment thread benchmarks/bench.sh Outdated
Comment thread benchmarks/bench.sh Outdated
Comment thread benchmarks/benches/sql.rs

for (group, benchmarks) in benchmarks {
    let mut group = c.benchmark_group(group);
    group.sample_size(10);
Member


Should this be configurable ?

Contributor Author


This isn't particularly useful tbh ... Criterion has a minimum sample size of 10 anyway. This is one of @adriangb's peeves with criterion.

Contributor


Sadly...

Contributor


maybe we can leave a comment to this effect

Comment thread benchmarks/benches/sql.rs Outdated
Comment thread benchmarks/benches/sql.rs Outdated
Comment thread benchmarks/sql_benchmarks/tpch/queries/q22.sql Outdated
Comment thread benchmarks/src/sql_benchmark.rs Outdated
Comment thread benchmarks/src/sql_benchmark.rs Outdated
Comment thread benchmarks/sql_benchmarks/tpch/queries/q10.sql Outdated
Comment thread benchmarks/src/sql_benchmark.rs
Contributor

@adriangb adriangb left a comment


Generally looks great and is certainly a move in the right direction if we want to be able to run under Codspeed!

Since this doesn't break the current benchmarking setup the only cost to merging this and changing later is code churn, which is low.

My one blocking concern with this is the env vars. I think we should be able to / only support setting them via DATAFUSION_EXECUTION_TARGET_PARTITIONS=1 cargo bench ... and such.

Comment thread benchmarks/sql_benchmarks/README.md Outdated
Comment on lines +307 to +312
| PARTITIONS | Number of partitions to process in parallel. Defaults to number of available cores. |
| BATCH_SIZE | Batch size when reading CSV or Parquet files. |
| MEM_POOL_TYPE | The memory pool type to use, should be one of "fair" or "greedy". |
| MEMORY_LIMIT | Memory limit (e.g. '100M', '1.5G'). If not specified, run all pre-defined memory limits for given query if there's any, otherwise run with no memory limit. |
| DATAFUSION_RUNTIME_MEMORY_LIMIT | Used if MEMORY_LIMIT is not set. |
| SORT_SPILL_RESERVATION_BYTES | The amount of memory to reserve for sort spill operations. DataFusion's default value will be used if not specified. |
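A memory limit string in the format shown above ('100M', '1.5G') could be parsed roughly along these lines. This is an illustrative sketch only, not the parsing code this PR actually uses:

```rust
// Illustrative sketch (not this PR's code) of parsing MEMORY_LIMIT values
// like "100M" or "1.5G" into a byte count. Unsuffixed values are bytes.
fn parse_memory_limit(s: &str) -> Option<u64> {
    let s = s.trim();
    let (num, mult) = match s.chars().last()? {
        'K' | 'k' => (&s[..s.len() - 1], 1u64 << 10),
        'M' | 'm' => (&s[..s.len() - 1], 1u64 << 20),
        'G' | 'g' => (&s[..s.len() - 1], 1u64 << 30),
        _ => (s, 1),
    };
    let value: f64 = num.parse().ok()?;
    Some((value * mult as f64) as u64)
}

fn main() {
    assert_eq!(parse_memory_limit("100M"), Some(100 * 1024 * 1024));
    assert_eq!(parse_memory_limit("1.5G"), Some(1_610_612_736));
    assert_eq!(parse_memory_limit("123"), Some(123));
    assert_eq!(parse_memory_limit("abc"), None);
    println!("ok");
}
```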
Contributor


Why are these special? Don't we already have a pattern for setting this via env vars e.g. with DataFusion CLI that we can honor?

Contributor Author


Those are basically what bench.sh already supported iirc ... just documented. Maybe not 'special' but not directly related to the sql benchmarks.

Contributor


Do these env vars actually work? Maybe I'm missing something but I don't see how they get picked up.

Contributor Author

@Omega359 Omega359 Apr 22, 2026


It uses the existing code in options.rs via make_ctx() -> args.options.config()? -> self.update_config(config)
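The fallback behavior the env-var table describes (an explicit value wins, otherwise a default such as the core count) can be sketched as follows. `partitions_from` is a hypothetical helper for illustration, not code from options.rs:

```rust
// Hypothetical sketch of env-var fallback for a setting like PARTITIONS:
// use the explicit value when present and parsable, otherwise the default
// (e.g. the number of available cores). Not the actual options.rs code.
fn partitions_from(env_value: Option<&str>, default: usize) -> usize {
    env_value
        .and_then(|v| v.trim().parse().ok())
        .unwrap_or(default)
}

fn main() {
    assert_eq!(partitions_from(Some("4"), 16), 4);     // explicit value wins
    assert_eq!(partitions_from(None, 16), 16);         // unset -> default
    assert_eq!(partitions_from(Some("oops"), 16), 16); // unparsable -> default
    println!("ok");
}
```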

Comment thread benchmarks/benches/sql.rs
Comment thread benchmarks/benches/sql.rs
Comment thread benchmarks/benches/sql.rs

for (group, benchmarks) in benchmarks {
    let mut group = c.benchmark_group(group);
    group.sample_size(10);
Contributor


Sadly...

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
@alamb
Contributor

alamb commented Apr 22, 2026

I ran

./benchmarks/bench.sh run tpch

And it seems to work well!

@alamb
Contributor

alamb commented Apr 22, 2026

run benchmark tpch

@alamb alamb changed the title Add SQL based benchmarking Add SQL based benchmarking harness, port tpch to use framework Apr 22, 2026
@alamb alamb added the performance Make DataFusion faster label Apr 22, 2026
@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4292804307-1728-6qvbk 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing new_sql_benchmark (238d0a6) to a311d14 (merge-base) diff using: tpch
Results will be posted here when complete


File an issue against this benchmark runner

Omega359 and others added 3 commits April 21, 2026 22:20
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Omega359 and others added 13 commits April 21, 2026 22:21
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
@Omega359
Contributor Author

My one blocking concern with this is the env vars. I think we should be able to / only support setting them via DATAFUSION_EXECUTION_TARGET_PARTITIONS=1 cargo bench ... and such.

In general I agree; however, I wanted to make the migration as seamless as possible for people who might already have scripting in place when we start to remove the Rust implementations of the migrated benchmarks. I figure it's something that can easily be handled in the future.

@adriangbot

Benchmark for this request hit the 7200s job deadline before finishing.

Benchmarks requested: tpch

Kubernetes message
Job was active longer than specified deadline

File an issue against this benchmark runner

@alamb
Contributor

alamb commented Apr 22, 2026

I am checking this one out

Contributor

@alamb alamb left a comment


Thank you @Omega359 - this is really neat

I went through this PR at a high level, but I don't think I was able to really grok all 5000 lines of it. Sorry for all the comments, but the PR is meaty!

One way to potentially make this easier to review is by narrowing it down to only targeting the basic framework and tpch q1 (maybe in a follow-on PR) and splitting the documentation into current and follow-on work.

What is the end state

My biggest question is how we see the end state of this project. Do we envision all benchmarks go through this new benchmark binary?

Usability Suggestions

While I think the cargo bench / criterion integration is nice for integrating into benchmark tools, I think the API in this PR that mixes environment variables and criterion will be hard to use (e.g. there is no --help)

I would love to see at least some more "standard" CLI that took arguments -- similar to DuckDB's benchmark_runner binary -- maybe we could share the engine between that binary and the one that has a cargo interface

I also suggest making the .benchmark files more self contained (even if it means more repetition) -- see suggestions below

Lots of new (duplicate) code / tests

This seems like this PR adds A LOT of new code. If this is a temporary state (aka we can remove the other dfbench implementations after porting) this is fine, but I do think we should track the work necessary to remove the duplication

For example, do we plan to remove the current dfbench binary? Can we remove that (after we have ported over the other benchmarks)? Also this is like the 3rd copy of the TPCH queries we have in the repo. Can we remove the other as well

bot is broken

It seems like the run benchmarks command from @adriangbot doesn't work with this current branch (maybe that runner just needs to be updated, etc.)

Testing

I ran this locally and it appears to be working well

./benchmarks/bench.sh run tpch
+ env BENCH_NAME=tpch BENCH_SIZE=1 PREFER_HASH_JOIN=true TPCH_FILE_TYPE=parquet SIMULATE_LATENCY=false bash -c 'cargo bench --bench sql'
    Finished `bench` profile [optimized] target(s) in 0.61s
     Running benches/sql.rs (/Users/andrewlamb/Software/datafusion2/target/release/deps/sql-8a1afd26e6f7601f)
Gnuplot not found, using plotters backend
Loading benchmarks...
Loaded benchmarks in 24 ms ...
Loading tpch sf1 parquet data
tpch/Q01_sf1            time:   [39.002 ms 40.001 ms 41.289 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q02_sf1            time:   [14.088 ms 14.881 ms 16.246 ms]
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe
Loading tpch sf1 parquet data
tpch/Q03_sf1            time:   [31.011 ms 31.323 ms 31.623 ms]
Loading tpch sf1 parquet data
tpch/Q04_sf1            time:   [13.304 ms 13.443 ms 13.633 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q05_sf1            time:   [41.118 ms 41.541 ms 41.970 ms]
Loading tpch sf1 parquet data
tpch/Q06_sf1            time:   [14.527 ms 14.777 ms 15.038 ms]
Loading tpch sf1 parquet data
tpch/Q07_sf1            time:   [46.798 ms 47.635 ms 48.576 ms]
Loading tpch sf1 parquet data
tpch/Q08_sf1            time:   [38.662 ms 39.162 ms 39.699 ms]
Loading tpch sf1 parquet data
tpch/Q09_sf1            time:   [43.146 ms 43.654 ms 44.249 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Loading tpch sf1 parquet data
tpch/Q10_sf1            time:   [51.195 ms 51.522 ms 51.887 ms]
Loading tpch sf1 parquet data
tpch/Q11_sf1            time:   [9.2154 ms 9.9021 ms 10.805 ms]
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe
Loading tpch sf1 parquet data
tpch/Q12_sf1            time:   [22.644 ms 23.593 ms 25.308 ms]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q13_sf1            time:   [27.692 ms 27.962 ms 28.354 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q14_sf1            time:   [21.551 ms 21.656 ms 21.763 ms]
Loading tpch sf1 parquet data
tpch/Q15_sf1            time:   [25.845 ms 26.057 ms 26.282 ms]
Loading tpch sf1 parquet data
tpch/Q16_sf1            time:   [11.143 ms 11.731 ms 12.782 ms]
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe
Loading tpch sf1 parquet data
tpch/Q17_sf1            time:   [60.091 ms 62.720 ms 67.566 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q18_sf1            time:   [59.880 ms 60.180 ms 60.503 ms]
Loading tpch sf1 parquet data
tpch/Q19_sf1            time:   [29.152 ms 29.707 ms 30.378 ms]
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild
Loading tpch sf1 parquet data
tpch/Q20_sf1            time:   [30.738 ms 31.981 ms 34.291 ms]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q21_sf1            time:   [49.167 ms 49.481 ms 49.822 ms]
Loading tpch sf1 parquet data
tpch/Q22_sf1            time:   [11.885 ms 12.367 ms 13.253 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

Comment thread benchmarks/sql_benchmarks/README.md Outdated

This directory contains a collection of benchmarks each driven by a simple '.benchmark' text file and sql queries
that exercise the DataFusion execution engine against a variety of benchmark suites. The sql benchmark framework
is intentionally simple so that new benchmarks and queries can be added without touching the core engine.
Contributor


Is the key that benchmarks don't need to touch the engine? I think it is more that changes to the benchmarks don't need to be recompiled, maybe.

Contributor Author


Both actually. In development when I was debugging something I was making changes to a benchmark file and running that single benchmark alone (./bench.sh run tpch 18)

Contributor

@alamb alamb Apr 22, 2026


nice -- maybe we can also mention the rationale for not having to recompile as well


# Directives

<table>
Contributor


is there any reason to use html here rather than markdown?

Contributor Author


There was a limitation with markdown and blockquotes that required falling back to an html table.

Contributor


I see -- I found it harder to read

| `tpcds` | TPC‑DS queries |
| `tpch` | TPC‑H queries |

## How it works
Contributor


I recommend linking to the duckdb benchmark runner as inspiration: https://duckdb.org/docs/current/dev/benchmark

Also I suggest following the same structure in this document and moving some of the sections up. Namely start with sections on "how to use it" before launching into the "how it works" section

For example, I think it is important to note how to run individual benchmarks:

    cargo bench --bench sql

Also, we should show how to list all available benchmarks:

    # does it work?
    cargo bench --bench sql -- --list

And then how to compare runs (e.g. use critcmp).

We can do this as a follow on (I would be happy to help)

Contributor Author


does it work?

cargo bench --bench sql -- --list

No, that doesn't work, and as far as I can tell it's not possible. I tried for hours to get it to work; if it's possible, I don't know how (https://bheisler.github.io/criterion.rs/book/faq.html#cargo-bench-gives-unrecognized-option-errors-for-valid-command-line-options)

That is why I think bench.sh is still required (besides for loading the data)
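Since criterion benches can't take custom CLI flags under `cargo bench` (per the FAQ linked above), selection via environment variables is the usual workaround. A minimal sketch of such name filtering; the substring-match semantics and variable usage here are an assumption, not necessarily what this PR implements:

```rust
// Minimal sketch of env-var based benchmark selection, a common workaround
// when criterion can't accept custom CLI flags via `cargo bench`.
// Substring matching on the benchmark name is an assumed design choice.
fn matches_filter(bench_name: &str, filter: Option<&str>) -> bool {
    match filter {
        Some(f) => bench_name.contains(f),
        None => true, // no filter set: run everything
    }
}

fn main() {
    assert!(matches_filter("tpch/Q18_sf1", Some("Q18")));
    assert!(!matches_filter("tpch/Q01_sf1", Some("Q18")));
    assert!(matches_filter("tpch/Q01_sf1", None));
    println!("ok");
}
```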

Comment thread benchmarks/benches/sql.rs

for (group, benchmarks) in benchmarks {
    let mut group = c.benchmark_group(group);
    group.sample_size(10);
Contributor


maybe we can leave a comment to this effect

Comment thread benchmarks/benches/sql.rs
Comment thread benchmarks/sql_benchmarks/README.md Outdated
| BENCH_PERSIST_RESULTS | true/false to persist benchmark results. Results will be persisted in csv format so be cognizant of the size of the results. |
| BENCH_VALIDATE | true/false to validate benchmark results against persisted results or result_query's. If both `BENCH_PERSIST_RESULTS` and `BENCH_VALIDATE` are true, persist mode runs and validation is skipped. |
| SIMULATE_LATENCY | Simulate object store latency to mimic remote storage (e.g. S3). Adds random latency in the range 20-200ms to each object store operation. |
| PARTITIONS | Number of partitions to process in parallel. Defaults to number of available cores. |
Contributor


It feels like we maybe could just reuse https://docs.rs/datafusion/latest/datafusion/prelude/struct.SessionConfig.html#method.from_env rather than implementing special cases for many of these options

Contributor Author


Those already exist. They are not new. The only thing added here is documentation. I'm completely fine with removing the 'leftover' options to be honest but my goal at least initially was 'least pain' in migration.

Contributor


So the vision is to migrate all benchmarks over to this new framework and then remove the old one (and potentially remove all the old options)? I think that is sensible.

Comment thread benchmarks/sql_benchmarks/README.md Outdated
Example – Run the H2O window benchmarks on the 'small' sized CSV data files:

```bash
export BENCH_NAME=h2o
```
Contributor


I recommend moving this to the start of the documentation rather than 317 lines down

Contributor Author


I redid the readme.md a bit to hopefully address this. It's better but I suspect someone better at writing can help a lot here.

```
subgroup groupby

template sql_benchmarks/h2o/h2o.benchmark.template
```
Contributor


It is strange to me that the variable definition here is after the template use 🤔

Contributor Author


Sort of, but that's the format that duckdb uses. It simplifies parsing that way.
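For illustration, a duckdb-style directive file of this shape (one "directive value" pair per line) can be parsed in a single pass, which is the simplification mentioned above. A hedged sketch, not this PR's actual parser:

```rust
use std::collections::HashMap;

// Minimal sketch of parsing duckdb-style '.benchmark' directive lines:
// each non-empty, non-comment line is "<directive> <value>". The directive
// names come from the examples in this thread; the parser is illustrative.
fn parse_benchmark(text: &str) -> HashMap<String, String> {
    text.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .filter_map(|l| {
            let (k, v) = l.split_once(char::is_whitespace)?;
            Some((k.to_string(), v.trim().to_string()))
        })
        .collect()
}

fn main() {
    let file = "\
# h2o groupby benchmark
subgroup groupby
template sql_benchmarks/h2o/h2o.benchmark.template
";
    let directives = parse_benchmark(file);
    assert_eq!(directives.get("subgroup").map(String::as_str), Some("groupby"));
    assert!(directives.get("template").is_some());
    println!("parsed {} directives", directives.len());
}
```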

example here is one of the benchmark files for the h2o benchmark suite:

```
subgroup groupby
```
Contributor


I personally found the extra level of indirection required by these templates to add significantly to the mental load (maybe it is just me).

I would personally find it easier if the benchmarks had the property:

  1. All the relevant content was in a single file (aka inline the contents of the template)
  2. Grouping / subgrouping, etc were defined by file name rather than by an extra group template -- that way the organization of the benchmarks is clear just from the file organization

Contributor Author


Single file - sure, very possible. I'm not against it but it does introduce a fair bit of duplication and thus the possibility of copy/paste errors if/when benchmarks are updated. If others feel strongly about it I can make that change for the tpch benchmark in this PR easily enough.

I think I'd like to push the group/subgroup as nested directories feature to a follow on PR. I can do it here but it's a bit of churn for those who have already reviewed the code.

Comment thread benchmarks/sql_benchmarks/README.md Outdated
export BENCH_SUBGROUP=window
export H2O_BENCH_SIZE=small
export H20_FILE_TYPE=csv
cargo bench --bench sql
Contributor


I found the combination of using cargo bench and environment variables confusing. Most cargo bench commands I have seen use command line flags -- for example I would expect

cargo bench --bench sql -- Q20

To run all benchmarks with Q20 in the name

However, in this case it did not seem to work that way

Contributor Author


See my other response. If it's possible with criterion I don't know how. I spent a good 1/2 day trying.

Contributor


that is fair.

Maybe we can consider some way to run the benchmarks not with criterion as well (aka have two different front ends to the same driver). As a follow on project perhaps

Contributor Author


We can run with criterion, just not with criterion and cargo bench args together. Divan was the other benchmarking option, but it has no ability to be programmatically configured (nvzqz/divan#90). Divan also hasn't been touched in over a year, which makes me question whether it's being maintained.

My thought was basically to have bench.sh be the frontend for most benchmark runs. If you want to do the persisting and validation of results you would need to use a manual cargo run with the appropriate env variables set. Otherwise, bench.sh with whatever env vars you want preset (for the session context to pull into the config).

Perhaps the 'running a benchmark' part of the README.md here should be moved up a level and made more generic with bench.sh being the recommended interface for benchmarks and manual approaches documented.

Honestly though I really expect most people to just use the awesome bots we have now in github.

@adriangb
Contributor

It seems like the run benchmarks command from @adriangbot doesn't work with this current branch (maybe that runner just needs to be updated, etc.)

It’s actually #21625 triggering timeouts. I hope that will be fixed tomorrow. I will also comment some more on this PR tomorrow. @Omega359 and I have had several conversations offline about some of your points @alamb, but we probably should have discussed that more publicly as part of this PR. Some of it is in the tracking issue, some of it is not, either way it’s hard to track the context.

@Omega359
Contributor Author

One way to potentially make this easier to review is by narrowing it down to only targeting the basic framework and tpch q1 (maybe in a follow-on PR) and splitting the documentation into current and follow-on work.

🤷🏻‍♂️ Most of the logic is in sql_benchmark.rs. Almost 2k lines of that file is tests. I can't really cut that down.

What is the end state

My biggest question is how we see the end state of this project. Do we envision all benchmarks go through this new benchmark binary?

Think of it as the equivalent to .slt tests for benchmarks, with better statistics because of the use of criterion. All the existing benchmarks that can be redone to be completely sql based will be. I've got ports of clickbench, clickbench_extended, tpch, h2o, imdb, hj, nlj, smj, tpcds (even taxi, which df doesn't currently have) in https://github.com/Omega359/arrow-datafusion/tree/sql_benchmark/benchmarks/sql_benchmarks waiting. Some benchmarks cannot easily be converted and will remain (cancellation, for example).

Usability Suggestions

While I think the cargo bench / criterion integration is nice for integrating into benchmark tools, I think the API in this PR that mixes environment variables and criterion will be hard to use (e.g. there is no --help)

I would love to see at least some more "standard" CLI that took arguments -- similar to DuckDB's benchmark_runner binary -- maybe we could share the engine between that binary and the one that has a cargo interface

So ... bench.sh? It's the current entrypoint and I don't see this PR changing that.

I also suggest making the .benchmark files more self contained (even if it means more repetition)
-- see suggestions below

Sure. These files are fairly direct ports of what duckdb has. Didn't seem all that complex to me.

Lots of new (duplicate) code / tests

This seems like this PR adds A LOT of new code. If this is a temporary state (aka we can remove the other dfbench implementations after porting) this is fine, but I do think we should track the work necessary to remove the duplication

For example, do we plan to remove the current dfbench binary? Can we remove that (after we have ported over the other benchmarks)? Also this is like the 3rd copy of the TPCH queries we have in the repo. Can we remove the other as well

Yes, if this PR is accepted I plan on creating a number of PRs for follow-up work: adding migrations for the other benchmarks, removing the current dfbench benchmarks post-merge for each migrated benchmark, improvements to bench.sh, etc.

I see this PR as the initial step in the migration of the benchmarks from being code based to being primarily sql (ish) based. It's a lot of code, yes, but it's the minimum needed to parse and run a duckdb style benchmark. Much of it is actually just tests.

Contributor

@alamb alamb left a comment


@Omega359 and I have had several conversations offline about some of your points @alamb, but we

As long as we have some plan to consolidate benchmarks I think doing it incrementally sounds like a great plan to me

I don't feel like any of the issues I raised is a blocker, and this feels like a step in the right direction to me. I would be in favor of merging it in and then iterating on it as a series of follow ons

@alamb
Contributor

alamb commented Apr 22, 2026

Thank you for all the work on this @Omega359 and @adriangb
