Add SQL based benchmarking harness, port tpch to use framework #21707
Omega359 wants to merge 30 commits into apache:main from new_sql_benchmark
Conversation
…k to use this new framework. The README.md includes notes about other benchmarks which will be pushed after the initial work is accepted.
Moving back to draft as there are a number of improvements I want to make.
…r refactoring and writing the tests and I have reviewed all changes.
This should now be ready for review.
> The sql benchmarks are organized in sub‑directories that correspond to the benchmark suites that are commonly used
> in the community:
>
> | Benchmark Suite | Description |
This readme covers all the test suites that I've converted, for which I'll have additional PRs for each if this PR is merged.

I will review this coming week 😄
…ven if not in validate mode.
…ution in benchmark files - ${VARIABLE:-default|true value|false value}
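The variable substitution mentioned in that commit could work roughly like the sketch below. This is a simplified, hypothetical version that handles only the `${VAR:-default}` form; the `|true value|false value` branches from the commit message are omitted, and `substitute` is not the PR's actual function name.

```rust
use std::env;

// Hypothetical sketch of `${VAR:-default}` substitution in benchmark files.
// Handles only the default form; the PR's `|true value|false value`
// branches are not implemented here.
fn substitute(line: &str) -> String {
    let mut out = String::new();
    let mut rest = line;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        // Find the closing brace; pass unterminated input through unchanged.
        let Some(end) = rest[start..].find('}') else {
            out.push_str(&rest[start..]);
            return out;
        };
        let inner = &rest[start + 2..start + end];
        let (name, default) = inner.split_once(":-").unwrap_or((inner, ""));
        // Use the environment value if set, otherwise the default.
        out.push_str(&env::var(name).unwrap_or_else(|_| default.to_string()));
        rest = &rest[start + end + 1..];
    }
    out.push_str(rest);
    out
}
```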
>     for (group, benchmarks) in benchmarks {
>         let mut group = c.benchmark_group(group);
>         group.sample_size(10);

This isn't particularly useful tbh ... Criterion has a minimum sample size of 10 anyways. This is one of @adriangb's peeves with criterion.

maybe we can leave a comment to this effect
adriangb
left a comment
Generally looks great and is certainly a move in the right direction if we want to be able to run under Codspeed!
Since this doesn't break the current benchmarking setup the only cost to merging this and changing later is code churn, which is low.
My one blocking concern with this is the env vars. I think we should be able to / only support setting them via DATAFUSION_EXECUTION_TARGET_PARTITIONS=1 cargo bench ... and such.
> | PARTITIONS | Number of partitions to process in parallel. Defaults to number of available cores. |
> | BATCH_SIZE | Batch size when reading CSV or Parquet files. |
> | MEM_POOL_TYPE | The memory pool type to use, should be one of "fair" or "greedy". |
> | MEMORY_LIMIT | Memory limit (e.g. '100M', '1.5G'). If not specified, run all pre-defined memory limits for given query if there's any, otherwise run with no memory limit. |
> | DATAFUSION_RUNTIME_MEMORY_LIMIT | Used if MEMORY_LIMIT is not set. |
> | SORT_SPILL_RESERVATION_BYTES | The amount of memory to reserve for sort spill operations. DataFusion's default value will be used if not specified. |
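As an illustration of what handling a MEMORY_LIMIT value like '100M' or '1.5G' might involve, here is a hypothetical helper (not the PR's actual code) that assumes binary units:

```rust
// Hypothetical helper: parse "100M" / "1.5G" style limits into bytes,
// assuming binary units (1M = 2^20). Returns None on malformed input.
fn parse_memory_limit(s: &str) -> Option<u64> {
    let s = s.trim();
    let (num, mult) = match s.chars().last()? {
        'K' | 'k' => (&s[..s.len() - 1], 1u64 << 10),
        'M' | 'm' => (&s[..s.len() - 1], 1u64 << 20),
        'G' | 'g' => (&s[..s.len() - 1], 1u64 << 30),
        _ => (s, 1),
    };
    let value: f64 = num.parse().ok()?;
    Some((value * mult as f64) as u64)
}
```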
Why are these special? Don't we already have a pattern for setting this via env vars e.g. with DataFusion CLI that we can honor?
Those are basically what bench.sh already supported iirc ... just documented. Maybe not 'special' but not directly related to the sql benchmarks.
Do these env vars actually work? Maybe I'm missing something but I don't see how they get picked up.
It uses the existing code in options.rs via make_ctx() -> args.options.config()? -> self.update_config(config)
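For context, the convention an env-based config lookup like DataFusion's `SessionConfig::from_env` follows is roughly: a config key maps to an env var name by uppercasing and replacing dots with underscores. A simplified sketch of that mapping (the real logic lives in DataFusion's config code):

```rust
// Sketch of the config-key -> env-var naming convention, e.g.
// "datafusion.execution.target_partitions" ->
// "DATAFUSION_EXECUTION_TARGET_PARTITIONS".
fn config_key_to_env_var(key: &str) -> String {
    key.to_uppercase().replace('.', "_")
}
```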
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
I ran it and it seems to work well!
run benchmark tpch
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu). Comparing new_sql_benchmark (238d0a6) to a311d14 (merge-base), diff using: tpch. File an issue against this benchmark runner.
In general I agree, however I wanted to make the migration as seamless as possible for people that might have scripting in place already when we start to remove the rust implementations of the migrated benchmarks. I figure it's something that can easily be handled in the future.
Benchmark for this request hit the 7200s job deadline before finishing. Benchmarks requested: Kubernetes message. File an issue against this benchmark runner.
I am checking this one out
alamb
left a comment
Thank you @Omega359 - this is really neat
I went through this PR at a high level, but I don't think I was able to really grok all 5000 lines of it. Sorry for all the comments, but the PR is meaty!
One way to potentially make this easier to review is by narrowing it down to only targeting the basic framework and tpch q1 (maybe in a follow on PR) and splitting documentation into current and follow on work.
What is the end state
My biggest question is how we see the end state of this project. Do we envision all benchmarks go through this new benchmark binary?
Usability Suggestions
While I think the cargo bench / criterion integration is nice for integrating into benchmark tools, I think the API in this PR that mixes environment variables and criterion will be hard to use (e.g. there is no --help)
I would love to see at least some more "standard" CLI that took arguments -- similar to DuckDB's benchmark_runner binary -- maybe we could share the engine between that binary and the one that has a cargo interface
I also suggest making the .benchmark files more self contained (even if it means more repetition) -- see suggestions below
Lots of new (duplicate) code / tests
This seems like this PR adds A LOT of new code. If this is a temporary state (aka we can remove the other dfbench implementations after porting) this is fine, but I do think we should track the work necessary to remove the duplication
For example, do we plan to remove the current dfbench binary? Can we remove that (after we have ported over the other benchmarks)? Also this is like the 3rd copy of the TPCH queries we have in the repo. Can we remove the others as well?
bot is broken
It seems like the run benchmarks command from @adriangbot doesn't work with this current branch (maybe that runner just needs to be updated, etc)
Testing
I ran this locally and it appears to be working well
./benchmarks/bench.sh run tpch
+ env BENCH_NAME=tpch BENCH_SIZE=1 PREFER_HASH_JOIN=true TPCH_FILE_TYPE=parquet SIMULATE_LATENCY=false bash -c 'cargo bench --bench sql'
Finished `bench` profile [optimized] target(s) in 0.61s
Running benches/sql.rs (/Users/andrewlamb/Software/datafusion2/target/release/deps/sql-8a1afd26e6f7601f)
Gnuplot not found, using plotters backend
Loading benchmarks...
Loaded benchmarks in 24 ms ...
Loading tpch sf1 parquet data
tpch/Q01_sf1 time: [39.002 ms 40.001 ms 41.289 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q02_sf1 time: [14.088 ms 14.881 ms 16.246 ms]
Found 2 outliers among 10 measurements (20.00%)
2 (20.00%) high severe
Loading tpch sf1 parquet data
tpch/Q03_sf1 time: [31.011 ms 31.323 ms 31.623 ms]
Loading tpch sf1 parquet data
tpch/Q04_sf1 time: [13.304 ms 13.443 ms 13.633 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q05_sf1 time: [41.118 ms 41.541 ms 41.970 ms]
Loading tpch sf1 parquet data
tpch/Q06_sf1 time: [14.527 ms 14.777 ms 15.038 ms]
Loading tpch sf1 parquet data
tpch/Q07_sf1 time: [46.798 ms 47.635 ms 48.576 ms]
Loading tpch sf1 parquet data
tpch/Q08_sf1 time: [38.662 ms 39.162 ms 39.699 ms]
Loading tpch sf1 parquet data
tpch/Q09_sf1 time: [43.146 ms 43.654 ms 44.249 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
Loading tpch sf1 parquet data
tpch/Q10_sf1 time: [51.195 ms 51.522 ms 51.887 ms]
Loading tpch sf1 parquet data
tpch/Q11_sf1 time: [9.2154 ms 9.9021 ms 10.805 ms]
Found 2 outliers among 10 measurements (20.00%)
2 (20.00%) high severe
Loading tpch sf1 parquet data
tpch/Q12_sf1 time: [22.644 ms 23.593 ms 25.308 ms]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) low mild
1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q13_sf1 time: [27.692 ms 27.962 ms 28.354 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q14_sf1 time: [21.551 ms 21.656 ms 21.763 ms]
Loading tpch sf1 parquet data
tpch/Q15_sf1 time: [25.845 ms 26.057 ms 26.282 ms]
Loading tpch sf1 parquet data
tpch/Q16_sf1 time: [11.143 ms 11.731 ms 12.782 ms]
Found 2 outliers among 10 measurements (20.00%)
2 (20.00%) high severe
Loading tpch sf1 parquet data
tpch/Q17_sf1 time: [60.091 ms 62.720 ms 67.566 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q18_sf1 time: [59.880 ms 60.180 ms 60.503 ms]
Loading tpch sf1 parquet data
tpch/Q19_sf1 time: [29.152 ms 29.707 ms 30.378 ms]
Found 2 outliers among 10 measurements (20.00%)
2 (20.00%) high mild
Loading tpch sf1 parquet data
tpch/Q20_sf1 time: [30.738 ms 31.981 ms 34.291 ms]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) low mild
1 (10.00%) high severe
Loading tpch sf1 parquet data
tpch/Q21_sf1 time: [49.167 ms 49.481 ms 49.822 ms]
Loading tpch sf1 parquet data
tpch/Q22_sf1 time: [11.885 ms 12.367 ms 13.253 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
> This directory contains a collection of benchmarks each driven by a simple '.benchmark' text file and sql queries
> that exercise the DataFusion execution engine against a variety of benchmark suites. The sql benchmark framework
> is intentionally simple so that new benchmarks and queries can be added without touching the core engine.
Is the key that benchmarks don't need to touch the engine? I think it is more that changes to the benchmarks don't need to be recompiled, maybe
Both actually. In development when I was debugging something I was making changes to a benchmark file and running that single benchmark alone (./bench.sh run tpch 18)
nice -- maybe we can also mention the rationale for not having to recompile as well
> # Directives
>
> <table>
is there any reason to use html here rather than markdown?
There was a limitation with markdown and blockquotes that required falling back to an html table.
I see -- I found it harder to read
> | `tpcds` | TPC‑DS queries |
> | `tpch` | TPC‑H queries |
>
> ## How it works
I recommend linking to the duckdb benchmark runner as inspiration: https://duckdb.org/docs/current/dev/benchmark
Also I suggest following the same structure in this document and moving some of the sections up. Namely start with sections on "how to use it" before launching into the "how it works" section
For example, I think it is important to note how to run individual benchmarks:
cargo bench --bench sql

Also, we should show how to list all available benchmarks:
# does it work?
cargo bench --bench sql -- --list

And then how to compare runs (e.g. use critcmp).
We can do this as a follow on (I would be happy to help)
> does it work?
> cargo bench --bench sql -- --list
No, that doesn't work and as far as I can tell it's not possible. I tried for hours to get it to work but if it's possible I don't know how (https://bheisler.github.io/criterion.rs/book/faq.html#cargo-bench-gives-unrecognized-option-errors-for-valid-command-line-options)
That is why I think bench.sh is still required (besides for loading the data)
> | BENCH_PERSIST_RESULTS | true/false to persist benchmark results. Results will be persisted in csv format so be cognizant of the size of the results. |
> | BENCH_VALIDATE | true/false to validate benchmark results against persisted results or result_query's. If both `BENCH_PERSIST_RESULTS` and `BENCH_VALIDATE` are true, persist mode runs and validation is skipped. |
> | SIMULATE_LATENCY | Simulate object store latency to mimic remote storage (e.g. S3). Adds random latency in the range 20-200ms to each object store operation. |
> | PARTITIONS | Number of partitions to process in parallel. Defaults to number of available cores. |
It feels like we maybe could just reuse https://docs.rs/datafusion/latest/datafusion/prelude/struct.SessionConfig.html#method.from_env rather than implementing special cases for many of these options
Those already exist. They are not new. The only thing added here is documentation. I'm completely fine with removing the 'leftover' options to be honest but my goal at least initially was 'least pain' in migration.
So the vision is to migrate all benchmarks over to this new framework and then we'll remove the old one (and potentially remove all the old options)? I think that is sensible
> Example – Run the H2O window benchmarks on the 'small' sized CSV data files:
>
>     export BENCH_NAME=h2o
I recommend moving this to the start of the documentation rather than 317 lines down
I redid the readme.md a bit to hopefully address this. It's better but I suspect someone better at writing can help a lot here.
>     subgroup groupby
>
>     template sql_benchmarks/h2o/h2o.benchmark.template
It is strange to me that the variable definition here is after the template use 🤔
Sort of but that's the format that duckdb uses. It simplifies parsing that way.
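The duckdb-style format is simple to parse precisely because each directive's argument follows it on the same line. A rough sketch of such a parser (hypothetical; not the PR's actual implementation):

```rust
// Hypothetical sketch: parse duckdb-style `.benchmark` files where each
// non-empty, non-comment line is `<directive> <argument>`.
fn parse_directives(text: &str) -> Vec<(String, String)> {
    text.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .map(|l| match l.split_once(char::is_whitespace) {
            Some((k, v)) => (k.to_string(), v.trim().to_string()),
            None => (l.to_string(), String::new()), // bare directive, no argument
        })
        .collect()
}
```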
> example here is one of the benchmark files for the h2o benchmark suite:
>
>     subgroup groupby
I personally found the extra level of indirection required by these templates to add significantly to the mental load (maybe it is just me).
I would personally find it easier if the benchmarks had the property:

- All the relevant content was in a single file (aka inline the contents of the template)
- Grouping / subgrouping, etc were defined by file name rather than by an extra group/template directive -- that way the organization of the benchmarks is clear just from the file organization
Single file - sure, very possible. I'm not against it but it does introduce a fair bit of duplication and thus the possibility of copy/paste errors if/when benchmarks are updated. If others feel strongly about it I can make that change for the tpch benchmark in this PR easily enough.
I think I'd like to push the group/subgroup as nested directories feature to a follow on PR. I can do it here but it's a bit of churn for those who have already reviewed the code.
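If grouping did move to nested directories as suggested above, deriving the group and subgroup from the file path could look like the following sketch (hypothetical layout `<group>/<subgroup>/<file>.benchmark`; not code from the PR):

```rust
use std::path::Path;

// Hypothetical sketch: derive (group, subgroup) from a layout like
// "sql_benchmarks/h2o/groupby/q1.benchmark" -> ("h2o", "groupby").
fn group_from_path(path: &Path) -> Option<(String, String)> {
    let mut components = path.iter().rev().skip(1); // skip the file name
    let subgroup = components.next()?.to_string_lossy().into_owned();
    let group = components.next()?.to_string_lossy().into_owned();
    Some((group, subgroup))
}
```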
>     export BENCH_SUBGROUP=window
>     export H2O_BENCH_SIZE=small
>     export H20_FILE_TYPE=csv
>     cargo bench --bench sql
I found the combination of using cargo bench and environment variables confusing. Most cargo bench commands I have seen use command line flags -- for example I would expect
cargo bench --bench sql -- Q20
To run all benchmarks with Q20 in the name
However, in this case it did not seem to work that way
See my other response. If it's possible with criterion I don't know how. I spent a good 1/2 day trying.
that is fair.
Maybe we can consider some way to run the benchmarks not with criterion as well (aka have two different front ends to the same driver). As a follow on project perhaps
We can run with criterion, just not criterion and cargo bench with args together. Divan was the other benchmarking option but it has no ability to be programmatically configured (nvzqz/divan#90). Divan also hasn't been touched in over a year, which makes me question whether it's being maintained.
My thought was basically to have bench.sh be the frontend for most benchmark runs. If you want to do the persisting and validation of results you would need to use the manual cargo run with the appropriate env variables set. Otherwise bench.sh with whatever env var's you want preset (for the session context to pull into the config).
Perhaps the 'running a benchmark' part of the README.md here should be moved up a level and made more generic with bench.sh being the recommended interface for benchmarks and manual approaches documented.
Honestly though I really expect most people to just use the awesome bots we have now in github.
It’s actually #21625 triggering timeouts. I hope that will be fixed tomorrow. I will also comment some more on this PR tomorrow. @Omega359 and I have had several conversations offline about some of your points @alamb, but we probably should have discussed that more publicly as part of this PR. Some of it is in the tracking issue, some of it is not; either way it’s hard to track the context.
🤷🏻‍♂️ Most of the logic is in sql_benchmark.rs. Almost 2k lines of that file is tests. I can't really cut that down.
Think of it as the equivalent of .slt tests for benchmarks, with better statistics because of the use of criterion. All the existing benchmarks that can be redone to be completely sql based will be. I've got ports of clickbench, clickbench_extended, tpch, h2o, imdb, hj, nlj, smj, tpcds (even taxi which df doesn't currently have) in https://github.com/Omega359/arrow-datafusion/tree/sql_benchmark/benchmarks/sql_benchmarks waiting. Some benchmarks cannot easily be converted and will remain (cancellation for example).
So ... bench.sh? It's the current entrypoint and I don't see this PR changing that.
Sure. These files are fairly direct ports of what duckdb has. Didn't seem all that complex to me.
Yes, if this PR is accepted I plan on creating a number of PRs for followup work - adding migrations for the other benchmarks, removing current dfbench benchmarks post merge for each benchmark, improvements to bench.sh, etc. I see this PR as the initial step in the migration of the benchmarks from being code based to being primarily sql (ish) based. It's a lot of code, yes, but it's the minimum to be able to parse and run a duckdb style benchmark. Much of it is actually just tests.
alamb
left a comment
> @Omega359 and I have had several conversations offline about some of your points @alamb, but we …
As long as we have some plan to consolidate benchmarks I think doing it incrementally sounds like a great plan to me
I don't feel like any of the issues I raised is a blocker, and this feels like a step in the right direction to me. I would be in favor of merging it in and then iterating on it as a series of follow ons
Which issue does this PR close?
Rationale for this change
Add a sql based benchmark framework with tpch as the initial benchmark to use this new framework. The README.md includes notes about other benchmarks which will have individual PRs after the initial work is accepted.
What changes are included in this PR?
Benchmarking code only.
Are these changes tested?
Yes
Are there any user-facing changes?
benchmarks/bench.sh now uses the new framework for benchmarking tpch
Additional info
AI assisted with refactoring and writing tests. I have reviewed all AI produced code.