| 1 | +# Draft reply: PR #20940 (`MultiDistinctToCrossJoin`) vs `MultiDistinctCountRewrite` |
| 2 | + |
| 3 | +Paste into [apache/datafusion#20940](https://github.com/apache/datafusion/pull/20940) as a comment (GitHub-flavored Markdown). |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +Hi @Dandandan — thanks for this work; the cross-join split for **multiple distinct aggregates with no `GROUP BY`** is a strong fit for workloads like ClickBench extended. |
| 8 | + |
| 9 | +I’ve been working on a related but **different** pattern: **`GROUP BY` + several `COUNT(DISTINCT …)`** in the same aggregate (typical of BI queries). Your rule **does not apply** there, because `MultiDistinctToCrossJoin` requires an **empty** `GROUP BY` and requires **every** aggregate to be a distinct aggregate, each on a different column. |
| 10 | + |
| 11 | +A concrete example from our benchmark suite (**category `08_complex_analytical`, query `Q8.3`**) on an `orders_data` table: |
| 12 | + |
| 13 | +```sql |
| 14 | +-- Q8.3: Seller performance analysis |
| 15 | +SELECT |
| 16 | + seller_name, |
| 17 | + COUNT(*) as total_orders, |
| 18 | + COUNT(DISTINCT delivery_city) as cities_served, |
| 19 | + COUNT(DISTINCT state) as states_served, |
| 20 | + SUM(CASE WHEN order_status = 'Completed' THEN 1 ELSE 0 END) as completed_orders, |
| 21 | + SUM(CASE WHEN order_status = 'Cancelled' THEN 1 ELSE 0 END) as cancelled_orders, |
| 22 | + ROUND(100.0 * SUM(CASE WHEN order_status = 'Completed' THEN 1 ELSE 0 END) / COUNT(*), 2) as success_rate |
| 23 | +FROM orders_data |
| 24 | +GROUP BY seller_name |
| 25 | +HAVING COUNT(*) > 100 |
| 26 | +ORDER BY total_orders DESC |
| 27 | +LIMIT 100; |
| 28 | +``` |
| 29 | + |
| 30 | +This is **not** “global” multi-distinct: it’s **per `seller_name`**, with **multiple `COUNT(DISTINCT …)`** plus other aggregates. That’s the class my optimizer rule (`MultiDistinctCountRewrite`) targets — rewriting the **`COUNT(DISTINCT …)`** pieces into **joinable sub-aggregates aligned on the same `GROUP BY` keys**, with correct `NULL` handling where needed. |
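
To make the shape of that rewrite concrete, here is a minimal sketch that runs a cut-down Q8.3 and a hand-written join form against SQLite and checks that they agree. Everything in it is illustrative: the sample rows, the `d0`/`d1` aliases, and the use of SQLite are assumptions, and this is not the plan the rule emits. In particular, the sketch uses `LEFT JOIN` + `COALESCE` so that a seller whose distinct column is entirely `NULL` still appears with a count of 0, whereas the optimizer tests below describe inner joins; how the rule itself covers that case is not shown here.

```python
import sqlite3

ROWS = [
    # (seller_name, delivery_city, state) -- made-up sample data
    ("acme", "Pune", "MH"),
    ("acme", "Pune", "MH"),
    ("acme", "Mumbai", "MH"),
    ("acme", None, "KA"),
    ("bolt", None, None),  # every distinct input is NULL for this seller
    ("bolt", None, None),
    ("core", "Delhi", "DL"),
]

# Cut-down Q8.3: one non-distinct aggregate plus two COUNT(DISTINCT ...).
ORIGINAL = """
SELECT seller_name,
       COUNT(*) AS total_orders,
       COUNT(DISTINCT delivery_city) AS cities_served,
       COUNT(DISTINCT state) AS states_served
FROM orders_data
GROUP BY seller_name
ORDER BY seller_name
"""

# Hand-written join form: a base aggregate for the non-distinct part and one
# de-duplicated, NULL-filtered branch per distinct column, all keyed on the
# GROUP BY column. NULL group keys would need a null-safe join (not shown).
REWRITTEN = """
SELECT mdc_base.seller_name,
       mdc_base.total_orders,
       COALESCE(d0.n, 0) AS cities_served,
       COALESCE(d1.n, 0) AS states_served
FROM (SELECT seller_name, COUNT(*) AS total_orders
      FROM orders_data
      GROUP BY seller_name) AS mdc_base
LEFT JOIN (SELECT seller_name, COUNT(*) AS n
           FROM (SELECT DISTINCT seller_name, delivery_city
                 FROM orders_data
                 WHERE delivery_city IS NOT NULL) AS p0
           GROUP BY seller_name) AS d0
       ON d0.seller_name = mdc_base.seller_name
LEFT JOIN (SELECT seller_name, COUNT(*) AS n
           FROM (SELECT DISTINCT seller_name, state
                 FROM orders_data
                 WHERE state IS NOT NULL) AS p1
           GROUP BY seller_name) AS d1
       ON d1.seller_name = mdc_base.seller_name
ORDER BY mdc_base.seller_name
"""


def run(sql):
    """Load ROWS into a fresh in-memory database and return the query result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders_data "
            "(seller_name TEXT, delivery_city TEXT, state TEXT)"
        )
        conn.executemany("INSERT INTO orders_data VALUES (?, ?, ?)", ROWS)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    assert run(ORIGINAL) == run(REWRITTEN)
    print(run(REWRITTEN))
```

The `bolt` rows are the interesting case: with an inner join against the NULL-filtered branches, that seller would drop out of the result entirely.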
| 31 | + |
| 32 | +So in simple terms: |
| 33 | + |
| 34 | +| | **Your PR (`MultiDistinctToCrossJoin`)** | **My work (`MultiDistinctCountRewrite`)** | |
| 35 | +|---|------------------------------------------|-------------------------------------------| |
| 36 | +| **Typical SQL** | `SELECT COUNT(DISTINCT a), COUNT(DISTINCT b) FROM t` (no `GROUP BY`) | `SELECT …, COUNT(DISTINCT x), COUNT(DISTINCT y), … FROM t GROUP BY …` | |
| 37 | +| **Example workload** | ClickBench extended–style **Q0 / Q1** | Our **Q8.3** (and similar grouped BI queries) | |
| 38 | + |
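| 39 | +They’re **complementary**: different predicates, different plans, and they can **coexist** in the optimizer pipeline. We’d want to sanity-check rule order so we don’t rewrite the same node twice; the global (no `GROUP BY`) all-distinct case is the one place where both rules could match, since `rewrites_global_three_count_distinct` below exercises my rule on that shape. |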
| 40 | + |
| 41 | +--- |
| 42 | + |
| 43 | +## Tests for `MultiDistinctCountRewrite` (what they cover) |
| 44 | + |
| 45 | +### Optimizer unit tests — `datafusion/optimizer/src/multi_distinct_count_rewrite.rs` |
| 46 | + |
| 47 | +| Test | What it asserts | |
| 48 | +|------|-----------------| |
| 49 | +| `rewrites_two_count_distinct` | `GROUP BY a` + `COUNT(DISTINCT b)`, `COUNT(DISTINCT c)` → inner joins, per-branch null filters on `b`/`c`, `mdc_base` + two `mdc_d` aliases. | |
| 50 | +| `rewrites_global_three_count_distinct` | No `GROUP BY`, three `COUNT(DISTINCT …)` → cross/inner join rewrite; **no** `mdc_base` (global-only path). | |
| 51 | +| `rewrites_two_count_distinct_with_non_distinct_count` | Grouped BI-style: two distincts + `COUNT(a)` → join rewrite with **`mdc_base`** holding the non-distinct agg. | |
| 52 | +| `does_not_rewrite_two_count_distinct_same_column` | Two `COUNT(DISTINCT b)` with different aliases → **no** rewrite (duplicate distinct key). | |
| 53 | +| `does_not_rewrite_single_count_distinct` | Only one `COUNT(DISTINCT …)` → **no** rewrite (rule needs ≥2 distincts). | |
| 54 | +| `rewrites_three_count_distinct_grouped` | Three grouped `COUNT(DISTINCT …)` on `b`, `c`, `a` → **two** inner joins + `mdc_base`. | |
| 55 | +| `rewrites_interleaved_non_distinct_between_distincts` | Order `COUNT(DISTINCT b)`, `COUNT(a)`, `COUNT(DISTINCT c)` → rewrite + `mdc_base` for the middle non-distinct agg (projection order / interleaving). | |
| 56 | +| `rewrites_count_distinct_on_cast_exprs` | `COUNT(DISTINCT CAST(b AS Int64))`, same for `c` → rewrite + null filters on the **cast** expressions. | |
| 57 | +| `does_not_rewrite_grouping_sets_multi_distinct` | `GROUPING SETS` aggregate with two `COUNT(DISTINCT …)` → **no** rewrite (rule bails on grouping sets). | |
| 58 | +| `does_not_rewrite_mixed_agg` | `COUNT(DISTINCT b)` + `COUNT(c)` → **no** rewrite (only **one** `COUNT(DISTINCT …)`; rule requires at least two). | |
| 59 | + |
| 60 | +### SQL integration — `datafusion/core/tests/sql/aggregates/multi_distinct_count_rewrite.rs` |
| 61 | + |
| 62 | +| Test | What it asserts | |
| 63 | +|------|-----------------| |
| 64 | +| `multi_count_distinct_matches_expected_with_nulls` | End-to-end grouped query with two `COUNT(DISTINCT …)` and **NULLs** in the distinct columns; compares the sorted, formatted result batches against the expected counts. | |
| 65 | +| `multi_count_distinct_with_count_star_matches_expected` | `COUNT(*)` plus two `COUNT(DISTINCT …)` per group (BI-style); exact result table. | |
| 66 | +| `multi_count_distinct_two_group_keys_matches_expected` | **`GROUP BY g1, g2`** + two distincts; verifies that the joins line up on **all** group keys and that the counts match. | |
| 67 | +| `multi_count_distinct_lower_matches_expected_case_collapsing` | `COUNT(DISTINCT lower(b))` with `'Abc'` / `'aBC'` plus a second distinct on `c` → **one** distinct lowered value, **two** raw `c` values (semantics follow the expression inside `COUNT(DISTINCT …)`, not raw `b`). | |
| 68 | +| `multi_count_distinct_cast_float_to_int_collapses_nearby_values` | `COUNT(DISTINCT CAST(x AS INT))` with `1.2` / `1.3` (both → `1`) plus a second distinct on `y` → exercises the same **cast collision** as the logical-plan `CAST(column)` tests. | |
| 69 | + |
| 70 | +### Note: `lower(b)` and `CAST` inside `COUNT(DISTINCT …)` (reviewer question) |
| 71 | + |
| 72 | +The rule only rewrites when each distinct aggregate is a **simple** `COUNT(DISTINCT expr)` where `expr` is one of: |
| 73 | + |
| 74 | +- a column, |
| 75 | +- `lower`/`upper` of one column, or |
| 76 | +- `CAST` of one column (non-volatile). |
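
The applicability conditions listed here and in the unit-test table can be summarised in a small toy model. This is only a sketch: the `Column`, `Call`, and `Agg` classes and both function names are hypothetical and do not mirror the Rust types in the rule. It also ignores the cast target type and the volatility check, and it treats nested calls such as `lower(upper(b))` as unsupported, which is my reading of "of one column" rather than something the tests confirm.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Call:
    """A single-argument call such as lower(b), upper(b) or CAST(b AS ...)."""
    func: str
    arg: object


@dataclass(frozen=True)
class Agg:
    func: str                 # e.g. "count", "sum"
    arg: Optional[object]     # None stands in for COUNT(*)
    distinct: bool = False


def is_supported_distinct_arg(expr) -> bool:
    """True for a bare column, or lower/upper/cast applied directly to a column."""
    if isinstance(expr, Column):
        return True
    return (
        isinstance(expr, Call)
        and expr.func in {"lower", "upper", "cast"}
        and isinstance(expr.arg, Column)
    )


def should_rewrite(aggs: Sequence[Agg], grouping_sets: bool = False) -> bool:
    """Toy version of the rule's gate, as read from the test descriptions."""
    if grouping_sets:                      # rule bails on GROUPING SETS
        return False
    distincts = [a for a in aggs if a.distinct]
    if len(distincts) < 2:                 # needs at least two distincts
        return False
    for a in distincts:
        if a.func != "count" or not is_supported_distinct_arg(a.arg):
            return False
    keys = [a.arg for a in distincts]
    return len(set(keys)) == len(keys)     # no duplicate distinct key


if __name__ == "__main__":
    b, c = Column("b"), Column("c")
    print(should_rewrite([Agg("count", b, True), Agg("count", c, True)]))
    print(should_rewrite([Agg("count", b, True), Agg("count", c)]))
```

Non-distinct aggregates are deliberately left out of the gate, matching the tests where `COUNT(a)` or `COUNT(*)` rides along in `mdc_base`.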
| 77 | + |
| 78 | +The rewrite **does not change SQL semantics**: distinct is computed on the **evaluated** values of that expression (so `'Abc'` and `'aBC'` under `lower(b)` collapse to one distinct; `1.2` and `1.3` under `CAST(x AS INT)` collapse to one distinct). The two SQL tests above lock that in end-to-end alongside the multi-distinct rewrite. |
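
The collapsing behaviour can be reproduced outside DataFusion. The sketch below uses SQLite purely as a stand-in engine (an assumption for illustration; the table `t`, its rows, and the `branch` helper are made up) and compares the direct `COUNT(DISTINCT expr)` result with a branch that de-duplicates on the evaluated expression, which is the property the rewrite has to preserve.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (g INTEGER, b TEXT, c TEXT, x REAL, y TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Abc", "c1", 1.2, "y1"),
        (1, "aBC", "c2", 1.3, "y2"),
    ],
)

# Direct form: distinct is taken over the evaluated expression values.
direct = conn.execute(
    """
    SELECT g,
           COUNT(DISTINCT lower(b)),
           COUNT(DISTINCT c),
           COUNT(DISTINCT CAST(x AS INTEGER)),
           COUNT(DISTINCT y)
    FROM t
    GROUP BY g
    """
).fetchall()


def branch(expr):
    """One sub-aggregate: de-duplicate (g, expr), drop NULLs, count per group."""
    sql = f"""
        SELECT g, COUNT(*)
        FROM (SELECT DISTINCT g, {expr} AS v FROM t WHERE {expr} IS NOT NULL) AS d
        GROUP BY g
    """
    return conn.execute(sql).fetchall()


lowered = branch("lower(b)")           # 'Abc' and 'aBC' collapse to 'abc'
casted = branch("CAST(x AS INTEGER)")  # 1.2 and 1.3 both cast to 1

print(direct)   # [(1, 1, 2, 1, 2)]
print(lowered)  # [(1, 1)]
print(casted)   # [(1, 1)]
```

The key point is that the branch filters and de-duplicates on the expression (`lower(b)`, `CAST(x AS INTEGER)`), not on the raw column; de-duplicating on raw `b` would report 2 instead of 1.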
| 79 | + |
| 80 | +--- |
| 81 | + |
| 82 | +Happy to align naming, tests, and placement with you and the maintainers. |