Skip to content

Commit 4321c50

Browse files
yashydgandhi
authored andcommitted
test: SQL coverage for COUNT(DISTINCT lower/CAST) with multi-distinct rewrite
✅ End-to-end: COUNT(DISTINCT lower(b)) with 'Abc'/'aBC' plus second distinct on c (case collapse = 1). ✅ End-to-end: COUNT(DISTINCT CAST(x AS INT)) with 1.2/1.3 vs second CAST distinct on y (int collision = 1). ✅ docs: REPLY_PR_20940_DANDANDAN.md — integration table rows + note on safe distinct args (lower/upper/CAST). ❌ No optimizer or engine behavior change; asserts semantics match non-rewritten aggregation. Made-with: Cursor
1 parent edab328 commit 4321c50

2 files changed

Lines changed: 157 additions & 1 deletion

File tree

datafusion/core/tests/sql/aggregates/multi_distinct_count_rewrite.rs

Lines changed: 75 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818
//! End-to-end SQL tests for the multi-`COUNT(DISTINCT)` logical optimizer rewrite.
1919
2020
use super::*;
21-
use arrow::array::{Int32Array, StringArray};
21+
use arrow::array::{Float64Array, Int32Array, StringArray};
2222
use datafusion::common::test_util::batches_to_sort_string;
2323
use datafusion_catalog::MemTable;
2424

@@ -131,3 +131,77 @@ async fn multi_count_distinct_two_group_keys_matches_expected() -> Result<()> {
131131
);
132132
Ok(())
133133
}
134+
135+
/// `COUNT(DISTINCT lower(b))` with `'Abc'` / `'aBC'`: distinct is on the **lowered** value (one bucket).
136+
/// Two `COUNT(DISTINCT …)` so the rewrite applies; semantics match plain aggregation.
137+
#[tokio::test]
138+
async fn multi_count_distinct_lower_matches_expected_case_collapsing() -> Result<()> {
139+
let ctx = SessionContext::new();
140+
let schema = Arc::new(Schema::new(vec![
141+
Field::new("g", DataType::Int32, false),
142+
Field::new("b", DataType::Utf8, false),
143+
Field::new("c", DataType::Utf8, false),
144+
]));
145+
let batch = RecordBatch::try_new(
146+
schema.clone(),
147+
vec![
148+
Arc::new(Int32Array::from(vec![1, 1])),
149+
Arc::new(StringArray::from(vec!["Abc", "aBC"])),
150+
Arc::new(StringArray::from(vec!["x", "y"])),
151+
],
152+
)?;
153+
let provider = MemTable::try_new(schema, vec![vec![batch]])?;
154+
ctx.register_table("t", Arc::new(provider))?;
155+
156+
let sql = "SELECT g, COUNT(DISTINCT lower(b)) AS lb, COUNT(DISTINCT c) AS cc \
157+
FROM t GROUP BY g";
158+
let batches = ctx.sql(sql).await?.collect().await?;
159+
let out = batches_to_sort_string(&batches);
160+
161+
assert_eq!(
162+
out,
163+
"+---+----+----+\n\
164+
| g | lb | cc |\n\
165+
+---+----+----+\n\
166+
| 1 | 1 | 2 |\n\
167+
+---+----+----+"
168+
);
169+
Ok(())
170+
}
171+
172+
/// `COUNT(DISTINCT CAST(x AS INT))` with `1.2` and `1.3`: both truncate to `1` → one distinct.
173+
/// Exercises the same “expression in distinct, not raw column” path as `CAST` in the rule.
174+
#[tokio::test]
175+
async fn multi_count_distinct_cast_float_to_int_collapses_nearby_values() -> Result<()> {
176+
let ctx = SessionContext::new();
177+
let schema = Arc::new(Schema::new(vec![
178+
Field::new("g", DataType::Int32, false),
179+
Field::new("x", DataType::Float64, false),
180+
Field::new("y", DataType::Float64, false),
181+
]));
182+
let batch = RecordBatch::try_new(
183+
schema.clone(),
184+
vec![
185+
Arc::new(Int32Array::from(vec![1, 1])),
186+
Arc::new(Float64Array::from(vec![1.2, 1.3])),
187+
Arc::new(Float64Array::from(vec![10.0, 20.0])),
188+
],
189+
)?;
190+
let provider = MemTable::try_new(schema, vec![vec![batch]])?;
191+
ctx.register_table("t", Arc::new(provider))?;
192+
193+
let sql = "SELECT g, COUNT(DISTINCT CAST(x AS INT)) AS cx, COUNT(DISTINCT CAST(y AS INT)) AS cy \
194+
FROM t GROUP BY g";
195+
let batches = ctx.sql(sql).await?.collect().await?;
196+
let out = batches_to_sort_string(&batches);
197+
198+
assert_eq!(
199+
out,
200+
"+---+----+----+\n\
201+
| g | cx | cy |\n\
202+
+---+----+----+\n\
203+
| 1 | 1 | 2 |\n\
204+
+---+----+----+"
205+
);
206+
Ok(())
207+
}
Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
# Draft reply: PR #20940 (`MultiDistinctToCrossJoin`) vs `MultiDistinctCountRewrite`
2+
3+
Paste into [apache/datafusion#20940](https://github.com/apache/datafusion/pull/20940) as a comment (GitHub-flavored Markdown).
4+
5+
---
6+
7+
Hi @Dandandan — thanks for this work; the cross-join split for **multiple distinct aggregates with no `GROUP BY`** is a strong fit for workloads like ClickBench extended.
8+
9+
I’ve been working on a related but **different** pattern: **`GROUP BY` + several `COUNT(DISTINCT …)`** in the same aggregate (typical BI). In that situation, your rule **does not apply**, because `MultiDistinctToCrossJoin` needs an **empty** `GROUP BY` and **all** aggregates to be distinct on different columns.
10+
11+
A concrete example from our benchmark suite (**category `08_complex_analytical`, query `Q8.3`**) on an `orders_data` table:
12+
13+
```sql
14+
-- Q8.3: Seller performance analysis
15+
SELECT
16+
seller_name,
17+
COUNT(*) as total_orders,
18+
COUNT(DISTINCT delivery_city) as cities_served,
19+
COUNT(DISTINCT state) as states_served,
20+
SUM(CASE WHEN order_status = 'Completed' THEN 1 ELSE 0 END) as completed_orders,
21+
SUM(CASE WHEN order_status = 'Cancelled' THEN 1 ELSE 0 END) as cancelled_orders,
22+
ROUND(100.0 * SUM(CASE WHEN order_status = 'Completed' THEN 1 ELSE 0 END) / COUNT(*), 2) as success_rate
23+
FROM orders_data
24+
GROUP BY seller_name
25+
HAVING COUNT(*) > 100
26+
ORDER BY total_orders DESC
27+
LIMIT 100;
28+
```
29+
30+
This is **not** “global” multi-distinct: it’s **per `seller_name`**, with **multiple `COUNT(DISTINCT …)`** plus other aggregates. That’s the class my optimizer rule (`MultiDistinctCountRewrite`) targets — rewriting the **`COUNT(DISTINCT …)`** pieces into **joinable sub-aggregates aligned on the same `GROUP BY` keys**, with correct `NULL` handling where needed.
31+
32+
So in simple terms:
33+
34+
| | **Your PR (`MultiDistinctToCrossJoin`)** | **My work (`MultiDistinctCountRewrite`)** |
35+
|---|------------------------------------------|-------------------------------------------|
36+
| **Typical SQL** | `SELECT COUNT(DISTINCT a), COUNT(DISTINCT b) FROM t` (no `GROUP BY`) | `SELECT …, COUNT(DISTINCT x), COUNT(DISTINCT y), … FROM t GROUP BY …` |
37+
| **Example workload** | ClickBench extended–style **Q0 / Q1** | Our **Q8.3** (and similar grouped BI queries) |
38+
39+
They’re **complementary**: different predicates, different plans, and they can **coexist** in the optimizer pipeline (we’d want to sanity-check rule order so we don’t double-rewrite the same node).
40+
41+
---
42+
43+
## Tests for `MultiDistinctCountRewrite` (what they cover)
44+
45+
### Optimizer unit tests — `datafusion/optimizer/src/multi_distinct_count_rewrite.rs`
46+
47+
| Test | What it asserts |
48+
|------|-----------------|
49+
| `rewrites_two_count_distinct` | `GROUP BY a` + `COUNT(DISTINCT b)`, `COUNT(DISTINCT c)` → inner joins, per-branch null filters on `b`/`c`, `mdc_base` + two `mdc_d` aliases. |
50+
| `rewrites_global_three_count_distinct` | No `GROUP BY`, three `COUNT(DISTINCT …)` → cross/inner join rewrite; **no** `mdc_base` (global-only path). |
51+
| `rewrites_two_count_distinct_with_non_distinct_count` | Grouped BI-style: two distincts + `COUNT(a)` → join rewrite with **`mdc_base`** holding the non-distinct agg. |
52+
| `does_not_rewrite_two_count_distinct_same_column` | Two `COUNT(DISTINCT b)` with different aliases → **no** rewrite (duplicate distinct key). |
53+
| `does_not_rewrite_single_count_distinct` | Only one `COUNT(DISTINCT …)`**no** rewrite (rule needs ≥2 distincts). |
54+
| `rewrites_three_count_distinct_grouped` | Three grouped `COUNT(DISTINCT …)` on `b`, `c`, `a`**two** inner joins + `mdc_base`. |
55+
| `rewrites_interleaved_non_distinct_between_distincts` | Order `COUNT(DISTINCT b)`, `COUNT(a)`, `COUNT(DISTINCT c)` → rewrite + `mdc_base` for the middle non-distinct agg (projection order / interleaving). |
56+
| `rewrites_count_distinct_on_cast_exprs` | `COUNT(DISTINCT CAST(b AS Int64))`, same for `c` → rewrite + null filters on the **cast** expressions. |
57+
| `does_not_rewrite_grouping_sets_multi_distinct` | `GROUPING SETS` aggregate with two `COUNT(DISTINCT …)`**no** rewrite (rule bails on grouping sets). |
58+
| `does_not_rewrite_mixed_agg` | `COUNT(DISTINCT b)` + `COUNT(c)`**no** rewrite (only **one** `COUNT(DISTINCT …)`; rule requires at least two). |
59+
60+
### SQL integration — `datafusion/core/tests/sql/aggregates/multi_distinct_count_rewrite.rs`
61+
62+
| Test | What it asserts |
63+
|------|-----------------|
64+
| `multi_count_distinct_matches_expected_with_nulls` | End-to-end grouped two `COUNT(DISTINCT …)` with **NULLs** in distinct columns; exact sorted batch string vs expected counts. |
65+
| `multi_count_distinct_with_count_star_matches_expected` | `COUNT(*)` plus two `COUNT(DISTINCT …)` per group (BI-style); exact result table. |
66+
| `multi_count_distinct_two_group_keys_matches_expected` | **`GROUP BY g1, g2`** + two distincts; verifies joins line up on **all** group keys and numerics match. |
67+
| `multi_count_distinct_lower_matches_expected_case_collapsing` | `COUNT(DISTINCT lower(b))` with `'Abc'` / `'aBC'` plus a second distinct on `c`**one** distinct lowered value, **two** raw `c` values (semantics follow the expression inside `COUNT(DISTINCT …)`, not raw `b`). |
68+
| `multi_count_distinct_cast_float_to_int_collapses_nearby_values` | `COUNT(DISTINCT CAST(x AS INT))` with `1.2` / `1.3` (both → `1`) vs a second distinct on `y` → exercises **cast collision** the same way as the logical-plan `CAST(column)` tests. |
69+
70+
### Note: `lower(b)` and `CAST` inside `COUNT(DISTINCT …)` (reviewer question)
71+
72+
The rule only rewrites when each distinct aggregate is a **simple** `COUNT(DISTINCT expr)` with `expr` that is:
73+
74+
- a column,
75+
- `lower`/`upper` of one column, or
76+
- `CAST` of one column (non-volatile).
77+
78+
The rewrite **does not change SQL semantics**: distinct is computed on the **evaluated** values of that expression (so `'Abc'` and `'aBC'` under `lower(b)` collapse to one distinct; `1.2` and `1.3` under `CAST(x AS INT)` collapse to one distinct). The two SQL tests above lock that in end-to-end alongside the multi-distinct rewrite.
79+
80+
---
81+
82+
Happy to align naming, tests, and placement with you and the maintainers.

0 commit comments

Comments
 (0)