Commit cd3d0d3

timsaucer and claude committed
docs: rework "subsets within a group" aggregation example

Rename the section from "Building per-group arrays" to "Comparing subsets within a group" so the heading matches the content. Rewrite the intro to lead with the problem (compare full group vs filtered subset), reframe the worked example around partially failed orders, and replace the trailing bullet list with a one-line walkthrough of the result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent bd54032 commit cd3d0d3

1 file changed

Lines changed: 24 additions & 27 deletions

docs/source/user-guide/common-operations/aggregations.rst

@@ -163,27 +163,32 @@ Suppose we want to find the speed values for only Pokemon that have low Attack v
         f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])
 
 
-Building per-group arrays
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-:py:func:`~datafusion.functions.array_agg` collects the values within each
-group into a list. Combined with ``distinct=True`` and the ``filter``
-argument, it lets you ask two questions of the same group in one pass —
-"what are all the values?" and "what are the values that satisfy some
-condition?".
+Comparing subsets within a group
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sometimes you need to compare the full membership of a group against a
+subset that meets some condition — for example, "which groups have at least
+one failure, but not every member failed?". The ``filter`` argument on an
+aggregate restricts the rows that contribute to *that* aggregate without
+dropping the group, so a single pass can produce both the full set and the
+filtered subset side by side. Pairing
+:py:func:`~datafusion.functions.array_agg` with ``distinct=True`` and
+``filter=`` is a compact way to express this: collect the distinct values
+of the group, collect the distinct values that satisfy the condition, then
+compare the two arrays.
 
 Suppose each row records a line item with the supplier that fulfilled it and
 a flag for whether that supplier met the commit date. We want to identify
-orders where exactly one supplier failed, among two or more suppliers in
-total:
+*partially failed* orders — orders where at least one supplier failed but
+not every supplier failed:
 
 .. ipython:: python
 
     orders_df = ctx.from_pydict(
         {
-            "order_id": [1, 1, 1, 2, 2, 3],
-            "supplier_id": [100, 101, 102, 200, 201, 300],
-            "failed": [False, True, False, False, False, True],
+            "order_id": [1, 1, 1, 2, 2, 3, 4, 4],
+            "supplier_id": [100, 101, 102, 200, 201, 300, 400, 401],
+            "failed": [False, True, False, False, False, True, True, True],
         },
     )
 
@@ -200,21 +205,13 @@ total:
     )
 
     grouped.filter(
-        (f.array_length(col("failed_suppliers")) == lit(1))
-        & (f.array_length(col("all_suppliers")) > lit(1))
-    ).select(
-        col("order_id"),
-        f.array_element(col("failed_suppliers"), lit(1)).alias("the_one_bad_supplier"),
-    )
-
-Two aspects of the pattern are worth calling out:
+        (f.array_length(col("failed_suppliers")) > lit(0))
+        & (f.array_length(col("failed_suppliers")) < f.array_length(col("all_suppliers")))
+    ).select(col("order_id"), col("failed_suppliers"))
 
-- ``filter=`` on an aggregate narrows the rows contributing to *that*
-  aggregate only. Filtering the DataFrame before the aggregate would have
-  dropped whole groups that no longer had any rows.
-- :py:func:`~datafusion.functions.array_length` tests group size without
-  another aggregate pass, and :py:func:`~datafusion.functions.array_element`
-  extracts a single value when you have proven the array has length one.
+Order 1 is partial (one of three suppliers failed). Order 2 is excluded
+because no supplier failed, order 3 because its only supplier failed, and
+order 4 because both of its suppliers failed.
 
 Grouping Sets
 -------------
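
The walkthrough added at the end of the new section (order 1 partial; orders 2, 3 and 4 excluded) can be cross-checked without DataFusion. The following is only a plain-Python sketch of the same condition, not the code from the docs: the helper name `partially_failed` is hypothetical, and Python sets stand in for `array_agg` with `distinct=True` and `filter=`.

```python
from collections import defaultdict

# Same rows as the new example in the diff.
order_id = [1, 1, 1, 2, 2, 3, 4, 4]
supplier_id = [100, 101, 102, 200, 201, 300, 400, 401]
failed = [False, True, False, False, False, True, True, True]


def partially_failed(order_ids, supplier_ids, failed_flags):
    """Hypothetical helper: map order_id -> sorted failed suppliers,
    keeping only orders where some, but not all, suppliers failed."""
    all_suppliers = defaultdict(set)     # stands in for the unfiltered distinct aggregate
    failed_suppliers = defaultdict(set)  # stands in for the filtered distinct aggregate
    for oid, sid, is_failed in zip(order_ids, supplier_ids, failed_flags):
        all_suppliers[oid].add(sid)
        if is_failed:
            failed_suppliers[oid].add(sid)
    # Keep a group only when 0 < |failed| < |all|, mirroring the two
    # array_length comparisons in the new grouped.filter(...) call.
    return {
        oid: sorted(failed_suppliers[oid])
        for oid in all_suppliers
        if 0 < len(failed_suppliers[oid]) < len(all_suppliers[oid])
    }


result = partially_failed(order_id, supplier_id, failed)
print(result)  # {1: [101]}
```

Only order 1 survives, which agrees with the one-line walkthrough in the diff. The sketch treats an order with no failures as an empty set; how the DataFusion version represents that case (for example as a null array) is not shown in this diff.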
