@@ -163,27 +163,32 @@ Suppose we want to find the speed values for only Pokemon that have low Attack v
163163 f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])
164164
165165
166- Building per-group arrays
167- ^^^^^^^^^^^^^^^^^^^^^^^^^
168-
169- :py:func:`~datafusion.functions.array_agg` collects the values within each
170- group into a list. Combined with ``distinct=True`` and the ``filter``
171- argument, it lets you ask two questions of the same group in one pass —
172- "what are all the values?" and "what are the values that satisfy some
173- condition?".
166+ Comparing subsets within a group
167+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
168+
169+ Sometimes you need to compare the full membership of a group against a
170+ subset that meets some condition — for example, "which groups have at least
171+ one failure, but not every member failed?". The ``filter`` argument on an
172+ aggregate restricts the rows that contribute to *that* aggregate without
173+ dropping the group, so a single pass can produce both the full set and the
174+ filtered subset side by side. Pairing
175+ :py:func:`~datafusion.functions.array_agg` with ``distinct=True`` and
176+ ``filter=`` is a compact way to express this: collect the distinct values
177+ of the group, collect the distinct values that satisfy the condition, then
178+ compare the two arrays.
174179
175180Suppose each row records a line item with the supplier that fulfilled it and
176181a flag for whether that supplier met the commit date. We want to identify
177- orders where exactly one supplier failed, among two or more suppliers in
178- total:
182+ *partially failed* orders (orders where at least one supplier failed but
183+ not every supplier failed):
179184
180185.. ipython:: python
181186
182187 orders_df = ctx.from_pydict(
183188 {
184- "order_id": [1, 1, 1, 2, 2, 3],
185- "supplier_id": [100, 101, 102, 200, 201, 300],
186- "failed": [False, True, False, False, False, True],
189+ "order_id": [1, 1, 1, 2, 2, 3, 4, 4],
190+ "supplier_id": [100, 101, 102, 200, 201, 300, 400, 401],
191+ "failed": [False, True, False, False, False, True, True, True],
187192 },
188193 )
189194
@@ -200,21 +205,13 @@ total:
200205 )
201206
202207 grouped.filter(
203- (f.array_length(col("failed_suppliers")) == lit(1))
204- & (f.array_length(col("all_suppliers")) > lit(1))
205- ).select(
206- col("order_id"),
207- f.array_element(col("failed_suppliers"), lit(1)).alias("the_one_bad_supplier"),
208- )
209-
210- Two aspects of the pattern are worth calling out:
208+ (f.array_length(col("failed_suppliers")) > lit(0))
209+ & (f.array_length(col("failed_suppliers")) < f.array_length(col("all_suppliers")))
210+ ).select(col("order_id"), col("failed_suppliers"))
211211
212- - ``filter=`` on an aggregate narrows the rows contributing to *that*
213- aggregate only. Filtering the DataFrame before the aggregate would have
214- dropped whole groups that no longer had any rows.
215- - :py:func:`~datafusion.functions.array_length` tests group size without
216- another aggregate pass, and :py:func:`~datafusion.functions.array_element`
217- extracts a single value when you have proven the array has length one.
212+ Order 1 is partial (one of three suppliers failed). Order 2 is excluded
213+ because no supplier failed, order 3 because its only supplier failed, and
214+ order 4 because both of its suppliers failed.
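
As a cross-check on the expected result, the same partial-failure rule can be sketched in plain Python, without DataFusion. This is only an illustrative stand-in, not the DataFusion query itself: the names ``rows``, ``all_suppliers``, ``failed_suppliers``, and ``partially_failed`` are assumptions made for this sketch, and Python sets play the role of the distinct arrays.

```python
from collections import defaultdict

# Same sample rows as the example above: (order_id, supplier_id, failed).
rows = [
    (1, 100, False), (1, 101, True), (1, 102, False),
    (2, 200, False), (2, 201, False),
    (3, 300, True),
    (4, 400, True), (4, 401, True),
]

# Per order: every distinct supplier, and the distinct suppliers that failed.
all_suppliers = defaultdict(set)
failed_suppliers = defaultdict(set)
for order_id, supplier_id, failed in rows:
    all_suppliers[order_id].add(supplier_id)
    if failed:
        failed_suppliers[order_id].add(supplier_id)

# Keep orders with at least one failure but fewer failures than suppliers.
partially_failed = {
    order_id: sorted(failed_suppliers.get(order_id, set()))
    for order_id in sorted(all_suppliers)
    if 0 < len(failed_suppliers.get(order_id, set())) < len(all_suppliers[order_id])
}

print(partially_failed)  # {1: [101]}
```

Only order 1 survives the comparison, which matches the outcome described above for orders 2, 3, and 4.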
218215
219216Grouping Sets
220217-------------