Commit d3054a1

timsaucer and claude committed
docs: clarify "When not to use a UDF" intro
Rewrite the opening of the section to make three things clearer: the contrast is with native DataFusion expressions (not Python in general), some predicates genuinely feel easier to write as a Python loop and that tension is worth acknowledging, and predicate pushdown is a table-provider mechanism rather than a Parquet-only feature. Parquet stays as the concrete demo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent cd3d0d3 commit d3054a1

1 file changed

Lines changed: 16 additions & 7 deletions

docs/source/user-guide/common-operations/udf-and-udfa.rst

@@ -105,13 +105,22 @@ When not to use a UDF
 ^^^^^^^^^^^^^^^^^^^^^
 
 A UDF is the right tool when the per-row computation genuinely cannot be
-expressed with built-in functions. It is often the *wrong* tool for a
-predicate that happens to be easier to write in Python. A UDF is opaque
-to the optimizer, which means filters expressed as UDFs lose several
-rewrites that the engine applies to filters built from native
-expressions. The most visible of these is **Parquet predicate pushdown**:
-a native predicate can prune entire row groups using the min/max
-statistics in the Parquet footer, while a UDF predicate cannot.
+expressed with DataFusion's built-in expressions. It is often the *wrong*
+tool for a predicate that *can* be written as an ``Expr`` tree but feels
+easier to write as a Python function — for example, a filter that keeps
+a row if it matches any one of several rule sets, where each rule set
+checks its own combination of columns (the worked example at the end of
+this section keeps a row when it matches any one of several brand-specific
+rules). Looping over the rules in Python and returning a boolean per row
+reads naturally and is tempting to wrap in a UDF, but a UDF is opaque to
+the optimizer: filters expressed as UDFs lose several rewrites that the
+engine applies to filters built from native expressions. The most visible
+of these is **predicate pushdown into the table provider**: a native
+predicate can be handed to the source so it skips data before it is read,
+while a UDF predicate cannot. The example below uses Parquet, where
+pushdown prunes whole row groups using the min/max statistics in the
+footer, but the same mechanism applies to any table provider that
+advertises filter support — including custom providers.
 
 The following example writes a small Parquet file, then filters it two
 ways: first with a native expression, then with a UDF that computes the
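
Editor's sketch (not part of the commit): the pruning argument in the added paragraph can be illustrated with a toy model in plain Python. This is not DataFusion's implementation and uses no DataFusion API; `RowGroup`, `GreaterThan`, and `scan` are hypothetical names invented for illustration. The assumption is only that a source can skip a chunk of data when it can inspect the predicate's structure and compare it against the chunk's min/max statistics, and that it cannot do so for an opaque callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group (hypothetical, for illustration)."""
    rows: List[int]

    @property
    def stats(self) -> Tuple[int, int]:
        # Min/max statistics, analogous to what a Parquet footer stores.
        return min(self.rows), max(self.rows)


@dataclass
class GreaterThan:
    """A 'native' predicate: its structure (value > threshold) is inspectable."""
    threshold: int

    def __call__(self, value: int) -> bool:
        return value > self.threshold


def scan(groups: List[RowGroup], predicate: Callable[[int], bool]):
    """Return (matching rows, number of row groups actually read)."""
    matched: List[int] = []
    groups_read = 0
    for group in groups:
        _, hi = group.stats
        # Pruning is only possible when the predicate's structure is visible.
        if isinstance(predicate, GreaterThan) and hi <= predicate.threshold:
            continue  # no row in this group can match, so skip it unread
        groups_read += 1
        matched.extend(v for v in group.rows if predicate(v))
    return matched, groups_read


groups = [RowGroup([1, 2, 3]), RowGroup([10, 11, 12]), RowGroup([100, 101, 102])]

# Same logical filter expressed two ways.
native_rows, native_read = scan(groups, GreaterThan(50))
opaque_rows, opaque_read = scan(groups, lambda v: v > 50)  # opaque, like a UDF
```

Both calls return the same rows, but the inspectable predicate lets the scan skip two of the three groups, while the opaque callable forces every group to be read.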
