Commit f997169

alamb, drin, 2010YOUY01, and sdf-jkl authored
Improve documentation for ScalarUDFImpl::preimage (#20008)
## Which issue does this PR close?

- Related to #18319 (comment)

## Rationale for this change

As @sdf-jkl, @drin, @2010YOUY01, and I have been discussing in #18319 (comment), the new `preimage` API is somewhat subtle. @sdf-jkl did a great job with the initial documentation, but let's try to reduce the cognitive load slightly.

## What changes are included in this PR?

Rework the documentation to make it easier to understand.

## Are these changes tested?

Checked by doc CI, and I rendered it locally:

<img width="1060" height="858" alt="Screenshot 2026-01-26 at 8 06 57 AM" src="https://github.com/user-attachments/assets/27bcaea8-833e-4c0c-9bc6-c9656e15641d" />

## Are there any user-facing changes?

---------

Co-authored-by: Aldrin M <octalene.dev@pm.me>
Co-authored-by: Yongting You <2010youy01@gmail.com>
Co-authored-by: Kosta Tarasov <33369833+sdf-jkl@users.noreply.github.com>
1 parent ed0a060 commit f997169

2 files changed

Lines changed: 90 additions & 12 deletions

File tree

datafusion/expr/src/udf.rs

Lines changed: 90 additions & 11 deletions
@@ -709,22 +709,101 @@ pub trait ScalarUDFImpl: Debug + DynEq + DynHash + Send + Sync {
709709
Ok(ExprSimplifyResult::Original(args))
710710
}
711711

712-
/// Returns the [preimage] for this function and the specified scalar value, if any.
712+
/// Returns a single contiguous preimage for this function and the specified
713+
/// scalar expression, if any.
714+
///
715+
/// Currently only applies to `=, !=, >, >=, <, <=, is distinct from, is not distinct from` predicates.
716+
/// # Return Value
717+
///
718+
/// Implementations should return a half-open interval: inclusive lower
719+
/// bound and exclusive upper bound. This is slightly different from normal
720+
/// [`Interval`] semantics where the upper bound is closed (inclusive).
721+
/// Typically this means the upper endpoint must be adjusted to the next
722+
/// value not included in the preimage. See the Half-Open Intervals section
723+
/// below for more details.
724+
///
725+
/// # Background
726+
///
727+
/// Inspired by the [ClickHouse Paper], a "preimage rewrite" transforms a
728+
/// predicate containing a function call into a predicate containing an
729+
/// equivalent set of input literal (constant) values. The resulting
730+
/// predicate can often be further optimized by other rewrites (see
731+
/// Examples).
732+
///
733+
/// From the paper:
734+
///
735+
/// > some functions can compute the preimage of a given function result.
736+
/// > This is used to replace comparisons of constants with function calls
737+
/// > on the key columns by comparing the key column value with the preimage.
738+
/// > For example, `toYear(k) = 2024` can be replaced by
739+
/// > `k >= 2024-01-01 && k < 2025-01-01`
740+
///
741+
/// For example, given an expression like
742+
/// ```sql
743+
/// date_part('YEAR', k) = 2024
744+
/// ```
745+
///
746+
/// The interval `[2024-01-01, 2024-12-31]` contains all possible input
747+
/// values (preimage values) for which the function `date_part(YEAR, k)`
748+
/// produces the output value `2024` (image value). Returning the interval
749+
/// (note upper bound adjusted up) `[2024-01-01, 2025-01-01)`, the expression
750+
/// can be rewritten to
751+
///
752+
/// ```sql
753+
/// k >= '2024-01-01' AND k < '2025-01-01'
754+
/// ```
755+
///
756+
/// which is a simpler and more canonical form, making it easier for other
757+
/// optimizer passes to recognize and apply further transformations.
758+
///
759+
/// # Examples
713760
///
714-
/// A preimage is a single contiguous [`Interval`] of values where the function
715-
/// will always return `lit_value`
761+
/// Case 1:
716762
///
717-
/// Implementations should return intervals with an inclusive lower bound and
718-
/// exclusive upper bound.
763+
/// Original:
764+
/// ```sql
765+
/// date_part('YEAR', k) = 2024 AND k >= '2024-06-01'
766+
/// ```
767+
///
768+
/// After preimage rewrite:
769+
/// ```sql
770+
/// k >= '2024-01-01' AND k < '2025-01-01' AND k >= '2024-06-01'
771+
/// ```
719772
///
720-
/// This rewrite is described in the [ClickHouse Paper] and is particularly
721-
/// useful for simplifying expressions `date_part` or equivalent functions. The
722-
/// idea is that if you have an expression like `date_part(YEAR, k) = 2024` and you
723-
/// can find a [preimage] for `date_part(YEAR, k)`, which is the range of dates
724-
/// covering the entire year of 2024. Thus, you can rewrite the expression to `k
725-
/// >= '2024-01-01' AND k < '2025-01-01' which is often more optimizable.
773+
/// Since this form is much simpler, the optimizer can combine and simplify
774+
/// sub-expressions further into:
775+
/// ```sql
776+
/// k >= '2024-06-01' AND k < '2025-01-01'
777+
/// ```
778+
///
779+
/// Case 2:
726780
///
781+
/// For min/max pruning, simpler predicates such as:
782+
/// ```sql
783+
/// k >= '2024-01-01' AND k < '2025-01-01'
784+
/// ```
785+
/// are much easier for the pruner to reason about. See [PruningPredicate]
786+
/// for background on predicate pruning.
787+
///
788+
/// The trade-off with the preimage rewrite is that evaluating the rewritten
789+
/// form might be slightly more expensive than evaluating the original
790+
/// expression. In practice, this cost is usually outweighed by the more
791+
/// aggressive optimization opportunities it enables.
792+
///
793+
/// # Half-Open Intervals
794+
///
795+
/// The preimage API uses half-open intervals, which makes the rewrite
796+
/// easier to implement by avoiding calculations to adjust the upper bound.
797+
/// For example, if a function returns its input unchanged and the desired
798+
/// output is the single value `5`, a closed interval could be represented
799+
/// as `[5, 5]`, but then the rewrite would require adjusting the upper
800+
/// bound to `6` to create a proper range predicate. With a half-open
801+
/// interval, the same range is represented as `[5, 6)`, which already
802+
/// forms a valid predicate.
803+
///
804+
/// [PruningPredicate]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html
727805
/// [ClickHouse Paper]: https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf
806+
/// [image]: https://en.wikipedia.org/wiki/Image_(mathematics)#Image_of_an_element
728807
/// [preimage]: https://en.wikipedia.org/wiki/Image_(mathematics)#Inverse_image
729808
fn preimage(
730809
&self,
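
The rewrite described in the new doc comment can be sketched without any DataFusion types. The following is a minimal illustration only: `Cmp`, `year_preimage`, and `rewrite` are hypothetical names that do not exist in DataFusion, dates are plain ISO-8601 strings, and the operator mapping assumes the function is monotonically non-decreasing and that `k` is not `NULL`. The actual logic lives in `rewrite_with_preimage` and may differ.

```rust
/// Comparison operators covered by this sketch.
/// Hypothetical enum, not DataFusion's `Operator`.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Cmp {
    Eq,
    NotEq,
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Half-open preimage `[lower, upper)` of `date_part('YEAR', k) = year`,
/// expressed as ISO-8601 date strings. The upper bound is the first date
/// that is NOT part of the preimage.
fn year_preimage(year: i32) -> (String, String) {
    (format!("{year:04}-01-01"), format!("{:04}-01-01", year + 1))
}

/// Rewrites `date_part('YEAR', col) <op> year` into a predicate on `col` alone.
/// Assumption: the function is monotonically non-decreasing and `col` is not NULL.
#[allow(dead_code)]
fn rewrite(col: &str, op: Cmp, year: i32) -> String {
    let (lo, hi) = year_preimage(year);
    match op {
        // Inside the preimage.
        Cmp::Eq => format!("{col} >= '{lo}' AND {col} < '{hi}'"),
        // Outside the preimage.
        Cmp::NotEq => format!("{col} < '{lo}' OR {col} >= '{hi}'"),
        // Strictly before the preimage starts.
        Cmp::Lt => format!("{col} < '{lo}'"),
        // Before the preimage ends (exclusive upper bound).
        Cmp::LtEq => format!("{col} < '{hi}'"),
        // At or after the exclusive upper bound.
        Cmp::Gt => format!("{col} >= '{hi}'"),
        // At or after the inclusive lower bound.
        Cmp::GtEq => format!("{col} >= '{lo}'"),
    }
}
```

Calling `rewrite("k", Cmp::Eq, 2024)` yields the same predicate as the doc comment's example, `k >= '2024-01-01' AND k < '2025-01-01'`. Because the upper bound is exclusive, every branch can reuse `lo` and `hi` directly without computing a "previous" or "next" date.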

datafusion/optimizer/src/simplify_expressions/udf_preimage.rs

Lines changed: 0 additions & 1 deletion
@@ -26,7 +26,6 @@ use datafusion_expr_common::interval_arithmetic::Interval;
2626
/// range for which it is valid) and `x` is not `NULL`
2727
///
2828
/// For details see [`datafusion_expr::ScalarUDFImpl::preimage`]
29-
///
3029
pub(super) fn rewrite_with_preimage(
3130
preimage_interval: Interval,
3231
op: Operator,
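
To illustrate why the half-open form composes well with other range predicates (Case 1 in the doc comment), here is a small sketch of interval intersection. `HalfOpen` is a hypothetical type over `i64` values (for example, days since the Unix epoch); it is not DataFusion's `Interval`, whose upper bound is inclusive.

```rust
/// A half-open interval `[lower, upper)` over integers.
/// Hypothetical type for illustration; not DataFusion's `Interval`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct HalfOpen {
    lower: i64, // inclusive
    upper: i64, // exclusive
}

#[allow(dead_code)]
impl HalfOpen {
    /// True when `v` satisfies `lower <= v AND v < upper`.
    fn contains(&self, v: i64) -> bool {
        self.lower <= v && v < self.upper
    }

    /// Intersection of two half-open intervals, or `None` when they do not overlap.
    fn intersect(&self, other: &HalfOpen) -> Option<HalfOpen> {
        let lower = self.lower.max(other.lower);
        let upper = self.upper.min(other.upper);
        (lower < upper).then_some(HalfOpen { lower, upper })
    }
}
```

With days since the epoch, the year 2024 is `[19723, 20089)` and `k >= '2024-06-01'` is `[19875, i64::MAX)`; their intersection is `[19875, 20089)`, which corresponds to `k >= '2024-06-01' AND k < '2025-01-01'`. Likewise, the preimage of `5` under an identity function is `HalfOpen { lower: 5, upper: 6 }`, which contains `5` but not `6`.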
