fix: Prefer numeric in type coercion for comparisons (#20426)
## Which issue does this PR close?
- Closes #15161.
## Rationale for this change
In a comparison between a numeric column and a string literal (e.g.,
`WHERE int_col < '10'`), we previously coerced the numeric column to a
string type. The comparison was then performed lexicographically, which
produced incorrect query results.
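To make the failure mode concrete, here is a minimal standalone sketch in plain Rust (no DataFusion APIs). The helper names `lexicographic_less_than` and `numeric_less_than` are hypothetical and exist only for this illustration; they mimic the effect of the old and new coercion on `int_col < '10'` for a single value.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: mimics the old behavior by rendering the integer
/// as a string and comparing the two strings byte-wise.
fn lexicographic_less_than(int_value: i64, literal: &str) -> bool {
    int_value.to_string().as_str() < literal
}

/// Hypothetical helper: mimics the new behavior by parsing the literal as
/// an integer and comparing numerically. A non-numeric literal is an error.
fn numeric_less_than(int_value: i64, literal: &str) -> Result<bool, ParseIntError> {
    let parsed: i64 = literal.parse()?;
    Ok(int_value < parsed)
}

fn main() {
    // 9 < 10 numerically, but "9" sorts after "10" because '9' > '1'.
    println!("string compare:  {}", lexicographic_less_than(9, "10"));
    println!("numeric compare: {:?}", numeric_less_than(9, "10"));
}
```

With the string comparison, a row where `int_col` is 9 would be filtered out of `WHERE int_col < '10'` even though it should match.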
Instead, we split type coercion into two situations: type coercion for
comparisons (including `IN` lists, `BETWEEN`, and `CASE WHEN`), where we
want string->numeric coercion, and type coercion for places like `UNION`
or `CASE ... THEN/ELSE`, where DataFusion's traditional behavior has
been to tolerate type mismatches by coercing values to strings.
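The split can be pictured with a toy model. This is only a sketch under stated assumptions: `ToyType`, `toy_comparison_coercion`, and `toy_type_union_coercion` are hypothetical names invented for this example, not DataFusion's actual types or signatures, and the Int64/Float64 widening rule is an assumption added so the sketch is total.

```rust
/// Hypothetical, heavily simplified stand-in for a data type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyType {
    Int64,
    Float64,
    Utf8,
}

/// Toy rule for comparisons: when a string meets a numeric type,
/// the numeric type wins.
fn toy_comparison_coercion(lhs: ToyType, rhs: ToyType) -> ToyType {
    if lhs == rhs {
        return lhs;
    }
    match (lhs, rhs) {
        // The types differ, so `other` is always the numeric side here.
        (ToyType::Utf8, other) | (other, ToyType::Utf8) => other,
        // Assumption for this sketch: mixed Int64/Float64 widens to Float64.
        _ => ToyType::Float64,
    }
}

/// Toy rule for UNION-like contexts: when a string meets a numeric type,
/// the string type wins, so mismatched branches are tolerated.
fn toy_type_union_coercion(lhs: ToyType, rhs: ToyType) -> ToyType {
    if lhs == rhs {
        return lhs;
    }
    match (lhs, rhs) {
        (ToyType::Utf8, _) | (_, ToyType::Utf8) => ToyType::Utf8,
        // Same widening assumption as above.
        _ => ToyType::Float64,
    }
}

fn main() {
    let (a, b) = (ToyType::Int64, ToyType::Utf8);
    println!("comparison: {:?}", toy_comparison_coercion(a, b));
    println!("union:      {:?}", toy_type_union_coercion(a, b));
}
```

The only difference between the two toy functions is which side wins when a string meets a numeric type, which is the distinction this change introduces.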
Here is a (not necessarily exhaustive) summary of the behavioral changes
(old -> new):
```
Comparisons (=, <, >, etc.):
  float_col = '5'         : string (wrong: '5'!='5.0')  -> numeric
  int_col > '100'         : string (wrong: '325'<'100') -> numeric
  int_col = 'hello'       : string, always false        -> cast error
  int_col = ''            : string, always false        -> cast error
  int_col = '99.99'       : string, always false        -> cast error
  Dict(Int) = '5'         : string                      -> numeric
  REE(Int) = '5'          : string                      -> numeric
  struct(int)=struct(str) : int field to Utf8           -> str field to int
IN lists:
  float_col IN ('1.0')    : string (wrong: '1.0'!='1')  -> numeric
  str_col IN ('a', 1)     : coerce to Utf8              -> coerce to Int64
CASE:
  CASE str WHEN float     : coerce to Utf8              -> coerce to Float
LIKE / regex:
  Dict(Int) LIKE '%5%'    : coerce to Utf8              -> error (matches int)
  REE(Int) LIKE '%5%'     : coerce to Utf8              -> error (matches int)
  Dict(Int) ~ '5'         : coerce to Utf8              -> error (matches int)
  REE(Int) ~ '5'          : error (no REE)              -> error (REE added)
  REE(Utf8) ~ '5'         : error (no REE)              -> works (REE added)
```
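The "cast error" rows in the table above can be illustrated by analogy with Rust's own integer parsing, which also rejects non-numeric, empty, and fractional strings. This is a sketch only: `parse_int_literal` is a hypothetical helper, the error text is made up, and DataFusion's real cast kernel and error messages may differ.

```rust
/// Hypothetical helper: try to interpret a string literal as a 64-bit
/// integer, returning a descriptive error when that is not possible.
fn parse_int_literal(literal: &str) -> Result<i64, String> {
    literal
        .parse::<i64>()
        .map_err(|e| format!("cannot cast '{}' to Int64: {}", literal, e))
}

fn main() {
    // The last three literals mirror the "cast error" rows in the table.
    for literal in ["5", "hello", "", "99.99"] {
        println!("{:?} -> {:?}", literal, parse_int_literal(literal));
    }
}
```

Under the old behavior these comparisons silently evaluated to false; surfacing an error instead makes the type mismatch visible to the query author.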
## What changes are included in this PR?
* Update `comparison_coercion` to coerce strings to numerics
* Remove previous `comparison_coercion_numeric` function
* Add a new function, `type_union_coercion`, and use it when appropriate
* Add support for REE types with regexp operators (this was unsupported
for no good reason I can see)
* Add unit and SLT tests for new coercion behavior
* Update existing SLT tests for changes in coercion behavior
* Fix the ClickBench unparser tests to avoid comparing int fields with
non-numeric string literals
## Are these changes tested?
Yes. New tests added, existing tests pass.
## Are there any user-facing changes?
Yes, see table above. In most cases the new behavior should be more
sensible and less error-prone, but it will likely break some user code.
---------
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>