fix: string_to_array('', delim) returns empty array for PostgreSQL compatibility #21104
Merged
gabotechs merged 1 commit into apache:main on Mar 24, 2026
Rust's str::split() on an empty string always yields one empty-string
element, so "".split(",") produces [""]. The empty-delimiter branch
also unconditionally appended the (empty) string value. Both cases
now guard with !string.is_empty() to return a truly empty array,
matching PostgreSQL behavior.
Tests use cardinality() to unambiguously verify the result since
Arrow's text format renders [""] identically to [].
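The `str::split()` behavior described above can be checked in isolation. The following is a minimal sketch, not the DataFusion code: `split_pieces` and `split_pieces_guarded` are hypothetical helper names introduced only for illustration.

```rust
/// Collects the pieces produced by `str::split` into owned strings.
/// (Hypothetical helper, not part of DataFusion.)
fn split_pieces(input: &str, delimiter: &str) -> Vec<String> {
    input.split(delimiter).map(str::to_string).collect()
}

/// Same split, but an empty input yields no pieces at all,
/// mirroring the `!string.is_empty()` guard from the commit message.
fn split_pieces_guarded(input: &str, delimiter: &str) -> Vec<String> {
    if input.is_empty() {
        Vec::new()
    } else {
        split_pieces(input, delimiter)
    }
}

fn main() {
    // Splitting an empty string still produces one empty piece.
    assert_eq!(split_pieces("", ","), vec![String::new()]);
    // With the guard, the same input produces zero pieces.
    assert!(split_pieces_guarded("", ",").is_empty());
    // Non-empty input is unaffected by the guard.
    assert_eq!(
        split_pieces_guarded("a,b", ","),
        vec!["a".to_string(), "b".to_string()]
    );
}
```

Checking the length rather than the printed form follows the same reasoning as the `cardinality()` tests: a one-element vector holding an empty string and an empty vector can look alike when rendered without quotes.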
LiaCastaneda
approved these changes
Mar 23, 2026
Contributor
LiaCastaneda
left a comment
This makes sense to me
gabotechs
approved these changes
Mar 23, 2026
Contributor
Looks good. Thanks @dd-david-levin for the PR and @LiaCastaneda for the review! I'll leave this here until tomorrow in case someone else wants to chime in.
de-bgunter
pushed a commit
to de-bgunter/datafusion
that referenced
this pull request
Mar 24, 2026
…mpatibility (apache#21104)

## Problem

`string_to_array` was returning incorrect results for empty string input, both when the delimiter is non-empty and when the delimiter is itself an empty string. This diverges from PostgreSQL behavior.

| Query | DataFusion (before) | PostgreSQL (expected) |
|---|---|---|
| `string_to_array('', ',')` | `['']` | `{}` |
| `string_to_array('', '')` | `['']` | `{}` |
| `string_to_array('', ',', 'x')` | `['']` | `{}` |
| `string_to_array('', '', 'x')` | `['']` | `{}` |

Results from datafusion-cli:

<img width="1435" height="104" alt="Screenshot 2026-03-23 at 9 14 08 AM" src="https://github.com/user-attachments/assets/2eaae366-7f8a-4220-87d2-f0b311c26712" />

**Root cause:** Rust's `str::split()` on an empty string always yields one empty-string element, so `"".split(",")` produces `[""]`. Additionally, the empty-delimiter branch unconditionally appended the (empty) string value. This is subtle because Arrow's text display format appears not to quote strings, so `[""]` renders as `[]`, indistinguishable from an actual empty array. Using `cardinality()` reveals the current length is 1, not 0.

**PostgreSQL reference:** [db-fiddle](https://www.db-fiddle.com/f/oCF8EPaZFkDNKSg28rVVWy/3)

## Fix

In `datafusion/functions-nested/src/string.rs`:

- **Non-empty delimiter** `(Some(string), Some(delimiter))`: added `if !string.is_empty()` guard to skip splitting when input is empty.
- **Empty delimiter** `(Some(string), Some(""))`: added `if !string.is_empty()` guard so the string value is only appended when non-empty.

Both the plain variant and the `null_value` variant are fixed (4 checks total).

## Tests

Added sqllogictest cases in `datafusion/sqllogictest/test_files/array.slt` using `cardinality()` to unambiguously verify the arrays are truly empty (not just displaying as empty):

```sql
SELECT cardinality(string_to_array('', ','))       -- 0
SELECT cardinality(string_to_array('', ''))        -- 0
SELECT cardinality(string_to_array('', ',', 'x'))  -- 0
SELECT cardinality(string_to_array('', '', 'x'))   -- 0
```

Each test covers one of the four `is_empty` guard checks. All four fail without the fix (returning 1 instead of 0).
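As a rough illustration of the guards listed under Fix, the per-row logic might be sketched as below. This is an assumption-laden stand-in, not the code in `datafusion/functions-nested/src/string.rs`: the function name `string_to_array_sketch`, its signature, and the use of `Vec<Option<String>>` in place of Arrow list builders are all hypothetical, and the four separate guards are condensed into one check for brevity.

```rust
/// Hypothetical per-row stand-in for `string_to_array`; not DataFusion's code.
/// `None` in the result models a SQL NULL element.
fn string_to_array_sketch(
    string: &str,
    delimiter: &str,
    null_value: Option<&str>,
) -> Vec<Option<String>> {
    // An element equal to `null_value` becomes NULL (None).
    let to_elem = |s: &str| {
        if Some(s) == null_value {
            None
        } else {
            Some(s.to_string())
        }
    };

    let mut out = Vec::new();
    // The guard: an empty input contributes no elements in any branch.
    if !string.is_empty() {
        if delimiter.is_empty() {
            // Empty delimiter: the whole string is a single element.
            out.push(to_elem(string));
        } else {
            // Non-empty delimiter: one element per split piece.
            out.extend(string.split(delimiter).map(&to_elem));
        }
    }
    out
}

fn main() {
    // The four cases from the table all yield an empty array.
    assert!(string_to_array_sketch("", ",", None).is_empty());
    assert!(string_to_array_sketch("", "", None).is_empty());
    assert!(string_to_array_sketch("", ",", Some("x")).is_empty());
    assert!(string_to_array_sketch("", "", Some("x")).is_empty());

    // Non-empty input still splits, and `null_value` matches map to None.
    assert_eq!(
        string_to_array_sketch("a,x", ",", Some("x")),
        vec![Some("a".to_string()), None]
    );
}
```

The sketch keeps the empty-delimiter case as "append the whole string", which is what the Fix section describes for that branch; only the empty-input behavior differs from the pre-fix logic.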
dd-david-levin
added a commit
to dd-david-levin/datafusion
that referenced
this pull request
Mar 25, 2026
…mpatibility (apache#21104) (cherry picked from commit cdaecf0)
dd-david-levin
added a commit
to dd-david-levin/datafusion
that referenced
this pull request
Mar 26, 2026
…mpatibility (apache#21104) (cherry picked from commit cdaecf0)
gh-worker-dd-mergequeue-cf854d bot
added a commit
to DataDog/datafusion
that referenced
this pull request
Mar 26, 2026
…ache-pr-21104-20260325 Cherry-pick apache#21104 Co-authored-by: dd-david-levin <david.levin@datadoghq.com>
gabotechs
pushed a commit
to DataDog/datafusion
that referenced
this pull request
Apr 16, 2026
…mpatibility (apache#21104) (cherry picked from commit cdaecf0)
gabotechs
pushed a commit
to DataDog/datafusion
that referenced
this pull request
Apr 16, 2026
…mpatibility (apache#21104) (cherry picked from commit cdaecf0)