fix: string_to_array('', delim) returns empty array for PostgreSQL compatibility #21104

Merged
gabotechs merged 1 commit into apache:main from dd-david-levin:fix/string-to-array-empty-string
Mar 24, 2026

@dd-david-levin (Contributor) commented Mar 22, 2026

Problem

string_to_array was returning incorrect results for empty string input, both when the delimiter is non-empty and when the delimiter is itself an empty string. This diverges from PostgreSQL behavior.

| Query | DataFusion (before) | PostgreSQL (expected) |
|---|---|---|
| `string_to_array('', ',')` | `['']` | `{}` |
| `string_to_array('', '')` | `['']` | `{}` |
| `string_to_array('', ',', 'x')` | `['']` | `{}` |
| `string_to_array('', '', 'x')` | `['']` | `{}` |

Results from datafusion-cli: [screenshot, 2026-03-23 9:14:08 AM] (https://github.com/user-attachments/assets/2eaae366-7f8a-4220-87d2-f0b311c26712)

Root cause: Rust's str::split() on an empty string with a non-empty delimiter yields one empty-string element, so "".split(",") produces [""]. Additionally, the empty-delimiter branch unconditionally appended the (empty) string value. The bug is easy to miss because Arrow's text display format appears not to quote strings, so [""] renders as [], indistinguishable from an actual empty array. Using cardinality() reveals that the length before the fix was 1, not 0.
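The `str::split()` behavior described above can be reproduced outside DataFusion. This is a minimal sketch; `naive_split` is a hypothetical helper used only for illustration and is not a function in the DataFusion codebase.

```rust
/// Hypothetical helper: a bare `str::split` with no special case for
/// empty input, which is the behavior the root cause describes.
fn naive_split<'a>(input: &'a str, delimiter: &str) -> Vec<&'a str> {
    input.split(delimiter).collect()
}

fn main() {
    // Empty input with a non-empty delimiter still yields one element:
    // the empty string itself.
    let parts = naive_split("", ",");
    println!("{:?} has length {}", parts, parts.len()); // [""] has length 1

    // Non-empty input behaves as expected.
    let parts = naive_split("a,b", ",");
    println!("{:?} has length {}", parts, parts.len()); // ["a", "b"] has length 2
}
```

The length of 1 for empty input is what `cardinality()` surfaced before the fix.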

PostgreSQL reference: db-fiddle (https://www.db-fiddle.com/f/oCF8EPaZFkDNKSg28rVVWy/3)

Fix

In datafusion/functions-nested/src/string.rs:

  • Non-empty delimiter (Some(string), Some(delimiter)): added if !string.is_empty() guard to skip splitting when input is empty.
  • Empty delimiter (Some(string), Some("")): added if !string.is_empty() guard so the string value is only appended when non-empty.

Both the plain variant and the null_value variant are fixed (4 checks total).
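The four guards can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in string.rs: `split_plain` and `split_with_null` are hypothetical names, they return a `Vec` rather than appending to Arrow builders, NULL inputs are left out, and mapping elements equal to `null_value` to `None` is assumed to stand in for SQL NULL.

```rust
/// Plain variant (hypothetical model): one guard per delimiter branch.
fn split_plain(string: &str, delimiter: &str) -> Vec<String> {
    let mut out = Vec::new();
    match delimiter {
        // Empty delimiter: the whole string is the only element,
        // appended only when it is non-empty.
        "" => {
            if !string.is_empty() {
                out.push(string.to_string());
            }
        }
        // Non-empty delimiter: skip splitting when the input is empty.
        d => {
            if !string.is_empty() {
                out.extend(string.split(d).map(str::to_string));
            }
        }
    }
    out
}

/// `null_value` variant (hypothetical model): same two guards; pieces equal
/// to `null_value` become `None` (assumed stand-in for SQL NULL).
fn split_with_null(string: &str, delimiter: &str, null_value: &str) -> Vec<Option<String>> {
    let to_element = |piece: &str| {
        if piece == null_value {
            None
        } else {
            Some(piece.to_string())
        }
    };
    let mut out = Vec::new();
    match delimiter {
        "" => {
            if !string.is_empty() {
                out.push(to_element(string));
            }
        }
        d => {
            if !string.is_empty() {
                out.extend(string.split(d).map(to_element));
            }
        }
    }
    out
}

fn main() {
    // All four empty-input cases now produce zero elements.
    println!("{}", split_plain("", ",").len());
    println!("{}", split_plain("", "").len());
    println!("{}", split_with_null("", ",", "x").len());
    println!("{}", split_with_null("", "", "x").len());
    // Non-empty input is unaffected by the guards.
    println!("{:?}", split_plain("a,b", ","));
}
```

Keeping a guard in each branch mirrors the "4 checks total" described above; a single early return would behave the same in this simplified model.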

Tests

Added sqllogictest cases in datafusion/sqllogictest/test_files/array.slt using cardinality() to unambiguously verify the arrays are truly empty (not just displaying as empty):

SELECT cardinality(string_to_array('', ','));       -- 0
SELECT cardinality(string_to_array('', ''));        -- 0
SELECT cardinality(string_to_array('', ',', 'x'));  -- 0
SELECT cardinality(string_to_array('', '', 'x'));   -- 0

Each test covers one of the four is_empty guard checks. All four fail without the fix (returning 1 instead of 0).
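Why the tests check length rather than rendered output can be illustrated with a small sketch. `render_unquoted` is a hypothetical renderer that prints string elements without quotes; it is an assumption standing in for the display behavior described in this PR, not Arrow's actual formatter.

```rust
/// Hypothetical renderer: joins elements without quoting them.
fn render_unquoted(items: &[&str]) -> String {
    format!("[{}]", items.join(", "))
}

fn main() {
    let one_empty: Vec<&str> = vec![""]; // the pre-fix result
    let truly_empty: Vec<&str> = vec![]; // the expected result

    // Both render identically, so comparing displayed text cannot catch the bug.
    println!("{} vs {}", render_unquoted(&one_empty), render_unquoted(&truly_empty));

    // The lengths differ, which is what a cardinality check observes.
    println!("{} vs {}", one_empty.len(), truly_empty.len());
}
```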

@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Mar 22, 2026
@dd-david-levin force-pushed the fix/string-to-array-empty-string branch from 447b2d2 to a83d162 Mar 23, 2026 13:07
@dd-david-levin marked this pull request as ready for review Mar 23, 2026 13:12
…mpatibility

Rust's str::split() on an empty string always yields one empty-string
element, so "".split(",") produces [""]. The empty-delimiter branch
also unconditionally appended the (empty) string value. Both cases
now guard with !string.is_empty() to return a truly empty array,
matching PostgreSQL behavior.

Tests use cardinality() to unambiguously verify the result since
Arrow's text format renders [""] identically to [].
@dd-david-levin force-pushed the fix/string-to-array-empty-string branch from a83d162 to a4d555f Mar 23, 2026 14:37

@LiaCastaneda (Contributor) left a comment

This makes sense to me

@gabotechs (Contributor) commented

Looks good. Thanks @dd-david-levin for the PR and @LiaCastaneda for the review! I'll leave this here until tomorrow in case someone else wants to chime in.

@gabotechs added this pull request to the merge queue Mar 24, 2026
Merged via the queue into apache:main with commit cdaecf0 Mar 24, 2026
30 checks passed
de-bgunter pushed a commit to de-bgunter/datafusion that referenced this pull request Mar 24, 2026
…mpatibility (apache#21104)

dd-david-levin added a commit to dd-david-levin/datafusion that referenced this pull request Mar 25, 2026
…mpatibility (apache#21104)

(cherry picked from commit cdaecf0)
dd-david-levin added a commit to dd-david-levin/datafusion that referenced this pull request Mar 26, 2026
…mpatibility (apache#21104)

(cherry picked from commit cdaecf0)
gh-worker-dd-mergequeue-cf854d bot added a commit to DataDog/datafusion that referenced this pull request Mar 26, 2026
…ache-pr-21104-20260325

Cherry-pick apache#21104

Co-authored-by: dd-david-levin <david.levin@datadoghq.com>
gabotechs pushed a commit to DataDog/datafusion that referenced this pull request Apr 16, 2026
…mpatibility (apache#21104)

(cherry picked from commit cdaecf0)
gabotechs pushed a commit to DataDog/datafusion that referenced this pull request Apr 16, 2026
…mpatibility (apache#21104)

(cherry picked from commit cdaecf0)