Skip to content

fix: Percent Encoding of paths for hive style partitioning#19651

Open
greedAuguria wants to merge 2 commits intoapache:mainfrom
greedAuguria:fix-partition-url-decoding
Open

fix: Percent Encoding of paths for hive style partitioning#19651
greedAuguria wants to merge 2 commits intoapache:mainfrom
greedAuguria:fix-partition-url-decoding

Conversation

@greedAuguria
Copy link
Copy Markdown

Which issue does this PR close?

Rationale for this change

Currently, when DataFusion parses Hive-style partitioned paths (e.g., s3://bucket/table/city=San%20Francisco/), it extracts the partition value literally as San%20Francisco.

Standard practice in tools like Apache Spark and Apache Hive is to URL-encode special characters in partition values when writing to object stores. This PR ensures DataFusion correctly decodes these values (to San Francisco) during the listing process, preventing data mismatches and ensuring consistent behavior with other engines.

What changes are included in this PR?

  1. Logic Update: Updated parse_partitions_for_path in datafusion/catalog-listing/src/helpers.rs to percent-decode partition values.
  2. Signature Change: Changed the return type of parse_partitions_for_path from Option<Vec<&str>> to Option<Vec<String>> because decoded strings require new allocations.
  3. Call Site Updates: Updated internal callers in datafusion-catalog-listing to accommodate the owned String return type.
  4. Dependencies: Added percent-encoding crate to datafusion-catalog-listing.

Are these changes tested?

Yes, new unit tests were added to datafusion/catalog-listing/src/helpers.rs covering:

  • Standard URL-encoded characters (e.g., %2F for /).
  • Spaces encoded as %20.
  • Multi-byte UTF-8 characters.
  • Forgiving behavior for malformed encoding (matching percent-encoding crate defaults).

Are there any user-facing changes?

Yes:

  • Data Behavior: Partition values extracted from file paths will now be correctly decoded instead of remaining URL-encoded.
  • API Change: The public helper function parse_partitions_for_path now returns Vec<String> instead of Vec<&str>.

@github-actions github-actions Bot added the catalog Related to the catalog crate label Jan 5, 2026
file_path: &'a Path,
table_partition_cols: I,
) -> Option<Vec<&'a str>>
) -> Option<Vec<String>>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Not sure whether it is worth it to return a Vec<Cow<'a, str>> here ?! E.g. if the partition contains % then decode it and return Cow::Owned(decoded.into_owned()), otherwise `Cow::Borrowed(val).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match part.split_once('=') {
Some((name, val)) if name == expected_partition => part_values.push(val),
Some((name, val)) if name == expected_partition => {
let decoded = percent_decode_str(val).decode_utf8().ok()?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should invalid values (like %FF in one of the tests below) return None or val ?!
I think it should behave like %XX - i.e. return val

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martin-g I am sorry for the late reply, I will get on this now! Thank you for taking a look!

greedAuguria and others added 2 commits April 4, 2026 11:18
- Return `Cow<'a, str>` instead of `String` to avoid unnecessary
  allocations when partition values have no percent-encoded characters.
- Fall back to the raw value instead of returning `None` when
  percent-decoding produces invalid UTF-8 (e.g. `%FF`).
- Fix Cargo.toml alphabetical ordering for taplo check.
@greedAuguria greedAuguria force-pushed the fix-partition-url-decoding branch from 8d0403b to 41fe291 Compare April 4, 2026 08:28
@greedAuguria
Copy link
Copy Markdown
Author

greedAuguria commented Apr 4, 2026

@martin-g hey, sorry again for the late reply! I've rebased on main and addressed both of your comments:

Cow<'a, str> instead of String: you were right, percent_decode_str().decode_utf8() already hands back a Cow<str>, so it's a natural fit. Now we only allocate when the value actually had encoded characters in it. Most partition values won't, so this should be a nice win for the common case. Updated the caller to use into_owned() instead of to_string() too so we don't double-allocate on the owned path.

Fallback for invalid encodings: good catch on the %FF case. Previously .ok()? would bail out and return None for the whole file, which silently drops it from query results, definitely not what we want. Now if percent-decoding produces invalid UTF-8, we just pass through the raw value as-is (via unwrap_or(Cow::Borrowed(val))). So %FF comes back as the literal string %FF instead of the file disappearing. Same idea as your suggestion to treat it like %XX.

The taplo CI failure was just an alphabetical ordering thing in the root Cargo.toml: fixed during the rebase.

Let me know if there's anything else!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

catalog Related to the catalog crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Partition values are not URL-decoded when extracted from Hive-style paths

3 participants