fix: Percent Encoding of paths for hive style partitioning#19651
fix: Percent Encoding of paths for hive style partitioning#19651greedAuguria wants to merge 2 commits intoapache:mainfrom
Conversation
| file_path: &'a Path, | ||
| table_partition_cols: I, | ||
| ) -> Option<Vec<&'a str>> | ||
| ) -> Option<Vec<String>> |
There was a problem hiding this comment.
nit: Not sure whether it is worth it to return a Vec<Cow<'a, str>> here ?! E.g. if the partition contains % then decode it and return Cow::Owned(decoded.into_owned()), otherwise `Cow::Borrowed(val).
There was a problem hiding this comment.
Actually this is what https://docs.rs/percent-encoding/latest/percent_encoding/struct.PercentDecode.html#method.decode_utf8 returns. It might be a good idea.
| match part.split_once('=') { | ||
| Some((name, val)) if name == expected_partition => part_values.push(val), | ||
| Some((name, val)) if name == expected_partition => { | ||
| let decoded = percent_decode_str(val).decode_utf8().ok()?; |
There was a problem hiding this comment.
Should invalid values (like %FF in one of the tests below) return None or val ?!
I think it should behave like %XX - i.e. return val
There was a problem hiding this comment.
@martin-g I am sorry for the late reply, I will get on this now! Thank you for taking a look!
- Return `Cow<'a, str>` instead of `String` to avoid unnecessary allocations when partition values have no percent-encoded characters. - Fall back to the raw value instead of returning `None` when percent-decoding produces invalid UTF-8 (e.g. `%FF`). - Fix Cargo.toml alphabetical ordering for taplo check.
8d0403b to
41fe291
Compare
|
@martin-g hey, sorry again for the late reply! I've rebased on main and addressed both of your comments: Cow<'a, str> instead of String: you were right, Fallback for invalid encodings: good catch on the The taplo CI failure was just an alphabetical ordering thing in the root Let me know if there's anything else! |
Which issue does this PR close?
Rationale for this change
Currently, when DataFusion parses Hive-style partitioned paths (e.g.,
s3://bucket/table/city=San%20Francisco/), it extracts the partition value literally asSan%20Francisco.Standard practice in tools like Apache Spark and Apache Hive is to URL-encode special characters in partition values when writing to object stores. This PR ensures DataFusion correctly decodes these values (to
San Francisco) during the listing process, preventing data mismatches and ensuring consistent behavior with other engines.What changes are included in this PR?
parse_partitions_for_pathindatafusion/catalog-listing/src/helpers.rsto percent-decode partition values.parse_partitions_for_pathfromOption<Vec<&str>>toOption<Vec<String>>because decoded strings require new allocations.datafusion-catalog-listingto accommodate the ownedStringreturn type.percent-encodingcrate todatafusion-catalog-listing.Are these changes tested?
Yes, new unit tests were added to
datafusion/catalog-listing/src/helpers.rscovering:%2Ffor/).%20.percent-encodingcrate defaults).Are there any user-facing changes?
Yes:
parse_partitions_for_pathnow returnsVec<String>instead ofVec<&str>.