Conversation
nuno-faria
left a comment
Thanks @athlcode. Since this changes the current behavior, I think it would be a good opportunity to add a config that allows switching between exception and last-key-wins (e.g., datafusion.execution.mapKeyDedupPolicy). Otherwise, existing code relying on the last-key-wins policy won't have a way to keep the old behavior.
cc: @Jefffrey and @comphead since you both reviewed the original map_from_entries.
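A minimal sketch of what the suggested policy switch could look like on the Rust side. All names here are hypothetical illustrations of the reviewer's idea, not code from this PR; the proposed config key is datafusion.execution.mapKeyDedupPolicy.

```rust
/// Hypothetical dedup-policy enum for Spark-compatible map builders.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MapKeyDedupPolicy {
    /// Spark default: raise a DUPLICATED_MAP_KEY error on duplicates.
    Exception,
    /// Previous behavior: silently keep the last occurrence of a key.
    LastWins,
}

/// Parse the (hypothetical) config value, defaulting to Exception,
/// which matches Spark's default mapKeyDedupPolicy.
fn parse_policy(value: &str) -> MapKeyDedupPolicy {
    match value.to_ascii_lowercase().as_str() {
        "last_wins" => MapKeyDedupPolicy::LastWins,
        _ => MapKeyDedupPolicy::Exception,
    }
}

fn main() {
    assert_eq!(parse_policy("EXCEPTION"), MapKeyDedupPolicy::Exception);
    assert_eq!(parse_policy("LAST_WINS"), MapKeyDedupPolicy::LastWins);
}
```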
coderfender
left a comment
The PR looks good @athlcode. Left some comments, and I would like to see some more tests covering various edge cases, on both the Rust side and the SLT side, to add more confidence.
Yep, fixing comments and will add tests.
Thanks @nuno-faria. I am looking into this as well.
Which issue does this PR close?
Closes # (none: prerequisite for apache/datafusion-comet#2706; follow-up to #17779 and #19274).
Rationale for this change
Spark's default mapKeyDedupPolicy is EXCEPTION: on duplicate map keys it raises a SparkRuntimeException with error class DUPLICATED_MAP_KEY. The existing Spark map_from_entries / map_from_arrays UDFs instead silently kept the last occurrence, forcing downstream engines like datafusion-comet to fall back to Spark for these expressions.
What changes are included in this PR?
Duplicate map keys in the Spark-compatible map_from_entries, map_from_arrays, and str_to_map UDFs now raise [DUPLICATED_MAP_KEY] Duplicate map key {key} was found, matching Spark's default behaviour and error class.
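The duplicate-key check described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual implementation (which operates on Arrow arrays inside DataFusion); the helper name and string-based error are hypothetical, but the error message mirrors the one quoted above.

```rust
use std::collections::HashSet;

// Hypothetical helper: scan map keys in order and fail on the first
// duplicate, instead of keeping the last occurrence (the old behavior).
fn check_duplicate_keys(keys: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for key in keys {
        // HashSet::insert returns false if the key was already present.
        if !seen.insert(*key) {
            return Err(format!(
                "[DUPLICATED_MAP_KEY] Duplicate map key {key} was found"
            ));
        }
    }
    Ok(())
}

fn main() {
    assert!(check_duplicate_keys(&["a", "b", "c"]).is_ok());
    let err = check_duplicate_keys(&["a", "b", "a"]).unwrap_err();
    assert!(err.contains("DUPLICATED_MAP_KEY"));
}
```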
Are these changes tested?
Yes, via sqllogictest assertions covering the new error across the affected Spark map UDFs (map_from_entries, map_from_arrays, and str_to_map).
Are there any user-facing changes?
Yes. Duplicate keys now raise [DUPLICATED_MAP_KEY] under the default policy instead of silently collapsing to the last occurrence. No new config keys, no API changes.