What happens?
If you have a python dictionary with two keys of the same word but with different casing e.g. {"Key": "first", "key": "second"}, the last value is used for both keys in the resulting anonymous struct.
To Reproduce
Test Case
import duckdb
conn = duckdb.connect()
print(conn.execute("SELECT ?", [{"Key": "first", "key": "second"}]).fetchone())
print(conn.execute("SELECT ?", [{"key": "second", "Key": "first"}]).fetchone())
Output
({'Key': 'second', 'key': 'second'},)
({'key': 'first', 'Key': 'first'},)
Analysis
I tracked this down from the ExecuteMany to TransformPreparedParameters to finally TransformDictionaryToStruct. I'm pretty confident that I've got the call path correct given that when you binding parameters LogicalType is LogicalType::UNKNOWN.
https://github.com/duckdb/duckdb-python/blob/main/src/duckdb_py/native/python_conversion.cpp#L78
When reviewing the function you can see that the keys are placed in a case insensitive map.
case_insensitive_map_t<idx_t> key_mapping;
for (idx_t i = 0; i < struct_keys.size(); i++) {
key_mapping[struct_keys[i]] = i;
}
That mapping is then used to grab the value by index via auto val = TransformPythonValue(dict.values.attr("__getitem__")(value_index). This is why the last value wins because key_mapping always get the last index.
I'll be the first to admit my lack of knowledge around C++ and making python bindings. But the implementation of re-accessing dict.values.attr("__getitem__")(value_index) feels odd to me when the keys and values are always of equal length. This is enforced in the initial assertion of the function.
I suspected it was because the TransformStructKeys function would return a set of keys smaller than the original based on the LogicalType but when I looked it simply unpacks the values. So maybe it used to work this way but not anymore. Based on this it appears the implementation could be simplified to this:
Value TransformDictionaryToStruct(const PyDictionary &dict, const LogicalType &target_type = LogicalType::UNKNOWN) {
bool struct_target = target_type.id() == LogicalTypeId::STRUCT;
if (struct_target && dict.len != StructType::GetChildCount(target_type)) {
throw InvalidInputException("We could not convert the object %s to the desired target type (%s)",
dict.ToString(), target_type.ToString());
}
auto struct_keys = TransformStructKeys(dict.keys, dict.len, target_type);
child_list_t<Value> struct_values;
for (idx_t i = 0; i < dict.len; i++) {
auto &key = struct_target ? StructType::GetChildName(target_type, i) : struct_keys[i];
auto &child_type = struct_target ? StructType::GetChildType(target_type, i) : LogicalType::UNKNOWN;
auto val = TransformPythonValue(dict.values.attr("__getitem__")(i), child_type);
struct_values.emplace_back(make_pair(std::move(key), std::move(val)));
}
return Value::STRUCT(std::move(struct_values));
}
Probably simplified more if the string conversion in TransformStructKeys were inlined into the main for loop.
OS:
MacOS aarch64
DuckDB Package Version:
1.4.1
Python Version:
3.10.15
Full Name:
Peter Snelgrove
Affiliation:
Rapid7
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set
Did you include all code required to reproduce the issue?
Did you include all relevant configuration to reproduce the issue?
What happens?
If you have a python dictionary with two keys of the same word but with different casing e.g.
{"Key": "first", "key": "second"}, the last value is used for both keys in the resulting anonymous struct.To Reproduce
Test Case
Output
Analysis
I tracked this down from the
ExecuteManytoTransformPreparedParametersto finallyTransformDictionaryToStruct. I'm pretty confident that I've got the call path correct given that when you binding parametersLogicalTypeisLogicalType::UNKNOWN.https://github.com/duckdb/duckdb-python/blob/main/src/duckdb_py/native/python_conversion.cpp#L78
When reviewing the function you can see that the keys are placed in a case insensitive map.
That mapping is then used to grab the value by index via
auto val = TransformPythonValue(dict.values.attr("__getitem__")(value_index). This is why the last value wins becausekey_mappingalways get the last index.I'll be the first to admit my lack of knowledge around C++ and making python bindings. But the implementation of re-accessing
dict.values.attr("__getitem__")(value_index)feels odd to me when the keys and values are always of equal length. This is enforced in the initial assertion of the function.I suspected it was because the
TransformStructKeysfunction would return a set of keys smaller than the original based on theLogicalTypebut when I looked it simply unpacks the values. So maybe it used to work this way but not anymore. Based on this it appears the implementation could be simplified to this:Probably simplified more if the string conversion in
TransformStructKeyswere inlined into the main for loop.OS:
MacOS aarch64
DuckDB Package Version:
1.4.1
Python Version:
3.10.15
Full Name:
Peter Snelgrove
Affiliation:
Rapid7
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set
Did you include all code required to reproduce the issue?
Did you include all relevant configuration to reproduce the issue?