What happens?
I have an application where I query a mysql table (via the standard ATTACH method for the mysql extension) and build a local reverse index for fast local lookups. It was impossible for two of my columns to be NULL so I updated my duckdb table schema to mirror the mysql schema.
My tests blew up because I got a None value in my arn column on the first and only the first row. Playing around with the WHERE clause and I was able to get the primary key resource_id to come back as null. Regardless of SELECT ordering or WHERE conditions, lowering the resource_id string length < 64 characters and the data populates just fine.
Here are some other observations:
- I get this result using any of the python interfaces for reading the query result like
fetchone, fetchmany, or arrow.
- The bug does NOT reproduce when using the DuckDB CLI!
- When using CLI with 65+ char resource_id: ALL VALUES ARE RETURNED CORRECTLY
- This suggests the bug is specific to the Python extension
- The issue is likely in how the Python bindings handle result sets from MySQL
- See kms_debugging.sh for CLI reproduction test
- mysql column varchar length and character set don't appear to have an effect.
- I messed around with table values a bunch so it doesn't feel like an issue with the table and column catalog.
To Reproduce
My example uses pytest because it was easier to create a fast feedback loop with something like watchexec as I was messing with different values.
uv venv
uv pip install pytest duckdb
watchexec -c -- pytest -vvv kms_debugging.py
The test is setup with a fixture to setup and teardown the db. It expects that you have a mysql instance running locally with a database named test. I'm running an older version of mysql 8.0.42. I'll try testing against a newer version.
Each test is parameterized to run twice. First with a string 64 characters long and the second 65 characters long which will cause the failure. I kept it this way so it was easy to add or modify those inputs..
I want to draw attention to the third test which demonstrates adding arn IS NOT NULL causes the null column to change.
from __future__ import annotations
import itertools
from dataclasses import dataclass
import duckdb
import pytest
@dataclass
class ExampleRow:
# [0, 64] will pass
# [65, 255] will change the value to None
# [256, ~] will exceed the column size
resource_id: str
region_name: str
arn: str
@classmethod
def from_size(cls, index: int, resource_id_length: int) -> ExampleRow:
return cls(
"".join(itertools.repeat(str(index), resource_id_length)),
"us-east-1",
"arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000",
)
def as_tuple(self) -> tuple[str, str, str]:
return (self.resource_id, self.region_name, self.arn)
@pytest.fixture()
def conn():
# Config
host = "localhost"
port = "3306"
user = "root"
database = "test"
# Setup
conn = duckdb.connect("test.db")
conn.sql(f"ATTACH 'host={host} user={user} port={port} database={database}' AS mysql (TYPE mysql)")
conn.sql("SET mysql_experimental_filter_pushdown = true")
conn.sql("SET mysql_debug_show_queries = true")
conn.sql(""" CALL mysql_execute( "mysql", "DROP TABLE IF EXISTS example" ) """)
conn.sql("""
CALL mysql_execute(
"mysql",
"CREATE TABLE example (
resource_id varchar(255) NOT NULL,
region_name varchar(64) NOT NULL,
arn varchar(1024) DEFAULT NULL
)"
)
""")
yield conn
conn.close()
@pytest.mark.parametrize("resource_id_size", [64, 65])
def test_does_param_binding_matter(conn: duckdb.DuckDBPyConnection, resource_id_size: int):
"""
Does using prepared parameter binding matter? It appears not I get the same incorrect result in both cases.
"""
example_row = ExampleRow.from_size(0, resource_id_size)
conn.execute("INSERT INTO mysql.example VALUES (?, ?, ?)", example_row.as_tuple())
bug = (
example_row.resource_id,
# `arn` will be None if len(resource_id) > 64
example_row.arn,
)
# Test
with_binding = "SELECT resource_id, arn FROM mysql.example WHERE region_name = ?"
assert conn.execute(with_binding, [example_row.region_name]).fetchone() == bug
without_binding = f"SELECT resource_id, arn FROM mysql.example WHERE region_name = '{example_row.region_name}'"
assert conn.sql(without_binding).fetchone() == bug
@pytest.mark.parametrize("resource_id_size", [64, 65])
def test_does_it_work_with_mysql_query(conn: duckdb.DuckDBPyConnection, resource_id_size: int):
"""
If I bypass duckdb and go directly to mysql using `mysql_query` it does give me the correct value
"""
example_row = ExampleRow.from_size(0, resource_id_size)
conn.execute("INSERT INTO mysql.example VALUES (?, ?, ?)", example_row.as_tuple())
query = f"""CALL mysql_query("mysql", "SELECT resource_id, arn FROM example WHERE region_name = '{example_row.region_name}'") """
assert conn.sql(query).fetchone() == (example_row.resource_id, example_row.arn)
@pytest.mark.parametrize("resource_id_size", [64, 65])
def test_select_order(conn: duckdb.DuckDBPyConnection, resource_id_size: int):
"""
Changing the select order, still gives you null resource_id.
"""
example_row = ExampleRow.from_size(0, resource_id_size)
conn.execute("INSERT INTO mysql.example VALUES (?, ?, ?)", example_row.as_tuple())
# SELECT `region_name`, `arn`, `resource_id` FROM `test`.`example` WHERE `region_name` = 'us-east-1';
result = conn.execute(
"SELECT arn, resource_id FROM mysql.example WHERE region_name = ?",
[example_row.region_name],
).fetchone()
assert result == (example_row.arn, example_row.resource_id)
# No filtering but resource_id is still null
# SELECT `resource_id`, `arn` FROM `test`.`example`;
cur = conn.sql("SELECT resource_id, arn FROM mysql.example")
assert cur.fetchone() == (
example_row.resource_id,
example_row.arn,
)
# when you include the `arn` in the where clause the null field becomes the resource id and the arn is now non-null
cur = conn.execute(
"SELECT resource_id, arn FROM mysql.example WHERE region_name = ? AND arn IS NOT NULL",
[example_row.region_name],
)
result = cur.fetchone()
assert result == (
example_row.resource_id,
example_row.arn,
)
CLI Script
I've also included a CLI repro example to show this doesn't happen when outside of python.
kms_debugging.sh
OS:
MacOS aarch64
DuckDB Package Version:
1.4.1
Python Version:
3.10.15
Full Name:
Peter Snelgrove
Affiliation:
Rapid7
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration to reproduce the issue?
What happens?
I have an application where I query a mysql table (via the standard
ATTACHmethod for the mysql extension) and build a local reverse index for fast local lookups. It was impossible for two of my columns to beNULLso I updated my duckdb table schema to mirror the mysql schema.My tests blew up because I got a
Nonevalue in myarncolumn on the first and only the first row. Playing around with theWHEREclause and I was able to get the primary keyresource_idto come back as null. Regardless ofSELECTordering orWHEREconditions, lowering theresource_idstring length < 64 characters and the data populates just fine.Here are some other observations:
fetchone,fetchmany, orarrow.To Reproduce
My example uses
pytestbecause it was easier to create a fast feedback loop with something likewatchexecas I was messing with different values.The test is setup with a fixture to setup and teardown the db. It expects that you have a
mysqlinstance running locally with a database namedtest. I'm running an older version of mysql8.0.42. I'll try testing against a newer version.Each test is parameterized to run twice. First with a string 64 characters long and the second 65 characters long which will cause the failure. I kept it this way so it was easy to add or modify those inputs..
I want to draw attention to the third test which demonstrates adding
arn IS NOT NULLcauses the null column to change.CLI Script
I've also included a CLI repro example to show this doesn't happen when outside of python.
kms_debugging.sh
OS:
MacOS aarch64
DuckDB Package Version:
1.4.1
Python Version:
3.10.15
Full Name:
Peter Snelgrove
Affiliation:
Rapid7
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration to reproduce the issue?