What happens?
Running a filter of the form

```sql
WHERE x IN (long list of integers)
```

hangs when run on a dataset imported from Arrow. This is a regression in DuckDB v1.3 and is also an issue in the latest build of v1.4.
To Reproduce
The script below will:

- Write some random data with hive partitioning
- Read the data through pyarrow
- Import it into DuckDB
- Run a filter of the form `id in (0, 4, 8, 12, 16, ..., 9996)`
Step four hangs on v1.3 and later. I am not sure whether it actually has anything to do with the hive partitioning.
```python
import tempfile
import time

import duckdb
import pyarrow.dataset as ds

IDS = 10000


class measure_execution_time:
    def __init__(self, name: str):
        self._name = name
        self._start_time = None

    def __enter__(self):
        self._start_time = time.perf_counter()

    def __exit__(self, *args):
        assert self._start_time is not None
        end_time = time.perf_counter()
        elapsed_time = end_time - self._start_time
        print(f"{self._name} took {elapsed_time:.2f} seconds")


def write_data(rootdir: str):
    duckdb.sql(
        f"""
        with ids as (
            select unnest(range({IDS})) as id
        ),
        dates as (
            select unnest(range('2020-01-01'::date, '2021-01-01'::date, '1 day'::interval)) as date
        ),
        data as (
            select
                *, random() as value, random() as sort_order,
                date_part('year', date) as year,
                date_part('month', date) as month,
                date_part('day', date) as day
            from ids cross join dates
            order by sort_order
        )
        select * from data
        """
    ).to_table("tmptable")
    duckdb.sql(f"copy tmptable to '{rootdir}' (format parquet, partition_by (year, month, day))")
    duckdb.sql("drop table tmptable")


def read_data(rootdir: str):
    dataset = ds.dataset(
        rootdir,
        partitioning="hive",
        format="parquet",
    )
    # select every fourth id
    id_filter = tuple(range(0, IDS, 4))
    (
        duckdb.from_arrow(dataset)
        # This specific line is extremely slow on DuckDB v1.3
        .filter(f"id in {id_filter}")
        .to_table("tmptable")
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as rootdir:
        with measure_execution_time("Writing data"):
            write_data(rootdir)
        with measure_execution_time("Reading data"):
            read_data(rootdir)
```
This code, when run with

```
pip install pyarrow==21.0.0 duckdb==1.2.2
```

prints

```
Writing data took 2.52 seconds
Reading data took 1.40 seconds
```
When run with

```
pip install pyarrow==21.0.0 duckdb==1.3.2
```

or

```
pip install pyarrow==21.0.0 duckdb==1.4.0.dev215
```

it prints

```
Writing data took 1.92 seconds
```

and then hangs.
Specifically, it hangs on this line:

```python
.filter(f"id in {id_filter}")
```
This seems specific to importing data from pyarrow: if I read the data using DuckDB's own Parquet reader instead, the issue does not occur.
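For reference, a sketch of that working path, assuming `rootdir` still points at the partitioned directory written by `write_data` above; the glob pattern is an assumption about the `year=.../month=.../day=...` layout that `partition_by` produces:

```python
import duckdb

# Workaround sketch: let DuckDB's own Parquet reader scan the hive-partitioned
# directory instead of going through a pyarrow dataset. `rootdir` is the
# temporary directory written by write_data() in the script above.
id_filter = tuple(range(0, 10000, 4))
(
    duckdb.sql(f"select * from read_parquet('{rootdir}/**/*.parquet', hive_partitioning=true)")
    .filter(f"id in {id_filter}")
    .to_table("tmptable")
)
```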
OS:
Ubuntu
DuckDB Version:
1.3.2
DuckDB Client:
Python 3.10
Hardware:
No response
Full Name:
Jakob Stigenberg
Affiliation:
Qubos Systematic
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?