What happens?
It seems like the relation's limit() method doesn't make queries as fast as it should/could.
To Reproduce
If I prepare a 100 million row parquet:
import duckdb as ddb
ddb.sql(# sql
"""
copy (
select i from generate_series(1, 100_000_000) s(i)
) to '100million.parquet'
"""
)
and then try to preview it with a SQL limit, it's very fast:
ddb.sql("select * from '100million.parquet' limit 5").show() # fast
ddb.sql("with tbl as (select * from '100million.parquet') select * from tbl limit 5").show() # fast
however, if I use the limit option on the relation instead, things are very slow (about a minute on my midrange laptop):
ddb.sql("select * from '100million.parquet'").limit(5).show() # slow
ddb.sql("select * from '100million.parquet'").limit(5).show(max_rows=5) # also slow
I would have expected the performance behaviour of both approaches to be identical.
Note I haven't dug into where the slowness actually is - .sql(), .limit(), or .show().
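One way to narrow this down would be to time each step of the chained call separately. The following is a minimal sketch, assuming the `100million.parquet` file from the reproduction exists in the working directory; `time_pipeline` and `main` are hypothetical helper names for illustration and are not part of DuckDB.

```python
import time


def time_pipeline(steps, start=None):
    """Run (label, fn) steps in order, passing each result to the next step.

    Returns a list of (label, elapsed_seconds) tuples, one per step.
    """
    timings = []
    value = start
    for label, fn in steps:
        t0 = time.perf_counter()
        value = fn(value)
        timings.append((label, time.perf_counter() - t0))
    return timings


def main():
    # Imported here so the timing helper above can be used without DuckDB.
    import duckdb as ddb

    # The same three calls as the slow case in the report, split into steps.
    steps = [
        (".sql()", lambda _: ddb.sql("select * from '100million.parquet'")),
        (".limit(5)", lambda rel: rel.limit(5)),
        (".show()", lambda rel: rel.show()),
    ]
    for label, seconds in time_pipeline(steps):
        print(f"{label:>10}: {seconds:.3f}s")
```

Calling `main()` prints one timing per step, which should show whether the time is spent building the relation, applying the limit, or rendering the result.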
Thanks for continually working on such great software!
OS:
Linux x86_64 - Pop OS 24.04
DuckDB Version:
1.4.3
DuckDB Client:
Python
Hardware:
No response
Full Name:
Jarrad Whitaker
Affiliation:
personal
Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?
Did you include all code required to reproduce the issue?
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set