Unify Arrow stream scanning via __arrow_c_stream__ (#307)
## Problem
Related: #70
DuckDB's Python client had separate code paths for every Arrow-flavored
object type: PyArrow Table, RecordBatchReader, Scanner, Dataset,
PyCapsule, and PyCapsuleInterface. Many of these did the same thing
through different routes — materialize to a PyArrow Table, then scan it.
This made the codebase harder to extend, and objects implementing the
PyCapsule Interface (`__arrow_c_stream__`) couldn't get
projection/filter pushdown unless pyarrow.dataset was installed.
## Approach
The core design decision is to prefer `__arrow_c_stream__` as the
universal entry point rather than maintaining isinstance checks for
PyArrow Table and RecordBatchReader. Both types implement
`__arrow_c_stream__`, so they don't need dedicated branches — they fall
through to the same PyCapsuleInterface path that handles any third-party
Arrow producer (Polars, ADBC, etc.).
This collapses the type detection in `GetArrowType()` from six
`isinstance` checks down to three (the types that don't have
`__arrow_c_stream__`):
* Scanner
* Dataset
* MessageReader
...followed by a single `hasattr(obj, "__arrow_c_stream__")` catch-all.
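The detection order described above can be sketched in plain Python. This is only an illustration, not the actual `GetArrowType()` code: the `ArrowKind` enum and `get_arrow_kind` are hypothetical names, and the three classes are stand-ins for the corresponding PyArrow types so the sketch runs without PyArrow installed.

```python
from enum import Enum, auto


class ArrowKind(Enum):
    """Hypothetical result type for the sketch."""
    SCANNER = auto()
    DATASET = auto()
    MESSAGE_READER = auto()
    CAPSULE_INTERFACE = auto()
    NOT_ARROW = auto()


# Stand-ins (assumption) for the three PyArrow types that do not
# expose __arrow_c_stream__ and therefore keep dedicated branches.
class Scanner:
    pass


class Dataset:
    pass


class MessageReader:
    pass


def get_arrow_kind(obj):
    # Dedicated branches first: these types lack __arrow_c_stream__.
    if isinstance(obj, Scanner):
        return ArrowKind.SCANNER
    if isinstance(obj, Dataset):
        return ArrowKind.DATASET
    if isinstance(obj, MessageReader):
        return ArrowKind.MESSAGE_READER
    # Catch-all: Table, RecordBatchReader, and any third-party
    # producer end up here because they implement the protocol.
    if hasattr(obj, "__arrow_c_stream__"):
        return ArrowKind.CAPSULE_INTERFACE
    return ArrowKind.NOT_ARROW
```

Because the catch-all is a protocol check rather than a type check, a new Arrow producer needs no change to the detection logic as long as it implements `__arrow_c_stream__`.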
The PyCapsuleInterface path now has "tiered" pushdown:
- if `pyarrow.dataset` is available: import the stream as a
RecordBatchReader, feed it through `Scanner.from_batches` for
projection/filter pushdown, then export it back to a C stream.
- otherwise: return the raw C stream directly. DuckDB applies
projection and filters post-scan via `arrow_scan_dumb`.
For schema extraction we use `schema._export_to_c` as a middle tier:
`__arrow_c_schema__` is tried first, then `schema._export_to_c`, and
only then the stream-consuming fallback. This is intended to keep
single-use streams from being consumed during schema extraction.
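The fallback order for schema extraction could look roughly like this. The function name and the returned strategy labels are hypothetical, and the sketch only picks a strategy without performing the export (in PyArrow, `_export_to_c` writes into a caller-provided `ArrowSchema` pointer, which is out of scope here).

```python
def pick_schema_strategy(obj):
    """Pick the least destructive way to obtain obj's schema (sketch)."""
    # 1. Protocol-level schema export: never touches the data stream.
    if hasattr(obj, "__arrow_c_schema__"):
        return "arrow_c_schema"

    # 2. PyArrow-style objects expose a .schema whose _export_to_c can
    #    be used without pulling any batches from the stream.
    schema = getattr(obj, "schema", None)
    if schema is not None and hasattr(schema, "_export_to_c"):
        return "schema_export_to_c"

    # 3. Last resort: open the stream and read its schema, which may
    #    consume a single-use stream.
    if hasattr(obj, "__arrow_c_stream__"):
        return "consume_stream"

    raise TypeError(f"{type(obj).__name__} is not an Arrow stream producer")
```

Only the third strategy risks exhausting a single-use stream, which is why it comes last.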
~Polars DataFrames with `__arrow_c_stream__` (v1.4+) now fall through to
the unified path instead of going through `.to_arrow()`. We keep a
fallback for Polars < 1.4.~ Edit: this caused a large performance
regression. Polars doesn't seem to do a zero-copy conversion and
re-converts on every new scan, so I've reverted this for now.
I'll post some benchmarks tomorrow. First results look good.