
Commit 07d38c7

timsaucer and claude committed
Address review feedback: doctest, streaming, date/timestamp
- Convert the `__init__.py` quick-start block to doctest format so it is
  picked up by `pytest --doctest-modules` (already the project default),
  preventing silent rot.
- Extract streaming into its own SKILL.md subsection with guidance on when to
  prefer `execute_stream()` over `collect()`, sync and async iteration, and
  `execute_stream_partitioned()` for per-partition streams.
- Generalize the date-arithmetic rule from Date32 to both Date32 and Date64
  (both reject Duration at any precision, both accept month_day_nano_interval),
  and note that Timestamp columns differ and do accept Duration.
- Document the PyArrow-inherited type mapping returned by
  `to_pydict()`/`to_pylist()`, including the nanosecond fallback to
  `pandas.Timestamp` / `pandas.Timedelta` and the `to_pandas()` footgun where
  date columns come back as an object dtype.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d4ca194 commit 07d38c7

2 files changed

Lines changed: 73 additions & 20 deletions


SKILL.md

Lines changed: 61 additions & 7 deletions
@@ -264,13 +264,62 @@ polars_df = df.to_polars()  # pl.DataFrame
 py_dict = df.to_pydict()  # dict[str, list]
 py_list = df.to_pylist()  # list[dict]
 count = df.count()  # int
+```
+
+### Date and Timestamp Type Conversion
+
+The Python type returned by `to_pydict()` / `to_pylist()` depends on the Arrow
+column type, and the mapping is inherited from PyArrow:
+
+| Arrow type | Python type returned |
+|---|---|
+| `timestamp(s)` / `(ms)` / `(us)` | `datetime.datetime` |
+| `timestamp(ns)` | `pandas.Timestamp` |
+| `date32` / `date64` | `datetime.date` |
+| `duration(s)` / `(ms)` / `(us)` | `datetime.timedelta` |
+| `duration(ns)` | `pandas.Timedelta` |
+
+The nanosecond-precision fallback to pandas types is the main surprise:
+pandas is not a hard dependency of `datafusion`, but PyArrow reaches for it
+when `datetime.datetime` / `datetime.timedelta` would lose precision (stdlib
+types only go to microseconds). If you need plain stdlib types, cast to a
+coarser unit before collecting, e.g.
+`df.select(col("ts").cast(pa.timestamp("us")))`.
+
+`df.to_pandas()` has its own footgun for dates: pandas has no pure-date dtype,
+so a `date32`/`date64` column comes back as an `object` column of
+`datetime.date` values rather than `datetime64[ns]`. If downstream code
+expects a datetime column, cast on the DataFusion side first:
+`col("ship_date").cast(pa.timestamp("ns"))`.
+
+### Streaming Results
+
+Prefer streaming over `collect()` when the result is too large to materialize
+in memory, when you want to start processing before the query finishes, or
+when you may break out of the loop early. `execute_stream()` pulls one
+`RecordBatch` at a time from the execution plan rather than buffering the
+whole result up front.
 
-# Streaming
-stream = df.execute_stream()  # RecordBatchStream (single partition)
+```python
+# Single-partition stream; batch is a datafusion.RecordBatch
+stream = df.execute_stream()
 for batch in stream:
-    process(batch)
+    process(batch.to_pyarrow())  # convert to pa.RecordBatch if needed
+
+# DataFrame is iterable directly (delegates to execute_stream)
+for batch in df:
+    process(batch.to_pyarrow())
+
+# One stream per partition, for parallel consumption
+for stream in df.execute_stream_partitioned():
+    for batch in stream:
+        process(batch.to_pyarrow())
 ```
 
+Async iteration is also supported via `async for batch in df: ...` (or
+`df.execute_stream()`), which is useful when batches are interleaved with
+other I/O.
+
 ### Writing Results
 
 ```python
@@ -309,9 +358,9 @@ col("a") % lit(3)  # modulo
 
 ### Date Arithmetic
 
-`Date32` columns require `Interval` types for arithmetic, not `Duration`. Use
-PyArrow's `month_day_nano_interval` type, which takes a `(months, days, nanos)`
-tuple:
+`Date32` and `Date64` columns both require `Interval` types for arithmetic,
+not `Duration`. Use PyArrow's `month_day_nano_interval` type, which takes a
+`(months, days, nanos)` tuple:
 
 ```python
 import pyarrow as pa
@@ -324,9 +373,14 @@ col("ship_date") - lit(pa.scalar((3, 0, 0), type=pa.month_day_nano_interval()))
 ```
 
 **Important**: `lit(datetime.timedelta(days=90))` creates a `Duration(µs)`
-literal, which is **not** compatible with `Date32` arithmetic. Always use
+literal, which is **not** compatible with `Date32`/`Date64` arithmetic
+(`Duration(ms)` and `Duration(ns)` are rejected too). Always use
 `pa.month_day_nano_interval()` for date operations.
 
+**Timestamps behave differently**: `Timestamp` columns *do* accept `Duration`,
+so `col("ts") - lit(datetime.timedelta(days=1))` works. The interval-only
+rule applies specifically to date columns.
+
 ### Comparisons
 
 ```python

python/datafusion/__init__.py

Lines changed: 12 additions & 13 deletions
@@ -35,19 +35,18 @@
 
 Quick start
 -----------
-::
-
-    from datafusion import SessionContext, col
-    from datafusion import functions as F
-
-    ctx = SessionContext()
-    df = ctx.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
-    result = (
-        df.filter(col("a") > 1)
-        .with_column("total", col("a") + col("b"))
-        .aggregate([], [F.sum(col("total")).alias("grand_total")])
-    )
-    print(result.to_pydict())  # {'grand_total': [16]}
+
+>>> from datafusion import SessionContext, col
+>>> from datafusion import functions as F
+>>> ctx = SessionContext()
+>>> df = ctx.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
+>>> result = (
+...     df.filter(col("a") > 1)
+...     .with_column("total", col("a") + col("b"))
+...     .aggregate([], [F.sum(col("total")).alias("grand_total")])
+... )
+>>> result.to_pydict()
+{'grand_total': [16]}
 
 For a comprehensive guide to the DataFrame API -- including a SQL-to-DataFrame
 reference table, expression building, idiomatic patterns, and common pitfalls --
