
Commit 07d38c7

timsaucer and claude committed
Address review feedback: doctest, streaming, date/timestamp
- Convert the `__init__.py` quick-start block to doctest format so it is
  picked up by `pytest --doctest-modules` (already the project default),
  preventing silent rot.
- Extract streaming into its own SKILL.md subsection with guidance on when to
  prefer `execute_stream()` over `collect()`, sync and async iteration, and
  `execute_stream_partitioned()` for per-partition streams.
- Generalize the date-arithmetic rule from Date32 to both Date32 and Date64
  (both reject Duration at any precision, both accept month_day_nano_interval),
  and note that Timestamp columns differ and do accept Duration.
- Document the PyArrow-inherited type mapping returned by
  `to_pydict()`/`to_pylist()`, including the nanosecond fallback to
  `pandas.Timestamp` / `pandas.Timedelta` and the `to_pandas()` footgun where
  date columns come back as an object dtype.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d4ca194 commit 07d38c7

2 files changed

Lines changed: 73 additions & 20 deletions


SKILL.md

Lines changed: 61 additions & 7 deletions
@@ -264,13 +264,62 @@ polars_df = df.to_polars()  # pl.DataFrame
 py_dict = df.to_pydict()  # dict[str, list]
 py_list = df.to_pylist()  # list[dict]
 count = df.count()  # int
+```
+
+### Date and Timestamp Type Conversion
+
+The Python type returned by `to_pydict()` / `to_pylist()` depends on the Arrow
+column type, and the mapping is inherited from PyArrow:
+
+| Arrow type | Python type returned |
+|---|---|
+| `timestamp(s)` / `(ms)` / `(us)` | `datetime.datetime` |
+| `timestamp(ns)` | `pandas.Timestamp` |
+| `date32` / `date64` | `datetime.date` |
+| `duration(s)` / `(ms)` / `(us)` | `datetime.timedelta` |
+| `duration(ns)` | `pandas.Timedelta` |
+
+The nanosecond-precision fallback to pandas types is the main surprise:
+pandas is not a hard dependency of `datafusion`, but PyArrow reaches for it
+when `datetime.datetime` / `datetime.timedelta` would lose precision (stdlib
+types only go to microseconds). If you need plain stdlib types, cast to a
+coarser unit before collecting, e.g.
+`df.select(col("ts").cast(pa.timestamp("us")))`.
+
+`df.to_pandas()` has its own footgun for dates: pandas has no pure-date dtype,
+so a `date32`/`date64` column comes back as an `object` column of
+`datetime.date` values rather than `datetime64[ns]`. If downstream code
+expects a datetime column, cast on the DataFusion side first:
+`col("ship_date").cast(pa.timestamp("ns"))`.
+
+### Streaming Results
+
+Prefer streaming over `collect()` when the result is too large to materialize
+in memory, when you want to start processing before the query finishes, or
+when you may break out of the loop early. `execute_stream()` pulls one
+`RecordBatch` at a time from the execution plan rather than buffering the
+whole result up front.
 
-# Streaming
-stream = df.execute_stream()  # RecordBatchStream (single partition)
+```python
+# Single-partition stream; batch is a datafusion.RecordBatch
+stream = df.execute_stream()
 for batch in stream:
-    process(batch)
+    process(batch.to_pyarrow())  # convert to pa.RecordBatch if needed
+
+# DataFrame is iterable directly (delegates to execute_stream)
+for batch in df:
+    process(batch.to_pyarrow())
+
+# One stream per partition, for parallel consumption
+for stream in df.execute_stream_partitioned():
+    for batch in stream:
+        process(batch.to_pyarrow())
 ```
 
+Async iteration is also supported via `async for batch in df: ...` (or
+`df.execute_stream()`), which is useful when batches are interleaved with
+other I/O.
+
 ### Writing Results
 
 ```python
@@ -309,9 +358,9 @@ col("a") % lit(3)  # modulo
 
 ### Date Arithmetic
 
-`Date32` columns require `Interval` types for arithmetic, not `Duration`. Use
-PyArrow's `month_day_nano_interval` type, which takes a `(months, days, nanos)`
-tuple:
+`Date32` and `Date64` columns both require `Interval` types for arithmetic,
+not `Duration`. Use PyArrow's `month_day_nano_interval` type, which takes a
+`(months, days, nanos)` tuple:
 
 ```python
 import pyarrow as pa
@@ -324,9 +373,14 @@ col("ship_date") - lit(pa.scalar((3, 0, 0), type=pa.month_day_nano_interval()))
 ```
 
 **Important**: `lit(datetime.timedelta(days=90))` creates a `Duration(µs)`
-literal, which is **not** compatible with `Date32` arithmetic. Always use
+literal, which is **not** compatible with `Date32`/`Date64` arithmetic
+(`Duration(ms)` and `Duration(ns)` are rejected too). Always use
 `pa.month_day_nano_interval()` for date operations.
 
+**Timestamps behave differently**: `Timestamp` columns *do* accept `Duration`,
+so `col("ts") - lit(datetime.timedelta(days=1))` works. The interval-only
+rule applies specifically to date columns.
+
 ### Comparisons
 
 ```python

python/datafusion/__init__.py

Lines changed: 12 additions & 13 deletions
@@ -35,19 +35,18 @@
 
 Quick start
 -----------
-::
-
-    from datafusion import SessionContext, col
-    from datafusion import functions as F
-
-    ctx = SessionContext()
-    df = ctx.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
-    result = (
-        df.filter(col("a") > 1)
-        .with_column("total", col("a") + col("b"))
-        .aggregate([], [F.sum(col("total")).alias("grand_total")])
-    )
-    print(result.to_pydict())  # {'grand_total': [16]}
+
+>>> from datafusion import SessionContext, col
+>>> from datafusion import functions as F
+>>> ctx = SessionContext()
+>>> df = ctx.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
+>>> result = (
+...     df.filter(col("a") > 1)
+...     .with_column("total", col("a") + col("b"))
+...     .aggregate([], [F.sum(col("total")).alias("grand_total")])
+... )
+>>> result.to_pydict()
+{'grand_total': [16]}
 
 For a comprehensive guide to the DataFrame API -- including a SQL-to-DataFrame
 reference table, expression building, idiomatic patterns, and common pitfalls --
