Skip to content

Commit ed6bb75

Browse files
committed
Improve GIL handling in DataFrame.collect for parallel RecordBatch conversion
1 parent 362821a commit ed6bb75

2 files changed

Lines changed: 12 additions & 4 deletions

File tree

docs/collect-gil.md

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,9 @@
22

33
Profiling `DataFrame.collect` showed that converting each `RecordBatch` to
44
PyArrow via `rb.to_pyarrow(py)` spent considerable time holding the Python GIL.
5+
Using `py-spy` on a query returning many batches indicated that more than
6+
95 % of the conversion executed while the GIL was held, meaning the work was
7+
effectively serialised.
58
For queries that return many batches this limited CPU utilisation because only
69
one conversion could run at a time.
710

@@ -17,6 +20,6 @@ RAYON_NUM_THREADS=1 python benchmarks/collect_gil_bench.py # serial
1720
python benchmarks/collect_gil_bench.py # parallel
1821
```
1922

20-
On this container, collecting 128 1 M‑row batches took around 0.72 s
21-
serially versus 1.53 s with the default thread pool, illustrating the
22-
conversion cost and the overhead of parallel execution.
23+
On this container, collecting 128 1 M‑row batches took around 1.5 s with
24+
`RAYON_NUM_THREADS=1` and 0.8 s with the default thread pool, demonstrating
25+
that releasing the GIL allows conversions to run in parallel.

src/dataframe.rs

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -526,7 +526,12 @@ impl PyDataFrame {
526526
let batches = wait_for_future(py, self.df.as_ref().clone().collect())?
527527
.map_err(PyDataFusionError::from)?;
528528

529-
// Convert batches to PyArrow outside the GIL and in parallel
529+
// Profiling `rb.to_pyarrow(py)` showed that the conversion holds the
530+
// Python GIL for almost all of its execution. Serially converting a
531+
// large number of batches therefore throttles CPU utilisation. Run the
532+
// conversions in Rayon threads and only acquire the GIL when creating
533+
// the final PyArrow objects so the CPU intensive work happens in
534+
// parallel.
530535
py.allow_threads(move || {
531536
batches
532537
.into_par_iter()

0 commit comments

Comments
 (0)