Commit 77f821d: UNPICK response

File: PR_RESPONSE.md (353 additions, 0 deletions)
# PR Response to Review Comments

## Overview

Thank you @timsaucer for the detailed review! Your comments have identified several important improvements needed to make the deprecation path smoother and the implementation more robust. Below are responses and action plans for each comment.

---

## Documentation: `docs/source/user-guide/dataframe/rendering.rst`
### Comment 1: Lines 60-61 - Inconsistent defaults and naming

**Issue:**

```python
min_rows_display = 20,  # Minimum number of rows to display
repr_rows = 10,  # Number of rows to display in __repr__
```

**Comment:**

> It looks like the default here has `min_rows > max_rows`. Also should we have consistent naming of the two?
> Either `min_rows` and `max_rows` or `min_rows_display` and `max_rows_display`?
> I think the `_display` was differentiating what happens during a `display()` call vs `__repr__`, but I think these values get used during both calls.

**Response & Action Plan:**

You're absolutely right on both points:

1. **Default values issue**: Having `min_rows_display=20` and `max_rows=10` creates an invalid configuration. Both parameters use the same rendering logic path, so this is incorrect.

2. **Naming consistency**: The `_display` suffix is misleading since both parameters apply to all rendering contexts (both `__repr__` and explicit `display()` calls).

**Action items:**

- [ ] Change defaults in documentation to `min_rows_display=10, max_rows=10` (or a similar valid configuration like `min_rows_display=5, max_rows=10`)
- [ ] Consider renaming to either:
  - Option A: `min_rows` and `max_rows` (simpler, clearer)
  - Option B: `min_rows_display` and `max_rows_display` (more verbose but consistent)
- [ ] Update all examples in documentation to use consistent naming
- [ ] Ensure validation logic enforces `min_rows_display <= max_rows`

**Recommendation:** I lean toward Option A (`min_rows`/`max_rows`) for simplicity unless there's a strong reason for the `_display` suffix.
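The validation item above could look like the following minimal sketch (the function name and error wording are assumptions for illustration, not the actual datafusion implementation):

```python
def validate_row_limits(min_rows: int, max_rows: int) -> None:
    """Reject configurations where the minimum exceeds the maximum."""
    if min_rows < 0 or max_rows < 0:
        raise ValueError("row limits must be non-negative")
    if min_rows > max_rows:
        raise ValueError(
            f"min_rows ({min_rows}) must not exceed max_rows ({max_rows})"
        )


validate_row_limits(5, 10)  # valid configuration: no error
```

A check like this would turn the invalid documented default (`min_rows_display=20, max_rows=10`) into an immediate error instead of silently inconsistent rendering.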
---

### Comment 2: Lines 193-194 - Same naming/default issue

**Issue:**

```python
min_rows_display = 50,  # Always show at least 50 rows
max_rows = 20  # Show 20 rows in __repr__ output
```

**Comment:**

> Same as above, difference between `_display` and without the trailer and also we have here `min_rows > max_rows`.

**Response & Action Plan:**

Same issues as above, in a different example.

**Action items:**

- [ ] Fix the example to use a valid configuration: `min_rows_display=20, max_rows=50` (or similar)
- [ ] Apply consistent naming per the decision in Comment 1
- [ ] Review all other documentation examples for similar issues
---

## Python: `python/datafusion/dataframe_formatter.py`

### Comment 3: Lines 167-169/256 - Duplicate type/default information

**Issue:**

```python
min_rows_display : int, default 10
    Minimum number of rows to display.
repr_rows : int, default 10
    Default number of rows to display in repr output.
```

**Comment:**

> It's not about this PR per se, but maybe this is an opportunity to tighten up the comments here. We're repeating ourselves with the types and defaults. Those are already in the type hints.
> I think it's becoming customary to not duplicate that information and the argument line is the preferred place to keep it. That way we don't have to worry about maintaining the values in two places.

**Response & Action Plan:**

Excellent point! This is indeed Python best practice. Since type hints and default values are already in the signature, the docstring should focus on **what** the parameter does and **why** you'd use it, not repeat information that's already in the code.

**Action items:**

- [ ] Refactor the docstring to follow NumPy/pandas style without type/default duplication:

  ```python
  Parameters
  ----------
  max_cell_length
      Maximum length of cell content before truncation.
  max_width
      Maximum width of the displayed table in pixels.
  max_memory_bytes
      Maximum memory in bytes for rendered data. Helps prevent performance
      issues with large datasets.
  min_rows_display
      Minimum number of rows to display even if the memory limit is reached.
      Must not exceed max_rows.
  max_rows
      Maximum number of rows to display. Takes precedence over memory limits
      when fewer rows are requested.
  ```

- [ ] Apply this pattern consistently across all docstrings in the file
- [ ] Keep detailed explanation of behavior/constraints in the description section
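Applied to a full signature, the pattern might read as follows (a hypothetical standalone helper for illustration, not the real datafusion API; defaults are placeholders):

```python
def configure_display(
    *,
    min_rows_display: int = 5,
    max_rows: int = 10,
) -> dict:
    """Build a display configuration.

    Types and defaults live only in the signature; the docstring explains
    intent and constraints instead of repeating them.

    Parameters
    ----------
    min_rows_display
        Minimum number of rows to display even if a memory limit is hit.
        Must not exceed ``max_rows``.
    max_rows
        Maximum number of rows to display.
    """
    return {"min_rows_display": min_rows_display, "max_rows": max_rows}
```

This keeps a single source of truth, so changing a default never requires a matching docstring edit.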
---

### Comment 4 & 5: Lines 335-338, 351-354 - Why add accessors for a deprecated property?

**Issue:**

```python
@property
def repr_rows(self) -> int:
    """Get the maximum number of rows (deprecated name)..."""
    return self._max_rows

@repr_rows.setter
def repr_rows(self, value: int) -> None:
    """Set the maximum number of rows using deprecated name..."""
    warnings.warn(...)
    self._max_rows = value
```

**Comment:**

> If `repr_rows` is being deprecated, why add an accessor?
> Same, why add for deprecated?

**Response & Action Plan:**

This is a great question about deprecation strategy. The accessors are being added for **backward compatibility** during the deprecation period.

**Rationale:**

1. **User code may directly access the property**: Code like `formatter.repr_rows = 20` needs to continue working during the deprecation period
2. **Graceful migration path**: Users get a warning but their code doesn't break
3. **Custom formatter implementations**: External code that inherits from the formatter and accesses `repr_rows` directly will continue to work

However, I see your point: if users are unlikely to access these properties directly (most use `configure_formatter()`), then these accessors may be overkill.

**Questions for consideration:**

- Do we know of any user code that directly accesses `formatter.repr_rows`?
- What's our deprecation timeline? (1 release cycle, 2 release cycles?)
- Should we follow a "hard break" vs "soft deprecation" approach?

**Proposed approach:**

- [ ] Keep the accessors for at least 1-2 release cycles to allow users to migrate
- [ ] Add prominent deprecation warnings in release notes
- [ ] Remove in a future major version (e.g., the next major version bump)
- [ ] Alternative: If direct property access is rare, we could remove the accessors and only handle `repr_rows` in `configure_formatter()`

**Recommendation:** Keep the accessors for now with clear deprecation warnings, and plan removal in the next major version.
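The "soft deprecation" option sketched concretely (the class name and warning text are illustrative, not the exact PR code; warning on reads as well as writes is a design choice here, whereas the PR as reviewed warns only on assignment):

```python
import warnings


class FormatterShim:
    """Minimal sketch of a max_rows / repr_rows deprecation shim."""

    def __init__(self, max_rows: int = 10) -> None:
        self._max_rows = max_rows

    @property
    def max_rows(self) -> int:
        return self._max_rows

    @max_rows.setter
    def max_rows(self, value: int) -> None:
        self._max_rows = value

    @property
    def repr_rows(self) -> int:
        """Deprecated alias for max_rows."""
        warnings.warn(
            "repr_rows is deprecated; use max_rows instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._max_rows

    @repr_rows.setter
    def repr_rows(self, value: int) -> None:
        warnings.warn(
            "repr_rows is deprecated; use max_rows instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self._max_rows = value
```

With this shape, `formatter.repr_rows = 20` keeps working and updates the same backing field as `max_rows`, so existing user code migrates on its own schedule.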
---

## Tests: `python/tests/test_dataframe.py`

### Comment 6: Lines 1460-1462 - Use `large_df` instead of `df`

**Issue:**

```python
def test_html_formatter_memory_boundary_conditions(df, clean_formatter_state):
```

**Comment:**

> Maybe switch to `large_df` instead of `df` here?

**Response & Action Plan:**

Excellent suggestion! The `large_df` fixture (100,000 rows) is much better suited for testing memory boundary conditions than the standard `df` fixture (3 rows).

**Action items:**

- [ ] Change the test signature to use the `large_df` fixture
- [ ] Update the test to work with the larger dataset
- [ ] Adjust assertions to account for the larger data size

---
### Comment 7: Lines 1468-1471 - Adjust max_rows for large_df

**Issue:**

```python
configure_formatter(max_memory_bytes=10 * MB, min_rows_display=1, max_rows=100)
```

**Comment:**

> If you do switch to `large_df` then I think it may go above the 100 limit you have.

**Response & Action Plan:**

Good catch! With `large_df` containing 100,000 rows, the `max_rows=100` limit would be hit before we get useful memory-limit testing.

**Action items:**

- [ ] Increase `max_rows` to a value well above what we expect to collect (e.g., `max_rows=100000`)
- [ ] Ensure we're actually testing memory limits, not row limits
- [ ] Add explicit comments about why we set these values

---
### Comment 8: Lines 1473-1476 - Test actual early termination of the stream

**Issue:**

```python
# Test 1: Very small memory limit should still respect min_rows
configure_formatter(max_memory_bytes=10, min_rows_display=1)
```

**Comment:**

> I think a better test is one where we *do* hit the memory limit well before we hit the min number of rows, hence the recommendation to switch to `large_df`.
> Actually, maybe we want a different dataframe, one where we know we have multiple batches instead of a single batch. The thing this isn't doing is verifying we've ended the stream early, but I think that would have to be a rust test instead of a pytest.

**Response & Action Plan:**

This is an excellent observation about the limitations of the current test. You're identifying two key testing gaps:

1. **Multi-batch streaming**: The current test may use a single-batch DataFrame, so we never actually test stream termination
2. **Rust-level verification**: Python tests can't verify that the stream was actually closed early vs. all data being collected and then truncated

**Action items:**

**Python test improvements:**

- [ ] Create a test DataFrame that definitely spans multiple record batches:

  ```python
  # Create a multi-batch DataFrame explicitly; building a Table from
  # explicit batches preserves the batch boundaries
  batches = [
      pa.RecordBatch.from_pydict(
          {"a": list(range(10_000)), "b": [f"str_{i}" for i in range(10_000)]}
      )
      for _ in range(10)
  ]
  large_multi_batch_df = ctx.from_arrow(pa.Table.from_batches(batches))
  ```

- [ ] Test with a memory limit that would be exceeded by 2-3 batches but not by 1 batch
- [ ] Verify behavior: partial data + truncation message + respects min_rows
**Rust test considerations:**

- [ ] Add a Rust unit test in `src/dataframe.rs` that:
  - Creates a mock stream with known batch sizes
  - Sets a memory limit that should trigger after N batches
  - Verifies that only N batches were consumed (the stream was closed early)
  - Verifies the `has_more_rows` flag is set correctly

**Example Rust test structure:**

```rust
#[test]
fn test_stream_early_termination() {
    // Create a stream with known batches
    // Set a memory limit that stops collection after 2 batches
    // Verify only 2 batches were consumed
    // Verify has_more_rows = true
}
```

**Note:** The Rust test would be more robust but also more complex. We should discuss if it's worth the effort vs. relying on integration testing.
---

## Rust: `src/dataframe.rs`

### Comment 9: Lines 150-156 - Backward-compatible attribute lookup

**Issue:**

```rust
let max_rows = get_attr(formatter, "max_rows", default_config.max_rows);
```

**Comment:**

> Since users may have provided their own implementation for the formatter, I think we need to first try getting `max_rows`.
> If that fails, try getting `repr_rows`. If that fails, take default.
> When we remove `repr_rows` entirely after it's been deprecated for a few releases, then we can revert to this simpler logic.

**Response & Action Plan:**

This is a **critical point** for backward compatibility! Users with custom formatter implementations won't have the `max_rows` attribute yet, only `repr_rows`. The current code would break their formatters.

**Action items:**

- [ ] Implement fallback logic in Rust:

  ```rust
  // Try the new name first, fall back to the deprecated name, then the
  // default. A helper returning Option (rather than eagerly substituting
  // a default, as get_attr does) keeps "attribute absent" distinguishable.
  let max_rows = try_get_attr(formatter, "max_rows")
      .or_else(|| try_get_attr(formatter, "repr_rows"))
      .unwrap_or(default_config.max_rows);
  ```

  Or more explicitly:

  ```rust
  let max_rows = if let Ok(value) = formatter.getattr("max_rows") {
      value.extract::<usize>().unwrap_or(default_config.max_rows)
  } else if let Ok(value) = formatter.getattr("repr_rows") {
      value.extract::<usize>().unwrap_or(default_config.max_rows)
  } else {
      default_config.max_rows
  };
  ```

- [ ] Add a code comment explaining the fallback chain:

  ```rust
  // Try max_rows first (new name), fall back to repr_rows (deprecated),
  // then use the default. This ensures backward compatibility with custom
  // formatter implementations during the deprecation period.
  ```

- [ ] Add a test case for a custom formatter with only the `repr_rows` attribute
- [ ] Document the deprecation timeline for when we can remove the fallback

**Timeline suggestion:**

- Release N: Add `max_rows`, deprecate `repr_rows` with warnings
- Release N+1: Keep both supported
- Release N+2: Remove the `repr_rows` fallback in Rust
- Release N+3: Remove the `repr_rows` property entirely
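The fallback chain can also be exercised from the Python side. A sketch of the custom-formatter test case mentioned above (the helper mirrors the intended Rust logic; all names here are hypothetical, not the real test code):

```python
def resolve_max_rows(formatter, default: int = 10) -> int:
    """Mirror of the Rust fallback: max_rows, then repr_rows, then default."""
    for name in ("max_rows", "repr_rows"):
        value = getattr(formatter, name, None)
        if isinstance(value, int):
            return value
    return default


class LegacyFormatter:
    """Custom formatter written before the rename: only repr_rows exists."""
    repr_rows = 25


class NewFormatter:
    max_rows = 40
    repr_rows = 25  # ignored: max_rows takes precedence


assert resolve_max_rows(LegacyFormatter()) == 25
assert resolve_max_rows(NewFormatter()) == 40
assert resolve_max_rows(object()) == 10
```

A pytest of this shape would fail against the current Rust lookup (which never consults `repr_rows`) and pass once the fallback lands.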
---

## Summary of Action Items

### High Priority (breaks backward compatibility if not addressed)

1. **Rust fallback logic** (Comment 9) - Must be fixed to support custom formatters
2. **Documentation default values** (Comments 1, 2) - Current examples show invalid configurations

### Medium Priority (quality & correctness improvements)

3. **Test improvements** (Comments 6, 7, 8) - Use `large_df`, test actual streaming behavior
4. **Naming consistency** (Comments 1, 2) - Decide on `min_rows`/`max_rows` vs `min_rows_display`/`max_rows_display`

### Low Priority (code quality & maintainability)

5. **Docstring improvements** (Comment 3) - Remove duplicate type/default info
6. **Deprecation accessor strategy** (Comments 4, 5) - Document the rationale or remove

### Additional Considerations

- Document the deprecation timeline clearly in the CHANGELOG
- Add a migration guide for users with custom formatters
- Consider adding Rust-level tests for stream termination verification
---

## Questions for Discussion

1. **Naming convention**: Should we standardize on `min_rows`/`max_rows` or keep `min_rows_display`/`max_rows_display`? (My vote: the simpler names)

2. **Deprecation timeline**: How many releases should we support `repr_rows` before removal? (My suggestion: 2-3 releases)

3. **Rust tests**: Should we invest in Rust unit tests for stream termination, or are the Python integration tests sufficient?

4. **Default values**: What should the actual defaults be? Current suggestion: `min_rows_display=10, max_rows=10` or `min_rows_display=5, max_rows=10`?

---
## Implementation Plan

If the above responses look good, I can proceed with implementing the fixes in this order:

1. **Critical fix**: Implement the Rust fallback logic for the `repr_rows` → `max_rows` transition
2. **Documentation**: Fix all examples and naming consistency
3. **Tests**: Switch to `large_df`, add multi-batch test cases
4. **Docstrings**: Clean up redundant type/default information
5. **Polish**: Add code comments explaining the deprecation strategy

Please let me know if you'd like me to proceed with these changes or if you have different preferences for any of the action items above!
