Skip to content

Commit 1eee492

Browse files
committed
UNPICK review
1 parent 6d7622e commit 1eee492

1 file changed

Lines changed: 337 additions & 0 deletions

File tree

PR_REVIEW.md

Lines changed: 337 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,337 @@
1+
# Code Review: DataFrame Display Memory Limit Fix
2+
3+
## Review Date
4+
February 4, 2026
5+
6+
## Commits Reviewed
7+
- `fa9f2573`: Update DataFrameHtmlFormatter to enforce min_rows_display constraint and adjust default values
8+
- `0563f6ca`: Refactor DataFrame formatter to replace repr_rows with max_rows and update related validations
9+
10+
## Summary
11+
This PR addresses the issue where the memory size limit was not being respected when displaying large DataFrames. The root cause was a logical inconsistency where `repr_rows` (default: 10) was compared against `min_rows` (default: 20), which made the condition `rows_so_far < min_rows` always evaluate to True, preventing the memory limit from being enforced. The fix involves adjusting defaults, renaming for clarity, and enforcing proper validation constraints.
12+
13+
---
14+
15+
## Strengths ✅
16+
17+
### 1. **Addresses Root Cause Effectively**
18+
The fundamental issue is fixed by:
19+
- Aligning defaults: both `min_rows_display` and `max_rows` now default to 10
20+
- Adding validation to ensure `min_rows_display <= max_rows`
21+
- This prevents the logical trap where minimum always exceeded maximum
22+
23+
### 2. **Thoughtful API Renaming**
24+
- `repr_rows``max_rows` is semantically clearer
25+
- Better reflects the parameter's actual role (upper bound, not just "representation rows")
26+
- Improves discoverability and documentation clarity
27+
28+
### 3. **Backward Compatibility Preservation**
29+
The refactor maintains a deprecation path:
30+
```python
31+
if repr_rows is not None and repr_rows != max_rows:
32+
raise ValueError("Specify only max_rows (repr_rows is deprecated)")
33+
```
34+
- Allows gradual migration for existing code
35+
- Clear error message guides users to new parameter
36+
37+
### 4. **Comprehensive Testing**
38+
- Added validation tests for the new constraint
39+
- Updated existing tests to use new naming
40+
- Tests cover both direct instantiation and `configure_formatter()` path
41+
- Memory limit tests present (though see concerns below)
42+
43+
### 5. **Cross-Language Consistency**
44+
Changes are applied consistently across:
45+
- Python formatter class (`dataframe_formatter.py`)
46+
- Rust FFI layer (`dataframe.rs`)
47+
- Documentation (`rendering.rst`)
48+
49+
### 6. **Clear Documentation Updates**
50+
- Docstrings updated with new parameter names
51+
- Constraint relationships documented ("must be <= max_rows")
52+
- Comments updated in examples
53+
54+
---
55+
56+
## Issues & Suggestions 📝
57+
58+
### Severity: MEDIUM
59+
60+
#### 1. **Deprecation Strategy Could Be Stricter**
61+
**Location**: `dataframe_formatter.py` lines 207-210
62+
63+
**Current behavior**:
64+
```python
65+
if repr_rows is not None and repr_rows != max_rows:
66+
msg = "Specify only max_rows (repr_rows is deprecated)"
67+
raise ValueError(msg)
68+
```
69+
70+
**Concern**: This allows silently accepting `repr_rows` if it equals `max_rows`. While this maintains compatibility, it could hide usage of deprecated parameter.
71+
72+
**Suggestion**:
73+
- Consider warning vs error trade-off
74+
- Document in CHANGELOG that `repr_rows` parameter will be removed in version X.Y
75+
- Example improved approach:
76+
```python
77+
if repr_rows is not None:
78+
import warnings
79+
warnings.warn(
80+
"repr_rows parameter is deprecated, use max_rows instead",
81+
DeprecationWarning,
82+
stacklevel=2
83+
)
84+
if repr_rows != max_rows:
85+
msg = "Cannot specify both repr_rows and max_rows; use max_rows only"
86+
raise ValueError(msg)
87+
max_rows = repr_rows
88+
```
89+
90+
---
91+
92+
#### 2. **Rust FFI Bridge Logic Could Fail Silently**
93+
**Location**: `src/dataframe.rs` lines 156-161
94+
95+
**Current code**:
96+
```rust
97+
let max_rows = get_attr(formatter, "max_rows", default_config.max_rows);
98+
let repr_rows = get_attr(formatter, "repr_rows", max_rows);
99+
let max_rows = if repr_rows != max_rows {
100+
repr_rows
101+
} else {
102+
max_rows
103+
};
104+
```
105+
106+
**Concerns**:
107+
1. **Variable shadowing**: `max_rows` is assigned twice, reducing clarity
108+
2. **Silent override**: If both attributes exist with different values, no error/warning is raised to Rust side
109+
3. **Inconsistent with Python**: Python raises ValueError when both differ; Rust silently picks `repr_rows`
110+
4. **Documentation gap**: The `get_attr` fallback behavior isn't explained
111+
112+
**Suggestions**:
113+
- Extract to a helper function for clarity:
114+
```rust
115+
fn resolve_max_rows(formatter: &Bound<'_, PyAny>, default: usize) -> PyResult<usize> {
116+
let max_rows = get_attr(formatter, "max_rows", default);
117+
let repr_rows = get_attr(formatter, "repr_rows", default);
118+
119+
if repr_rows != max_rows && repr_rows != default {
120+
// User provided explicit repr_rows value
121+
return Ok(repr_rows);
122+
}
123+
Ok(max_rows)
124+
}
125+
```
126+
- Add a validation check in Rust or ensure Python-side validation is sufficient
127+
- Document the fallback precedence
128+
129+
---
130+
131+
#### 3. **Collection Condition Logic Warrants Verification**
132+
**Location**: `src/dataframe.rs` line 1366
133+
134+
**Current condition**:
135+
```rust
136+
while (size_estimate_so_far < max_bytes && rows_so_far < max_rows)
137+
|| rows_so_far < min_rows {
138+
```
139+
140+
**Assessment**: Logic appears correct post-fix. The condition ensures:
141+
1. If memory/row limits not hit AND haven't reached max: continue
142+
2. OR if haven't reached minimum: continue
143+
3. This allows minimum rows to override both memory and row limits ✓
144+
145+
**Minor improvement suggestion**:
146+
Add an explanatory comment about the semantics:
147+
```rust
148+
// Collect rows until EITHER:
149+
// (a) we hit a limit (memory or max_rows), OR
150+
// (b) we reach the guaranteed minimum
151+
// This ensures min_rows always display, even if memory/row limits would prevent it
152+
while (size_estimate_so_far < max_bytes && rows_so_far < max_rows)
153+
|| rows_so_far < min_rows {
154+
```
155+
156+
---
157+
158+
#### 4. **Edge Case: Memory Calculation Under Investigation**
159+
**Location**: `src/dataframe.rs` lines 1376-1386
160+
161+
**Current memory scaling logic**:
162+
```rust
163+
if size_estimate_so_far > max_bytes {
164+
let ratio = max_bytes as f32 / size_estimate_so_far as f32;
165+
let total_rows = rows_in_rb + rows_so_far;
166+
let mut reduced_row_num = (total_rows as f32 * ratio).round() as usize;
167+
if reduced_row_num < min_rows {
168+
reduced_row_num = min_rows.min(total_rows);
169+
}
170+
```
171+
172+
**Concern**: Floating-point rounding could result in:
173+
- Very small dataframes: aggressive rounding down, then min_rows override might show fewer rows than requested
174+
- Very large memory requests: ratio-based calculation might not distribute row reductions fairly across batches
175+
176+
**Suggestions for future improvement** (not blocking):
177+
1. Add test cases for boundary conditions (DataFrame at exactly max_bytes, just under, just over)
178+
2. Consider deterministic rounding strategy instead of `f32::round()`
179+
3. Document expected behavior when memory limit is very close to actual usage
180+
181+
---
182+
183+
### Severity: LOW (Polish/Documentation)
184+
185+
#### 5. **Backward-Compatibility Alias Could Be Marked @deprecated**
186+
**Location**: `dataframe_formatter.py` lines 241-242
187+
188+
```python
189+
self.max_rows = max_rows
190+
# Backwards-compatible alias
191+
self.repr_rows = max_rows
192+
```
193+
194+
**Suggestion**: Consider adding property with deprecation warning:
195+
```python
196+
@property
197+
def repr_rows(self) -> int:
198+
"""Deprecated: use max_rows instead."""
199+
return self.max_rows
200+
201+
@repr_rows.setter
202+
def repr_rows(self, value: int) -> None:
203+
warnings.warn("repr_rows is deprecated, use max_rows", DeprecationWarning)
204+
self.max_rows = value
205+
```
206+
This would surface deprecation to direct attribute access as well.
207+
208+
---
209+
210+
#### 6. **Test Coverage Gap: Deprecated repr_rows Path**
211+
**Location**: `python/tests/test_dataframe.py`
212+
213+
**Observation**: Tests thoroughly cover the new `max_rows` parameter but don't explicitly test that deprecated `repr_rows` still works (even in backward-compatibility mode).
214+
215+
**Suggestion**: Add a test:
216+
```python
217+
def test_repr_rows_backward_compatibility(clean_formatter_state):
218+
"""Verify that repr_rows parameter still works as deprecated alias."""
219+
# Should work when not conflicting with max_rows
220+
formatter = DataFrameHtmlFormatter(repr_rows=5)
221+
assert formatter.max_rows == 5
222+
assert formatter.repr_rows == 5
223+
224+
# Should fail when conflicting
225+
with pytest.raises(ValueError, match="Specify only max_rows"):
226+
DataFrameHtmlFormatter(repr_rows=5, max_rows=10)
227+
```
228+
229+
---
230+
231+
#### 7. **Documentation Could Be Clearer on Default Relationship**
232+
**Location**: `docs/source/user-guide/dataframe/rendering.rst`
233+
234+
The documentation mentions setting `min_rows_display=20, max_rows=20` to get exactly 20 rows, but doesn't explain:
235+
- What happens if DataFrame has < min_rows_display rows
236+
- How memory limit interacts with both parameters
237+
- Precedence order: Which limit triggers first?
238+
239+
**Suggestion**: Add a "Parameters Interaction" section:
240+
```
241+
The display respects these constraints in the following order:
242+
1. Memory limit (max_memory_bytes) applies first
243+
2. Row limit (max_rows) applies second
244+
3. Minimum guarantee (min_rows_display) overrides both if needed
245+
246+
To display exactly N rows: set min_rows_display=N and max_rows=N with sufficient memory.
247+
```
248+
249+
---
250+
251+
#### 8. **Consider Helper Function for Configuration Validation**
252+
**Location**: Both `dataframe_formatter.py` and `src/dataframe.rs`
253+
254+
**Observation**: Parameter validation logic is duplicated:
255+
- Python: checks in `__init__` method
256+
- Rust: checks in `validate()` method
257+
258+
**Suggestion for future refactoring** (not blocking for this PR):
259+
Extract to shared validation logic or document the validation contract more explicitly to reduce duplication risk during maintenance.
260+
261+
---
262+
263+
## Behavior Verification
264+
265+
### ✅ The Bug is Fixed
266+
**Before**: `min_rows=20 > repr_rows=10` → condition always true → memory limit ignored
267+
**After**: `min_rows_display=10 <= max_rows=10` → memory limit can trigger → behavior correct
268+
269+
### ✅ Integration Points
270+
- Python ↔ Rust: FFI properly marshals new parameter names
271+
- Direct instantiation ↔ `configure_formatter()`: Both paths validated
272+
- Attribute access: `formatter.repr_rows` maintains backward compatibility
273+
274+
---
275+
276+
## Decision
277+
278+
### **APPROVE WITH SUGGESTIONS**
279+
280+
**Reasoning**:
281+
- The PR solves the core problem (memory limit now respected)
282+
- Implementation is solid with good test coverage
283+
- Naming is improved for clarity
284+
- Backward compatibility is maintained
285+
- Non-blocking suggestions focus on polish and documentation
286+
287+
**Blocking Issues**: None
288+
289+
**Recommended Follow-ups** (for future work, not this PR):
290+
1. Add deprecation warnings for `repr_rows` parameter
291+
2. Improve Rust FFI logic clarity with helper function
292+
3. Expand documentation on parameter interactions
293+
4. Add test coverage for deprecated parameter path
294+
5. Validate floating-point rounding behavior in edge cases
295+
296+
---
297+
298+
## Inline Comments by File
299+
300+
### `python/datafusion/dataframe_formatter.py`
301+
302+
| Line | Suggestion |
303+
|------|-----------|
304+
| 207-210 | Consider adding deprecation warning alongside ValueError |
305+
| 241-242 | Could use @property decorator with warning for `repr_rows` |
306+
| 235-242 | Extract validation logic to separate method for testability |
307+
308+
### `src/dataframe.rs`
309+
310+
| Line | Suggestion |
311+
|------|-----------|
312+
| 156-161 | Extract max_rows resolution to helper function (reduce shadowing) |
313+
| 1366 | Add comment explaining min_rows override semantics |
314+
| 1376-1386 | Consider test for boundary conditions (exact max_bytes size) |
315+
316+
### `python/tests/test_dataframe.py`
317+
318+
| Line | Suggestion |
319+
|------|-----------|
320+
| 1467-1475 | Add test for deprecated repr_rows backward compatibility |
321+
| 1481-1511 | Add test for memory boundary conditions |
322+
323+
### `docs/source/user-guide/dataframe/rendering.rst`
324+
325+
| Line | Suggestion |
326+
|------|-----------|
327+
| 58-65 | Add "Parameters Interaction" section explaining precedence |
328+
| 191+ | Document behavior when DataFrame rows < min_rows_display |
329+
330+
---
331+
332+
## Questions for Author
333+
334+
1. Was the choice to silently accept matching `repr_rows` values intentional, or should this always raise a deprecation warning?
335+
2. Is the Rust-side precedence (repr_rows overrides max_rows if both exist) intentional to match Python behavior?
336+
3. Are there any known issues with floating-point rounding in edge cases with very small or very large DataFrames?
337+

0 commit comments

Comments
 (0)