|
| 1 | +# Code Review: DataFrame Display Memory Limit Fix |
| 2 | + |
| 3 | +## Review Date |
| 4 | +February 4, 2026 |
| 5 | + |
| 6 | +## Commits Reviewed |
| 7 | +- `fa9f2573`: Update DataFrameHtmlFormatter to enforce min_rows_display constraint and adjust default values |
| 8 | +- `0563f6ca`: Refactor DataFrame formatter to replace repr_rows with max_rows and update related validations |
| 9 | + |
| 10 | +## Summary |
| 11 | +This PR addresses the issue where the memory size limit was not being respected when displaying large DataFrames. The root cause was a logical inconsistency where `repr_rows` (default: 10) was compared against `min_rows` (default: 20), which made the condition `rows_so_far < min_rows` always evaluate to True, preventing the memory limit from being enforced. The fix involves adjusting defaults, renaming for clarity, and enforcing proper validation constraints. |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +## Strengths ✅ |
| 16 | + |
| 17 | +### 1. **Addresses Root Cause Effectively** |
| 18 | +The fundamental issue is fixed by: |
| 19 | +- Aligning defaults: both `min_rows_display` and `max_rows` now default to 10 |
| 20 | +- Adding validation to ensure `min_rows_display <= max_rows` |
| 21 | +- This prevents the logical trap where minimum always exceeded maximum |
| 22 | + |
| 23 | +### 2. **Thoughtful API Renaming** |
| 24 | +- `repr_rows` → `max_rows` is semantically clearer |
| 25 | +- Better reflects the parameter's actual role (upper bound, not just "representation rows") |
| 26 | +- Improves discoverability and documentation clarity |
| 27 | + |
| 28 | +### 3. **Backward Compatibility Preservation** |
| 29 | +The refactor maintains a deprecation path: |
| 30 | +```python |
| 31 | +if repr_rows is not None and repr_rows != max_rows: |
| 32 | + raise ValueError("Specify only max_rows (repr_rows is deprecated)") |
| 33 | +``` |
| 34 | +- Allows gradual migration for existing code |
| 35 | +- Clear error message guides users to new parameter |
| 36 | + |
| 37 | +### 4. **Comprehensive Testing** |
| 38 | +- Added validation tests for the new constraint |
| 39 | +- Updated existing tests to use new naming |
| 40 | +- Tests cover both direct instantiation and `configure_formatter()` path |
| 41 | +- Memory limit tests present (though see concerns below) |
| 42 | + |
| 43 | +### 5. **Cross-Language Consistency** |
| 44 | +Changes are applied consistently across: |
| 45 | +- Python formatter class (`dataframe_formatter.py`) |
| 46 | +- Rust FFI layer (`dataframe.rs`) |
| 47 | +- Documentation (`rendering.rst`) |
| 48 | + |
| 49 | +### 6. **Clear Documentation Updates** |
| 50 | +- Docstrings updated with new parameter names |
| 51 | +- Constraint relationships documented ("must be <= max_rows") |
| 52 | +- Comments updated in examples |
| 53 | + |
| 54 | +--- |
| 55 | + |
| 56 | +## Issues & Suggestions 📝 |
| 57 | + |
| 58 | +### Severity: MEDIUM |
| 59 | + |
| 60 | +#### 1. **Deprecation Strategy Could Be Stricter** |
| 61 | +**Location**: `dataframe_formatter.py` lines 207-210 |
| 62 | + |
| 63 | +**Current behavior**: |
| 64 | +```python |
| 65 | +if repr_rows is not None and repr_rows != max_rows: |
| 66 | + msg = "Specify only max_rows (repr_rows is deprecated)" |
| 67 | + raise ValueError(msg) |
| 68 | +``` |
| 69 | + |
| 70 | +**Concern**: This allows silently accepting `repr_rows` if it equals `max_rows`. While this maintains compatibility, it could hide usage of deprecated parameter. |
| 71 | + |
| 72 | +**Suggestion**: |
| 73 | +- Consider warning vs error trade-off |
| 74 | +- Document in CHANGELOG that `repr_rows` parameter will be removed in version X.Y |
| 75 | +- Example improved approach: |
| 76 | +```python |
| 77 | +if repr_rows is not None: |
| 78 | + import warnings |
| 79 | + warnings.warn( |
| 80 | + "repr_rows parameter is deprecated, use max_rows instead", |
| 81 | + DeprecationWarning, |
| 82 | + stacklevel=2 |
| 83 | + ) |
| 84 | + if repr_rows != max_rows: |
| 85 | + msg = "Cannot specify both repr_rows and max_rows; use max_rows only" |
| 86 | + raise ValueError(msg) |
| 87 | + max_rows = repr_rows |
| 88 | +``` |
| 89 | + |
| 90 | +--- |
| 91 | + |
| 92 | +#### 2. **Rust FFI Bridge Logic Could Fail Silently** |
| 93 | +**Location**: `src/dataframe.rs` lines 156-161 |
| 94 | + |
| 95 | +**Current code**: |
| 96 | +```rust |
| 97 | +let max_rows = get_attr(formatter, "max_rows", default_config.max_rows); |
| 98 | +let repr_rows = get_attr(formatter, "repr_rows", max_rows); |
| 99 | +let max_rows = if repr_rows != max_rows { |
| 100 | + repr_rows |
| 101 | +} else { |
| 102 | + max_rows |
| 103 | +}; |
| 104 | +``` |
| 105 | + |
| 106 | +**Concerns**: |
| 107 | +1. **Variable shadowing**: `max_rows` is assigned twice, reducing clarity |
| 108 | +2. **Silent override**: If both attributes exist with different values, no error/warning is raised to Rust side |
| 109 | +3. **Inconsistent with Python**: Python raises ValueError when both differ; Rust silently picks `repr_rows` |
| 110 | +4. **Documentation gap**: The `get_attr` fallback behavior isn't explained |
| 111 | + |
| 112 | +**Suggestions**: |
| 113 | +- Extract to a helper function for clarity: |
| 114 | +```rust |
| 115 | +fn resolve_max_rows(formatter: &Bound<'_, PyAny>, default: usize) -> PyResult<usize> { |
| 116 | + let max_rows = get_attr(formatter, "max_rows", default); |
| 117 | + let repr_rows = get_attr(formatter, "repr_rows", default); |
| 118 | + |
| 119 | + if repr_rows != max_rows && repr_rows != default { |
| 120 | + // User provided explicit repr_rows value |
| 121 | + return Ok(repr_rows); |
| 122 | + } |
| 123 | + Ok(max_rows) |
| 124 | +} |
| 125 | +``` |
| 126 | +- Add a validation check in Rust or ensure Python-side validation is sufficient |
| 127 | +- Document the fallback precedence |
| 128 | + |
| 129 | +--- |
| 130 | + |
| 131 | +#### 3. **Collection Condition Logic Warrants Verification** |
| 132 | +**Location**: `src/dataframe.rs` line 1366 |
| 133 | + |
| 134 | +**Current condition**: |
| 135 | +```rust |
| 136 | +while (size_estimate_so_far < max_bytes && rows_so_far < max_rows) |
| 137 | + || rows_so_far < min_rows { |
| 138 | +``` |
| 139 | + |
| 140 | +**Assessment**: Logic appears correct post-fix. The condition ensures: |
| 141 | +1. If memory/row limits not hit AND haven't reached max: continue |
| 142 | +2. OR if haven't reached minimum: continue |
| 143 | +3. This allows minimum rows to override both memory and row limits ✓ |
| 144 | + |
| 145 | +**Minor improvement suggestion**: |
| 146 | +Add an explanatory comment about the semantics: |
| 147 | +```rust |
| 148 | +// Collect rows until EITHER: |
| 149 | +// (a) we hit a limit (memory or max_rows), OR |
| 150 | +// (b) we reach the guaranteed minimum |
| 151 | +// This ensures min_rows always display, even if memory/row limits would prevent it |
| 152 | +while (size_estimate_so_far < max_bytes && rows_so_far < max_rows) |
| 153 | + || rows_so_far < min_rows { |
| 154 | +``` |
| 155 | + |
| 156 | +--- |
| 157 | + |
| 158 | +#### 4. **Edge Case: Memory Calculation Under Investigation** |
| 159 | +**Location**: `src/dataframe.rs` lines 1376-1386 |
| 160 | + |
| 161 | +**Current memory scaling logic**: |
| 162 | +```rust |
| 163 | +if size_estimate_so_far > max_bytes { |
| 164 | + let ratio = max_bytes as f32 / size_estimate_so_far as f32; |
| 165 | + let total_rows = rows_in_rb + rows_so_far; |
| 166 | + let mut reduced_row_num = (total_rows as f32 * ratio).round() as usize; |
| 167 | + if reduced_row_num < min_rows { |
| 168 | + reduced_row_num = min_rows.min(total_rows); |
| 169 | + } |
| 170 | +``` |
| 171 | + |
| 172 | +**Concern**: Floating-point rounding could result in: |
| 173 | +- Very small dataframes: aggressive rounding down, then min_rows override might show fewer rows than requested |
| 174 | +- Very large memory requests: ratio-based calculation might not distribute row reductions fairly across batches |
| 175 | + |
| 176 | +**Suggestions for future improvement** (not blocking): |
| 177 | +1. Add test cases for boundary conditions (DataFrame at exactly max_bytes, just under, just over) |
| 178 | +2. Consider deterministic rounding strategy instead of `f32::round()` |
| 179 | +3. Document expected behavior when memory limit is very close to actual usage |
| 180 | + |
| 181 | +--- |
| 182 | + |
| 183 | +### Severity: LOW (Polish/Documentation) |
| 184 | + |
| 185 | +#### 5. **Backward-Compatibility Alias Could Be Marked @deprecated** |
| 186 | +**Location**: `dataframe_formatter.py` lines 241-242 |
| 187 | + |
| 188 | +```python |
| 189 | +self.max_rows = max_rows |
| 190 | +# Backwards-compatible alias |
| 191 | +self.repr_rows = max_rows |
| 192 | +``` |
| 193 | + |
| 194 | +**Suggestion**: Consider adding property with deprecation warning: |
| 195 | +```python |
| 196 | +@property |
| 197 | +def repr_rows(self) -> int: |
| 198 | + """Deprecated: use max_rows instead.""" |
| 199 | + return self.max_rows |
| 200 | + |
| 201 | +@repr_rows.setter |
| 202 | +def repr_rows(self, value: int) -> None: |
| 203 | + warnings.warn("repr_rows is deprecated, use max_rows", DeprecationWarning) |
| 204 | + self.max_rows = value |
| 205 | +``` |
| 206 | +This would surface deprecation to direct attribute access as well. |
| 207 | + |
| 208 | +--- |
| 209 | + |
| 210 | +#### 6. **Test Coverage Gap: Deprecated repr_rows Path** |
| 211 | +**Location**: `python/tests/test_dataframe.py` |
| 212 | + |
| 213 | +**Observation**: Tests thoroughly cover the new `max_rows` parameter but don't explicitly test that deprecated `repr_rows` still works (even in backward-compatibility mode). |
| 214 | + |
| 215 | +**Suggestion**: Add a test: |
| 216 | +```python |
| 217 | +def test_repr_rows_backward_compatibility(clean_formatter_state): |
| 218 | + """Verify that repr_rows parameter still works as deprecated alias.""" |
| 219 | + # Should work when not conflicting with max_rows |
| 220 | + formatter = DataFrameHtmlFormatter(repr_rows=5) |
| 221 | + assert formatter.max_rows == 5 |
| 222 | + assert formatter.repr_rows == 5 |
| 223 | + |
| 224 | + # Should fail when conflicting |
| 225 | + with pytest.raises(ValueError, match="Specify only max_rows"): |
| 226 | + DataFrameHtmlFormatter(repr_rows=5, max_rows=10) |
| 227 | +``` |
| 228 | + |
| 229 | +--- |
| 230 | + |
| 231 | +#### 7. **Documentation Could Be Clearer on Default Relationship** |
| 232 | +**Location**: `docs/source/user-guide/dataframe/rendering.rst` |
| 233 | + |
| 234 | +The documentation mentions setting `min_rows_display=20, max_rows=20` to get exactly 20 rows, but doesn't explain: |
| 235 | +- What happens if DataFrame has < min_rows_display rows |
| 236 | +- How memory limit interacts with both parameters |
| 237 | +- Precedence order: Which limit triggers first? |
| 238 | + |
| 239 | +**Suggestion**: Add a "Parameters Interaction" section: |
| 240 | +``` |
| 241 | +The display respects these constraints in the following order: |
| 242 | +1. Memory limit (max_memory_bytes) applies first |
| 243 | +2. Row limit (max_rows) applies second |
| 244 | +3. Minimum guarantee (min_rows_display) overrides both if needed |
| 245 | +
|
| 246 | +To display exactly N rows: set min_rows_display=N and max_rows=N with sufficient memory. |
| 247 | +``` |
| 248 | + |
| 249 | +--- |
| 250 | + |
| 251 | +#### 8. **Consider Helper Function for Configuration Validation** |
| 252 | +**Location**: Both `dataframe_formatter.py` and `src/dataframe.rs` |
| 253 | + |
| 254 | +**Observation**: Parameter validation logic is duplicated: |
| 255 | +- Python: checks in `__init__` method |
| 256 | +- Rust: checks in `validate()` method |
| 257 | + |
| 258 | +**Suggestion for future refactoring** (not blocking for this PR): |
| 259 | +Extract to shared validation logic or document the validation contract more explicitly to reduce duplication risk during maintenance. |
| 260 | + |
| 261 | +--- |
| 262 | + |
| 263 | +## Behavior Verification |
| 264 | + |
| 265 | +### ✅ The Bug is Fixed |
| 266 | +**Before**: `min_rows=20 > repr_rows=10` → condition always true → memory limit ignored |
| 267 | +**After**: `min_rows_display=10 <= max_rows=10` → memory limit can trigger → behavior correct |
| 268 | + |
| 269 | +### ✅ Integration Points |
| 270 | +- Python ↔ Rust: FFI properly marshals new parameter names |
| 271 | +- Direct instantiation ↔ `configure_formatter()`: Both paths validated |
| 272 | +- Attribute access: `formatter.repr_rows` maintains backward compatibility |
| 273 | + |
| 274 | +--- |
| 275 | + |
| 276 | +## Decision |
| 277 | + |
| 278 | +### ✅ **APPROVE WITH SUGGESTIONS** |
| 279 | + |
| 280 | +**Reasoning**: |
| 281 | +- The PR solves the core problem (memory limit now respected) |
| 282 | +- Implementation is solid with good test coverage |
| 283 | +- Naming is improved for clarity |
| 284 | +- Backward compatibility is maintained |
| 285 | +- Non-blocking suggestions focus on polish and documentation |
| 286 | + |
| 287 | +**Blocking Issues**: None |
| 288 | + |
| 289 | +**Recommended Follow-ups** (for future work, not this PR): |
| 290 | +1. Add deprecation warnings for `repr_rows` parameter |
| 291 | +2. Improve Rust FFI logic clarity with helper function |
| 292 | +3. Expand documentation on parameter interactions |
| 293 | +4. Add test coverage for deprecated parameter path |
| 294 | +5. Validate floating-point rounding behavior in edge cases |
| 295 | + |
| 296 | +--- |
| 297 | + |
| 298 | +## Inline Comments by File |
| 299 | + |
| 300 | +### `python/datafusion/dataframe_formatter.py` |
| 301 | + |
| 302 | +| Line | Suggestion | |
| 303 | +|------|-----------| |
| 304 | +| 207-210 | Consider adding deprecation warning alongside ValueError | |
| 305 | +| 241-242 | Could use @property decorator with warning for `repr_rows` | |
| 306 | +| 235-242 | Extract validation logic to separate method for testability | |
| 307 | + |
| 308 | +### `src/dataframe.rs` |
| 309 | + |
| 310 | +| Line | Suggestion | |
| 311 | +|------|-----------| |
| 312 | +| 156-161 | Extract max_rows resolution to helper function (reduce shadowing) | |
| 313 | +| 1366 | Add comment explaining min_rows override semantics | |
| 314 | +| 1376-1386 | Consider test for boundary conditions (exact max_bytes size) | |
| 315 | + |
| 316 | +### `python/tests/test_dataframe.py` |
| 317 | + |
| 318 | +| Line | Suggestion | |
| 319 | +|------|-----------| |
| 320 | +| 1467-1475 | Add test for deprecated repr_rows backward compatibility | |
| 321 | +| 1481-1511 | Add test for memory boundary conditions | |
| 322 | + |
| 323 | +### `docs/source/user-guide/dataframe/rendering.rst` |
| 324 | + |
| 325 | +| Line | Suggestion | |
| 326 | +|------|-----------| |
| 327 | +| 58-65 | Add "Parameters Interaction" section explaining precedence | |
| 328 | +| 191+ | Document behavior when DataFrame rows < min_rows_display | |
| 329 | + |
| 330 | +--- |
| 331 | + |
| 332 | +## Questions for Author |
| 333 | + |
| 334 | +1. Was the choice to silently accept matching `repr_rows` values intentional, or should this always raise a deprecation warning? |
| 335 | +2. Is the Rust-side precedence (repr_rows overrides max_rows if both exist) intentional to match Python behavior? |
| 336 | +3. Are there any known issues with floating-point rounding in edge cases with very small or very large DataFrames? |
| 337 | + |
0 commit comments