# PR Response to Review Comments

## Overview

Thank you @timsaucer for the detailed review! Your comments have identified several important improvements needed to make the deprecation path smoother and the implementation more robust. Below are responses and action plans for each comment.

---

## Documentation: `docs/source/user-guide/dataframe/rendering.rst`

### Comment 1: Lines 60-61 - Inconsistent defaults and naming

**Issue:**
```python
min_rows_display = 20,  # Minimum number of rows to display
repr_rows = 10,         # Number of rows to display in __repr__
```

**Comment:**
> It looks like the default here has `min_rows > max_rows`. Also should we have consistent naming of the two?
> Either `min_rows` and `max_rows` or `min_rows_display` and `max_rows_display`?
> I think the `_display` was differentiating what happens during a `display()` call vs `__repr__`, but I think these values get used during both calls.

**Response & Action Plan:**

You're absolutely right on both points:

1. **Default values issue**: Having `min_rows_display=20` and `max_rows=10` creates an invalid configuration. Both parameters flow through the same rendering logic path, so the minimum cannot meaningfully exceed the maximum.

2. **Naming consistency**: The `_display` suffix is misleading since both parameters apply to all rendering contexts (both `__repr__` and explicit `display()` calls).

**Action items:**
- [ ] Change the documentation defaults to a valid configuration such as `min_rows_display=10, max_rows=10` (or similar, like `min_rows_display=5, max_rows=10`)
- [ ] Consider renaming to either:
  - Option A: `min_rows` and `max_rows` (simpler, clearer)
  - Option B: `min_rows_display` and `max_rows_display` (more verbose but consistent)
- [ ] Update all examples in the documentation to use consistent naming
- [ ] Ensure validation logic enforces `min_rows_display <= max_rows`

**Recommendation:** I lean toward Option A (`min_rows`/`max_rows`) for simplicity unless there's a strong reason to keep the `_display` suffix.
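Whichever naming option wins, the validation action item could be a simple up-front check along these lines (the helper name is illustrative, not an existing datafusion function):

```python
def validate_row_limits(min_rows: int, max_rows: int) -> None:
    """Reject configurations where the minimum exceeds the maximum."""
    if min_rows > max_rows:
        raise ValueError(
            f"min_rows ({min_rows}) must not exceed max_rows ({max_rows})"
        )

validate_row_limits(5, 10)  # valid configuration: passes silently
try:
    validate_row_limits(20, 10)  # the invalid pairing from the docs example
except ValueError as exc:
    print(exc)
```

Raising early keeps an invalid pairing from silently producing confusing rendering behavior later.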

---

### Comment 2: Lines 193-194 - Same naming/default issue

**Issue:**
```python
min_rows_display = 50,  # Always show at least 50 rows
max_rows = 20           # Show 20 rows in __repr__ output
```

**Comment:**
> Same as above, difference between `_display` and without the trailer and also we have here `min_rows > max_rows`.

**Response & Action Plan:**

Same issues as above, in a different example.

**Action items:**
- [ ] Fix the example to use a valid configuration: `min_rows_display=20, max_rows=50` (or similar)
- [ ] Apply consistent naming per the decision in Comment 1
- [ ] Review all other documentation examples for similar issues

---

## Python: `python/datafusion/dataframe_formatter.py`

### Comment 3: Lines 167-169/256 - Duplicate type/default information

**Issue:**
```python
min_rows_display : int, default 10
    Minimum number of rows to display.
repr_rows : int, default 10
    Default number of rows to display in repr output.
```

**Comment:**
> It's not about this PR per se, but maybe this is an opportunity to tighten up the comments here. We're repeating ourselves with the types and defaults. Those are already in the type hints.
> I think it's becoming customary to not duplicate that information and the argument line is the preferred place to keep it. That way we don't have to worry about maintaining the values in two places.

**Response & Action Plan:**

Excellent point! This is indeed Python best practice. Since type hints and default values are already in the signature, the docstring should focus on **what** the parameter does and **why** you'd use it, not repeat information that's already in the code.

**Action items:**
- [ ] Refactor docstrings to follow NumPy/pandas style without type/default duplication:
  ```python
  Parameters
  ----------
  max_cell_length
      Maximum length of cell content before truncation.
  max_width
      Maximum width of the displayed table in pixels.
  max_memory_bytes
      Maximum memory in bytes for rendered data. Helps prevent performance
      issues with large datasets.
  min_rows_display
      Minimum number of rows to display even if the memory limit is reached.
      Must not exceed max_rows.
  max_rows
      Maximum number of rows to display. Takes precedence over memory limits
      when fewer rows are requested.
  ```
- [ ] Apply this pattern consistently across all docstrings in the file
- [ ] Keep detailed explanations of behavior/constraints in the description section

---

### Comment 4 & 5: Lines 335-338, 351-354 - Why add accessors for deprecated property?

**Issue:**
```python
@property
def repr_rows(self) -> int:
    """Get the maximum number of rows (deprecated name)..."""
    return self._max_rows

@repr_rows.setter
def repr_rows(self, value: int) -> None:
    """Set the maximum number of rows using deprecated name..."""
    warnings.warn(...)
    self._max_rows = value
```

**Comment:**
> If `repr_rows` is being deprecated, why add an accessor?
> Same, why add for deprecated?

**Response & Action Plan:**

This is a great question about deprecation strategy. The accessors are being added for **backward compatibility** during the deprecation period.

**Rationale:**
1. **User code may access the property directly**: Code like `formatter.repr_rows = 20` needs to continue working during the deprecation period
2. **Graceful migration path**: Users get a warning, but their code doesn't break
3. **Custom formatter implementations**: External code that inherits from the formatter and accesses `repr_rows` directly will continue to work

However, I see your point: if users are unlikely to access these properties directly (most use `configure_formatter()`), then these accessors may be overkill.

**Questions for consideration:**
- Do we know of any user code that directly accesses `formatter.repr_rows`?
- What's our deprecation timeline? (1 release cycle? 2?)
- Should we follow a "hard break" or a "soft deprecation" approach?

**Proposed approach:**
- [ ] Keep the accessors for at least 1-2 release cycles to allow users to migrate
- [ ] Add prominent deprecation warnings in the release notes
- [ ] Remove them in a future major version bump
- [ ] Alternative: if direct property access is rare, remove the accessors and handle `repr_rows` only in `configure_formatter()`

**Recommendation:** Keep the accessors for now with clear deprecation warnings, and plan removal in the next major version.
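For reference, here is a minimal sketch of the soft-deprecation pattern under discussion (class and message text are assumed for illustration; the real formatter carries more state, and this variant warns on both get and set):

```python
import warnings

class FormatterSketch:
    """Illustrative only: deprecated-alias property delegating to the new name."""

    def __init__(self, max_rows: int = 10) -> None:
        self._max_rows = max_rows

    @property
    def max_rows(self) -> int:
        return self._max_rows

    @max_rows.setter
    def max_rows(self, value: int) -> None:
        self._max_rows = value

    @property
    def repr_rows(self) -> int:
        warnings.warn("repr_rows is deprecated; use max_rows",
                      DeprecationWarning, stacklevel=2)
        return self._max_rows

    @repr_rows.setter
    def repr_rows(self, value: int) -> None:
        warnings.warn("repr_rows is deprecated; use max_rows",
                      DeprecationWarning, stacklevel=2)
        self._max_rows = value

f = FormatterSketch()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    f.repr_rows = 20                # warns, but still updates the value
print(f.max_rows, len(caught))      # prints: 20 1
```

`stacklevel=2` makes the warning point at the caller's line rather than the property body, which is what users need to find their own usage.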

---

## Tests: `python/tests/test_dataframe.py`

### Comment 6: Lines 1460-1462 - Use `large_df` instead of `df`

**Issue:**
```python
def test_html_formatter_memory_boundary_conditions(df, clean_formatter_state):
```

**Comment:**
> Maybe switch to `large_df` instead of `df` here?

**Response & Action Plan:**

Excellent suggestion! The `large_df` fixture (100,000 rows) is much better suited for testing memory boundary conditions than the standard `df` fixture (3 rows).

**Action items:**
- [ ] Change the test signature to use the `large_df` fixture
- [ ] Update the test to work with the larger dataset
- [ ] Adjust assertions to account for the larger data size

---

### Comment 7: Lines 1468-1471 - Adjust `max_rows` for `large_df`

**Issue:**
```python
configure_formatter(max_memory_bytes=10 * MB, min_rows_display=1, max_rows=100)
```

**Comment:**
> If you do switch to `large_df` then I think it may go above the 100 limit you have.

**Response & Action Plan:**

Good catch! With `large_df` containing 100,000 rows, the `max_rows=100` limit would be hit before we get any useful memory-limit testing.

**Action items:**
- [ ] Increase `max_rows` to a value well above what we expect to collect (e.g., `max_rows=100000`)
- [ ] Ensure we're actually testing memory limits, not row limits
- [ ] Add explicit comments about why we set these values
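A rough back-of-envelope check shows why `max_rows` must sit at or above the fixture size for the memory cap to be the binding limit (the per-row byte count below is an assumed estimate, not a measured value):

```python
# Sketch: confirm the memory limit, not max_rows, terminates collection.
MB = 1024 * 1024
max_memory_bytes = 10 * MB
approx_row_bytes = 200        # assumed average encoded row size
total_rows = 100_000          # size of the large_df fixture
max_rows = 100_000            # set at/above the fixture size

rows_until_memory_cap = max_memory_bytes // approx_row_bytes
binding = "memory" if rows_until_memory_cap < min(total_rows, max_rows) else "rows"
print(rows_until_memory_cap, binding)  # prints: 52428 memory
```

With `max_rows=100` the row cap would bind after 100 rows, long before ~52k rows exhaust the 10 MB budget, so the memory path would never be exercised.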

---

### Comment 8: Lines 1473-1476 - Test actual early termination of stream

**Issue:**
```python
# Test 1: Very small memory limit should still respect min_rows
configure_formatter(max_memory_bytes=10, min_rows_display=1)
```

**Comment:**
> I think a better test is one where we *do* hit the memory limit well before we hit the min number of rows, hence the recommendation to switch to `large_df`.
> Actually, maybe we want a different dataframe, one where we know we have multiple batches instead of a single batch. The thing this isn't doing is verifying we've ended the stream early, but I think that would have to be a rust test instead of a pytest.

**Response & Action Plan:**

This is an excellent observation about the limitations of the current test. You're identifying two key testing gaps:

1. **Multi-batch streaming**: The current test may use a single-batch DataFrame, so we never actually exercise stream termination
2. **Rust-level verification**: Python tests can't verify that the stream was closed early rather than all data being collected and then truncated

**Action items:**

**Python test improvements:**
- [ ] Create a test DataFrame that definitely spans multiple record batches
  ```python
  import pyarrow as pa

  # Build a table that is explicitly split across multiple record batches
  batches = [
      pa.record_batch(
          {"a": list(range(10_000)), "b": [f"str_{i}" for i in range(10_000)]}
      )
      for _ in range(10)
  ]
  large_multi_batch_df = ctx.from_arrow(pa.Table.from_batches(batches))
  ```
- [ ] Test with a memory limit that would be exceeded by 2-3 batches but not by 1 batch
- [ ] Verify the behavior: partial data + truncation message + `min_rows` respected

**Rust test considerations:**
- [ ] Add a Rust unit test in `src/dataframe.rs` that:
  - Creates a mock stream with known batch sizes
  - Sets a memory limit that should trigger after N batches
  - Verifies that only N batches were consumed (the stream was closed early)
  - Verifies the `has_more_rows` flag is set correctly

**Example Rust test structure:**
```rust
#[test]
fn test_stream_early_termination() {
    // Create a stream with known batches
    // Set memory limit to stop after 2 batches
    // Verify only 2 batches were consumed
    // Verify has_more_rows = true
}
```

**Note:** The Rust test would be more robust but also more complex. We should discuss whether it's worth the effort versus relying on integration testing.

---

## Rust: `src/dataframe.rs`

### Comment 9: Lines 150-156 - Backward-compatible attribute lookup

**Issue:**
```rust
let max_rows = get_attr(formatter, "max_rows", default_config.max_rows);
```

**Comment:**
> Since users may have provided their own implementation for the formatter, I think we need to first try getting `max_rows`.
> If that fails, try getting `repr_rows`. If that fails, take default.
> When we remove `repr_rows` entirely after it's been deprecated for a few releases, then we can revert to this simpler logic.

**Response & Action Plan:**

This is a **critical point** for backward compatibility! Users with custom formatter implementations won't have a `max_rows` attribute yet, only `repr_rows`. The current code would break their formatters.

**Action items:**
- [ ] Implement fallback logic in Rust (sketch; assumes an `Option`-returning helper rather than the current default-taking `get_attr`):
  ```rust
  // Try the new name first, fall back to the deprecated name, then the default
  let max_rows = get_attr_opt(formatter, "max_rows")
      .or_else(|| get_attr_opt(formatter, "repr_rows"))
      .unwrap_or(default_config.max_rows);
  ```

  Or more explicitly, using PyO3 directly:
  ```rust
  let max_rows = if let Ok(value) = formatter.getattr("max_rows") {
      value.extract::<usize>().unwrap_or(default_config.max_rows)
  } else if let Ok(value) = formatter.getattr("repr_rows") {
      value.extract::<usize>().unwrap_or(default_config.max_rows)
  } else {
      default_config.max_rows
  };
  ```

- [ ] Add a code comment explaining the fallback chain:
  ```rust
  // Try max_rows first (new name), fall back to repr_rows (deprecated),
  // then use the default. This ensures backward compatibility with custom
  // formatter implementations during the deprecation period.
  ```

- [ ] Add a test case for a custom formatter with only a `repr_rows` attribute
- [ ] Document the deprecation timeline for when we can remove the fallback
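The "custom formatter with only `repr_rows`" test case could look like this Python-side sketch; `resolve_max_rows` mirrors the Rust fallback chain and is illustrative, not an existing function:

```python
def resolve_max_rows(formatter, default: int) -> int:
    """Look up max_rows, falling back to the deprecated repr_rows, then default."""
    for name in ("max_rows", "repr_rows"):
        value = getattr(formatter, name, None)
        if value is not None:
            return int(value)
    return default

class LegacyFormatter:
    """Custom formatter predating the rename: only defines repr_rows."""
    repr_rows = 25

class NewFormatter:
    max_rows = 50

class BareFormatter:
    pass

print(resolve_max_rows(NewFormatter(), 10))     # prints: 50 (new name wins)
print(resolve_max_rows(LegacyFormatter(), 10))  # prints: 25 (deprecated fallback)
print(resolve_max_rows(BareFormatter(), 10))    # prints: 10 (default)
```

Asserting all three cases in pytest would lock in the fallback order and catch a regression if the Rust lookup ever reverts to the simpler single-attribute logic too early.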

**Timeline suggestion:**
- Release N: Add `max_rows`, deprecate `repr_rows` with warnings
- Release N+1: Keep both supported
- Release N+2: Remove the `repr_rows` fallback in Rust
- Release N+3: Remove the `repr_rows` property entirely

---

## Summary of Action Items

### High Priority (breaks backward compatibility if not addressed)
1. **Rust fallback logic** (Comment 9) - Must be fixed to support custom formatters
2. **Documentation default values** (Comments 1, 2) - Current examples show invalid configurations

### Medium Priority (quality & correctness improvements)
3. **Test improvements** (Comments 6, 7, 8) - Use `large_df`, test actual streaming behavior
4. **Naming consistency** (Comments 1, 2) - Decide between `min_rows`/`max_rows` and `min_rows_display`/`max_rows_display`

### Low Priority (code quality & maintainability)
5. **Docstring improvements** (Comment 3) - Remove duplicated type/default info
6. **Deprecation accessor strategy** (Comments 4, 5) - Document the rationale or remove

### Additional Considerations
- Document the deprecation timeline clearly in the CHANGELOG
- Add a migration guide for users with custom formatters
- Consider adding Rust-level tests for stream-termination verification

---

## Questions for Discussion

1. **Naming convention**: Should we standardize on `min_rows`/`max_rows` or keep `min_rows_display`/`max_rows_display`? (My vote: the simpler names)

2. **Deprecation timeline**: How many releases should we support `repr_rows` before removal? (My suggestion: 2-3 releases)

3. **Rust tests**: Should we invest in Rust unit tests for stream termination, or are the Python integration tests sufficient?

4. **Default values**: What should the actual defaults be? Current suggestion: `min_rows_display=10, max_rows=10` or `min_rows_display=5, max_rows=10`?

---

## Implementation Plan

If the above responses look good, I can proceed with implementing the fixes in this order:

1. **Critical fix**: Implement the Rust fallback logic for the `repr_rows` → `max_rows` transition
2. **Documentation**: Fix all examples and naming consistency
3. **Tests**: Switch to `large_df`, add multi-batch test cases
4. **Docstrings**: Clean up redundant type/default information
5. **Polish**: Add code comments explaining the deprecation strategy

Please let me know if you'd like me to proceed with these changes or if you have different preferences for any of the action items above!