refactor: add ColumnFormat implementation for string columns#19410
refactor: add ColumnFormat implementation for string columns#19410jaykanakiya wants to merge 10 commits intoapache:masterfrom
Conversation
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 2 |
| P2 | 0 |
| P3 | 0 |
| Total | 2 |
This is an automated review by Codex GPT-5
| Integer maxStringLength = columnFormatSpec != null ? columnFormatSpec.getMaxStringLength() : null; | ||
| return new StringDimensionHandler( | ||
| columnName, | ||
| MultiValueHandling.ofDefault(), |
There was a problem hiding this comment.
[P1] Persisted string format drops multi-value handling when rebuilding handlers
getColumnHandler reconstructs the StringDimensionHandler with MultiValueHandling.ofDefault() even when columnFormatSpec contains an explicit multiValueHandling. IndexMergerBase and IncrementalIndex.loadDimensionIterable call getColumnHandler directly, so compacting or loading persisted string columns with ARRAY or SORTED_SET can silently switch back to the default behavior and reorder or deduplicate multi-value strings. Derive the handler's multiValueHandling from columnFormatSpec when present, matching StringDimensionSchema.getDimensionHandler.
| return this; | ||
| } | ||
|
|
||
| if (otherFormat instanceof StringDictionaryEncodedColumnFormat) { |
There was a problem hiding this comment.
[P1] String format merge bypasses type compatibility checks
This delegation happens before CapabilitiesBasedFormat's existing type checks, and StringDictionaryEncodedColumnFormat.merge accepts any CapabilitiesBasedFormat without validating its logical type. If one segment has a string-specific format for a column and another legacy segment reports the same column as LONG, DOUBLE, COMPLEX, etc., the merge now produces a STRING format instead of rejecting the incompatible schemas. Restrict delegation to compatible string capabilities or perform the same logical/element/complex type validation in StringDictionaryEncodedColumnFormat.merge before merging.
Description
Addresses a follow up from #19258
Adds StringDictionaryEncodedColumnFormat that persists StringColumnFormatSpec in DictionaryEncodedColumnPartSerde segment metadata.
Key changed/added classes in this PR
StringDictionaryEncodedColumnFormatDictionaryEncodedColumnPartSerdeStringDimensionIndexerStringDimensionHandlerThis PR has: