Skip to content

refactor: add ColumnFormat implementation for string columns#19410

Open
jaykanakiya wants to merge 10 commits intoapache:masterfrom
jaykanakiya:string-column-format-spec
Open

refactor: add ColumnFormat implementation for string columns#19410
jaykanakiya wants to merge 10 commits intoapache:masterfrom
jaykanakiya:string-column-format-spec

Conversation

@jaykanakiya
Copy link
Copy Markdown
Contributor

@jaykanakiya jaykanakiya commented May 5, 2026

Description

Addresses a follow up from #19258

Adds StringDictionaryEncodedColumnFormat that persists StringColumnFormatSpec in DictionaryEncodedColumnPartSerde segment metadata.


Key changed/added classes in this PR
  • StringDictionaryEncodedColumnFormat
  • DictionaryEncodedColumnPartSerde
  • StringDimensionIndexer
  • StringDimensionHandler

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Copy Markdown
Member

@FrankChen021 FrankChen021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity Findings
P0 0
P1 2
P2 0
P3 0
Total 2

This is an automated review by Codex GPT-5

Integer maxStringLength = columnFormatSpec != null ? columnFormatSpec.getMaxStringLength() : null;
return new StringDimensionHandler(
columnName,
MultiValueHandling.ofDefault(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Persisted string format drops multi-value handling when rebuilding handlers

getColumnHandler reconstructs the StringDimensionHandler with MultiValueHandling.ofDefault() even when columnFormatSpec contains an explicit multiValueHandling. IndexMergerBase and IncrementalIndex.loadDimensionIterable call getColumnHandler directly, so compacting or loading persisted string columns with ARRAY or SORTED_SET can silently switch back to the default behavior and reorder or deduplicate multi-value strings. Derive the handler's multiValueHandling from columnFormatSpec when present, matching StringDimensionSchema.getDimensionHandler.

return this;
}

if (otherFormat instanceof StringDictionaryEncodedColumnFormat) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] String format merge bypasses type compatibility checks

This delegation happens before CapabilitiesBasedFormat's existing type checks, and StringDictionaryEncodedColumnFormat.merge accepts any CapabilitiesBasedFormat without validating its logical type. If one segment has a string-specific format for a column and another legacy segment reports the same column as LONG, DOUBLE, COMPLEX, etc., the merge now produces a STRING format instead of rejecting the incompatible schemas. Restrict delegation to compatible string capabilities or perform the same logical/element/complex type validation in StringDictionaryEncodedColumnFormat.merge before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants