fix: use OpReplace instead of OpOverwrite in ReplaceDataFiles and ReplaceFiles #867

Open
Bahtya wants to merge 3 commits into apache:main from Bahtya:main

Conversation

Bahtya commented Apr 9, 2026

Summary

Fixes #841 (parent #832)

Changes ReplaceDataFiles, ReplaceDataFilesWithDataFiles, and ReplaceFiles to use OpReplace instead of OpOverwrite when creating snapshot updates.

Problem

Per the Iceberg spec, REPLACE is the correct operation when data content is equivalent but reorganized into different files (e.g., compaction). The three replace methods were unconditionally using OpOverwrite despite a TODO comment acknowledging this was incorrect.
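To make the distinction concrete, here is a minimal sketch of how the snapshot operations differ in what they promise about table contents. The `Operation` type and the `changesTableData` helper are local stand-ins written for illustration; they are not the iceberg-go definitions. The classification follows the snapshot summary descriptions in the Iceberg spec, where `replace` is the one operation that adds and removes files without changing table data.

```go
package main

import "fmt"

// Operation is a local stand-in for the snapshot summary "operation" field.
// It is defined here for illustration and is not the iceberg-go type.
type Operation string

const (
	OpAppend    Operation = "append"
	OpReplace   Operation = "replace"
	OpOverwrite Operation = "overwrite"
	OpDelete    Operation = "delete"
)

// changesTableData (hypothetical helper) reports whether a snapshot with the
// given operation is expected to change the logical contents of the table.
// Only "replace" rewrites files while leaving the table data unchanged.
func changesTableData(op Operation) (bool, error) {
	switch op {
	case OpReplace:
		return false, nil
	case OpAppend, OpOverwrite, OpDelete:
		return true, nil
	default:
		return false, fmt.Errorf("unknown operation %q", op)
	}
}

func main() {
	for _, op := range []Operation{OpAppend, OpReplace, OpOverwrite, OpDelete} {
		changed, err := changesTableData(op)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-9s changes table data: %v\n", op, changed)
	}
}
```

A reader that consumes snapshots incrementally can rely on this distinction, which is why labelling a compaction as an overwrite gives it less information than the spec allows.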

Changes

  • table/transaction.go: Changed OpOverwrite → OpReplace in three locations:
    • ReplaceDataFiles (line ~418)
    • ReplaceDataFilesWithDataFiles (line ~713)
    • ReplaceFiles (line ~826)
  • Removed the TODO comment at ReplaceDataFiles that acknowledged the incorrect operation type
  • table/replace_files_test.go: Updated TestReplaceFiles_DataAndDeleteFiles to assert OpReplace

Testing

All existing tests pass:

=== RUN   TestReplaceFiles_DataAndDeleteFiles
--- PASS
=== RUN   TestReplaceFiles_DelegatesToReplaceDataFilesWhenNoDeleteFiles
--- PASS
=== RUN   TestReplaceFiles_ValidationErrors
--- PASS

…laceFiles

Per the Iceberg spec, REPLACE is the correct operation when data is
reorganized (e.g., compaction) without changing content. ReplaceDataFiles,
ReplaceDataFilesWithDataFiles, and ReplaceFiles all reorganize data files,
so they should use OpReplace rather than OpOverwrite.

Removes the TODO comment that acknowledged this was incorrect.

Fixes apache#841
@Bahtya Bahtya requested a review from zeroshade as a code owner April 9, 2026 18:02
Bahtya and others added 2 commits April 10, 2026 02:15
Remove extra blank line between doc comment and function declaration.
…ests

Update TestReplaceDataFiles and TestReplaceDataFilesWithDataFiles
to expect OpReplace instead of OpOverwrite in snapshot summaries,
matching the production code change.
Contributor

@laskoviymishka laskoviymishka left a comment

Thanks for the contribution!

Looks good to me.

Author

Bahtya commented Apr 11, 2026

Hi team, just wanted to follow up on this PR. It has been reviewed and approved by @laskoviymishka, and all CI checks are passing. I would appreciate it if a maintainer could take a look and merge when ready. Thank you!

Author

Bahtya commented Apr 12, 2026

@zeroshade Friendly ping 👋 @laskoviymishka has already approved this PR. It fixes a bug where OpOverwrite was used instead of OpReplace in ReplaceDataFileActions. Would appreciate your review for merge.

Comment thread table/transaction.go
Comment on lines -423 to +417
- updater := t.updateSnapshot(fs, snapshotProps, OpOverwrite).mergeOverwrite(&commitUUID)
+ updater := t.updateSnapshot(fs, snapshotProps, OpReplace).mergeOverwrite(&commitUUID)
Member

we can only use OpReplace here instead of OpOverwrite if there are no changes to the underlying data. We should probably validate this before we use OpReplace instead of OpOverwrite, right?

Comment thread table/transaction.go
Comment on lines -718 to +712
- updater := t.updateSnapshot(fs, snapshotProps, OpOverwrite).mergeOverwrite(&commitUUID)
+ updater := t.updateSnapshot(fs, snapshotProps, OpReplace).mergeOverwrite(&commitUUID)
Member

same as above, we should validate that there's no actual data changes before using OpReplace, right?

Comment thread table/transaction.go
Comment on lines -831 to +825
- updater := t.updateSnapshot(fs, snapshotProps, OpOverwrite).mergeOverwrite(&commitUUID)
+ updater := t.updateSnapshot(fs, snapshotProps, OpReplace).mergeOverwrite(&commitUUID)
Member

same as above

Author

Bahtya commented Apr 17, 2026

Hi @zeroshade, thanks for the review! I've updated the PR to validate data changes before using OpReplace in all three replace methods:

  1. ReplaceDataFiles: Compares total record count of deleted files vs added files. Uses OpOverwrite when counts differ.
  2. ReplaceDataFilesWithDataFiles: Same validation pattern.
  3. ReplaceFiles: Accounts for removed delete file records: uses OpOverwrite when (deletedCount - removedDeleteCount) != addedCount.

The key insight: OpReplace is now only used when record counts match (a reorganization such as compaction that leaves the logical data unchanged), and OpOverwrite is used when the counts show that the data content has changed. This aligns with the Iceberg spec, where REPLACE is for operations that don't change data content.
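The record-count check described in this comment can be sketched roughly as follows. This is an illustration only, not the code in the PR: `fileStats`, `totalRecords`, and `chooseReplaceOp` are hypothetical names, and the `Operation` constants are local stand-ins for the iceberg-go ones. The single formula covers all three methods, since ReplaceDataFiles and ReplaceDataFilesWithDataFiles are the case where no delete files are removed.

```go
package main

import "fmt"

// Operation is a local stand-in for the snapshot operation type.
type Operation string

const (
	OpReplace   Operation = "replace"
	OpOverwrite Operation = "overwrite"
)

// fileStats is a hypothetical stand-in for a data or delete file's metadata.
// Only the record count matters for this check.
type fileStats struct {
	Path        string
	RecordCount int64
}

// totalRecords sums the record counts of a set of files.
func totalRecords(files []fileStats) int64 {
	var n int64
	for _, f := range files {
		n += f.RecordCount
	}
	return n
}

// chooseReplaceOp (hypothetical) picks OpReplace only when
// (deletedCount - removedDeleteCount) == addedCount, and OpOverwrite otherwise.
// Pass nil for removedDeletes when no delete files are involved.
func chooseReplaceOp(deleted, added, removedDeletes []fileStats) Operation {
	if totalRecords(deleted)-totalRecords(removedDeletes) == totalRecords(added) {
		return OpReplace
	}
	return OpOverwrite
}

func main() {
	// Compaction: two small files rewritten into one file with the same rows.
	deleted := []fileStats{{"a.parquet", 40}, {"b.parquet", 60}}
	added := []fileStats{{"ab.parquet", 100}}
	fmt.Println(chooseReplaceOp(deleted, added, nil))

	// Row counts differ, so the rewrite is treated as a logical overwrite.
	fmt.Println(chooseReplaceOp(deleted, []fileStats{{"c.parquet", 90}}, nil))
}
```

Matching record counts is a heuristic for "no data change": two file sets can hold the same number of rows with different values, so a check like this narrows the cases where OpReplace is used without proving equivalence.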

Development

Successfully merging this pull request may close these issues.

fix: ReplaceDataFiles should use OpReplace instead of OpOverwrite

3 participants