fix: checkpoint sync follow-head fixes for epbs-devnet-1 #9156
lodekeeper wants to merge 5 commits into ChainSafe:epbs-devnet-1
Code Review
This pull request enhances the block synchronization process by proactively fetching missing parent envelopes for Gloas blocks when encountering unknown parent or missing prestate errors. This change aims to resolve issues where head blocks are gossiped before the node has fully synced the required parent data. A review comment suggests optimizing the envelope resolution logic by limiting the number of peers queried sequentially to avoid potential delays in the block processing pipeline.
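The reviewer's suggestion to cap the number of peers queried sequentially could look roughly like the sketch below. It is an illustration only: `fetchFromPeersBounded`, the `fetchOne` callback, and the `maxAttempts` parameter are hypothetical names, not part of Lodestar's API.

```typescript
// Hypothetical helper: try at most `maxAttempts` peers, one after another,
// and stop at the first peer that returns a result.
async function fetchFromPeersBounded<T>(
  peers: string[],
  fetchOne: (peer: string) => Promise<T | null>,
  maxAttempts: number
): Promise<T | null> {
  for (const peer of peers.slice(0, maxAttempts)) {
    try {
      const result = await fetchOne(peer);
      if (result !== null) return result;
    } catch {
      // This peer failed; fall through to the next candidate.
    }
  }
  // Give up after the cap so the block-processing pipeline is not held up.
  return null;
}
```

With a cap in place, the worst-case delay is bounded by `maxAttempts` request timeouts rather than by the size of the peer set.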
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6618b4ee51
a2747d7 to b841f9f
…ock sync

After checkpoint sync + range sync, the head block's execution payload envelope may be missing (already gossiped before we connected). This leaves the head in PENDING state without a FULL variant. When new gossip blocks arrive expecting a FULL parent, they fail with BLOCK_ERROR_PRESTATE_MISSING / REGEN_ERROR_BLOCK_NOT_IN_FORKCHOICE and the node gets permanently stuck.

Fix: In the PRESTATE_MISSING error handler, detect when the failure is caused by a missing FULL variant (using the existing Gloas retry context check) and proactively fetch the parent's envelope via reqresp before retrying. This reuses the existing resolveEnvelopeForBlock method, which tries the gossip cache first and then falls back to ExecutionPayloadEnvelopesByRoot.

Tested: Local node on epbs-devnet-1 with checkpoint sync. The fix triggers correctly, the envelope is fetched via reqresp, and the head advances and tracks the chain.
Address review feedback: getGloasInvalidStateRootRetryContext reads the default (PENDING) variant, so wantsFullParent can be true even when the FULL variant already exists. Gate resolveEnvelopeForBlock on an explicit getBlockHex(parentRoot, PayloadStatus.FULL) check.
…head

When the local node is stalled behind the wall clock (e.g. after checkpoint sync), a peer with higher finalizedEpoch and higher headSlot could be incorrectly classified as FullySynced if its head fell within the slot-import tolerance range. This prevented finalized range sync from starting.

Add a currentSlot parameter to getPeerSyncType and check whether the local head is stalled behind the clock before applying the close-in-range FullySynced shortcut. When the local node is behind the clock and the remote has both higher finalized epoch and higher head slot, classify the peer as Advanced so range sync can begin.

Observed on epbs-devnet-1 where a non-official peer (self-identifying as erigon/caplin) connected to the network and was being misclassified as FullySynced despite having higher finalized epoch. Note: the official epbs-devnet-1 network only runs Prysm and Lodestar CL clients; the erigon/caplin peer was a third-party node.
…stic

Add three edge case tests:
1. Known head root still returns FullySynced even when local is stalled
2. Exact tolerance boundary (currentSlot === headSlot + tolerance) stays FullySynced
3. Remote head not actually ahead of local stays FullySynced even when stalled

These prove the hasBlock check takes precedence, the boundary is strict (greater-than, not greater-or-equal), and that remote must actually be ahead for the Advanced classification to fire.
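The classification rule described in the two commit messages above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in `remoteSyncType.ts`: `classifyPeer`, the `ChainStatus` shape, and the `hasBlock` callback are hypothetical stand-ins, and the tolerance is passed in by the caller rather than read from a real constant.

```typescript
// Hypothetical, simplified types for illustration; Lodestar's real types differ.
enum PeerSyncType {
  FullySynced = "FullySynced",
  Advanced = "Advanced",
  Behind = "Behind",
}

interface ChainStatus {
  finalizedEpoch: number;
  headSlot: number;
  headRoot: string;
}

function classifyPeer(
  local: ChainStatus,
  remote: ChainStatus,
  currentSlot: number,
  slotImportTolerance: number,
  hasBlock: (root: string) => boolean
): PeerSyncType {
  if (remote.finalizedEpoch < local.finalizedEpoch) return PeerSyncType.Behind;

  if (remote.finalizedEpoch > local.finalizedEpoch) {
    // A known head root takes precedence over every other check.
    if (hasBlock(remote.headRoot)) return PeerSyncType.FullySynced;

    const remoteHeadInRange = Math.abs(remote.headSlot - local.headSlot) <= slotImportTolerance;
    // Strictly greater-than: the exact boundary does not count as stalled.
    const localStalled = currentSlot > local.headSlot + slotImportTolerance;
    const remoteAhead = remote.headSlot > local.headSlot;

    // Close-in-range shortcut, skipped when we are stalled and the peer is ahead.
    if (remoteHeadInRange && !(localStalled && remoteAhead)) return PeerSyncType.FullySynced;
    return PeerSyncType.Advanced;
  }

  // Same finalized epoch: only a far-ahead, unknown head counts as Advanced.
  if (remote.headSlot > local.headSlot + slotImportTolerance && !hasBlock(remote.headRoot)) {
    return PeerSyncType.Advanced;
  }
  return PeerSyncType.FullySynced;
}
```

In this sketch, with a local head at slot 320 and a tolerance of 32, a peer one finalized epoch ahead with head slot 330 stays `FullySynced` while `currentSlot` is 352 or lower and becomes `Advanced` from slot 353 onward, provided its head root is unknown locally.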
Move the pendingBlock.status = downloaded assignment before the resolveEnvelopeForBlock call so that if the envelope import triggers executionPayloadAvailable -> triggerUnknownBlockSearch, this block is already in retryable state instead of stuck in processing.
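The recovery flow described across these commits (check for a missing FULL parent, mark the block retryable, then fetch the envelope) might be sketched as below. All names here are hypothetical stand-ins: `hasFullVariant` plays the role of the explicit FULL-variant lookup and `resolveEnvelope` the role of `resolveEnvelopeForBlock`; the real signatures in `unknownBlock.ts` are not shown in this PR text and will differ.

```typescript
type PendingStatus = "processing" | "downloaded";

interface PendingBlock {
  parentRoot: string;
  status: PendingStatus;
}

// Hypothetical dependencies standing in for fork choice and the envelope resolver.
interface RecoveryDeps {
  hasFullVariant(parentRoot: string): boolean;
  resolveEnvelope(parentRoot: string): Promise<boolean>;
}

// Returns true when a parent envelope fetch was attempted and succeeded.
async function recoverFromPrestateMissing(
  block: PendingBlock,
  wantsFullParent: boolean,
  deps: RecoveryDeps
): Promise<boolean> {
  // Mark the block retryable BEFORE any await, so that a search triggered
  // by the envelope import does not find it stuck in "processing".
  block.status = "downloaded";

  // Only fetch when a FULL parent is expected and the FULL variant is really absent.
  if (!wantsFullParent || deps.hasFullVariant(block.parentRoot)) return false;

  return deps.resolveEnvelope(block.parentRoot);
}
```

The ordering is the point of the last commit: because the status assignment precedes the `await`, any re-entrant retry started during envelope import sees the block as `downloaded`.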
9b5ba50 to ed6018a
Motivation
After checkpoint sync on epbs-devnet-1, Lodestar can fail to start finalized range sync and follow head due to two independent client-side bugs:
- Peer classification: when the local node is stalled behind the wall clock, a peer with higher `finalizedEpoch` and higher `headSlot` can be incorrectly classified as `FullySynced` instead of `Advanced` if its head falls within the slot-import tolerance range. This prevents finalized range sync from starting.
- Missing parent envelope: during unknown-block processing, `PRESTATE_MISSING` can occur because the parent block's FULL variant (execution payload envelope) is still absent. The existing sync path did not proactively resolve the missing parent envelope before retrying, leaving the block stuck in the download queue.

Changes
Peer sync classification fix (`remoteSyncType.ts`, `sync.ts`)
- Add a `currentSlot` parameter to `getPeerSyncType`
- Before applying the close-in-range `FullySynced` shortcut when `remote.finalizedEpoch > local.finalizedEpoch`, check whether the local head is stalled behind the wall clock
- If it is, classify the peer as `Advanced` so range sync can begin

Missing parent envelope recovery (`unknownBlock.ts`)
- On a `PRESTATE_MISSING` error during block import, check whether the parent block's FULL variant is absent

Evidence & Limitations
Live testing on epbs-devnet-1 showed that the classification fix causes Lodestar to correctly enter finalized range sync instead of staying in the fully-synced path. The misclassification was observed with a non-official peer (self-identifying as erigon/caplin) that connected to the network; the official epbs-devnet-1 network only runs Prysm and Lodestar CL clients.
That same session then hit a separate outgoing `beacon_blocks_by_range` V2 `INVALID_REQUEST` (`SSZ_SNAPPY_ERROR_UNDER_SSZ_MIN_SIZE`), which is not addressed by this PR. The direct-host repro window later became unstable (the same host anchor alternated between behind / peerless / refused states), so this PR does not claim a complete end-to-end live-devnet fix.

This PR is limited to the two client-side logic fixes that are currently best supported by the available evidence. The downstream req/resp interop failure with Caplin remains a separate follow-up investigation.