fix: checkpoint sync follow-head fixes for epbs-devnet-1 #9156

Draft
lodekeeper wants to merge 5 commits into ChainSafe:epbs-devnet-1 from lodekeeper:fix/epbs-devnet1-checkpoint-follow-head-min

lodekeeper commented Apr 2, 2026

Motivation

After checkpoint sync on epbs-devnet-1, Lodestar can fail to start finalized range sync and to follow the head because of two independent client-side bugs:

  1. Peer classification: when the local node is stalled behind the wall clock, a peer with higher finalizedEpoch and higher headSlot can be incorrectly classified as FullySynced instead of Advanced if its head falls within the slot-import tolerance range. This prevents finalized range sync from starting.

  2. Missing parent envelope: during unknown-block processing, PRESTATE_MISSING can occur because the parent block's FULL variant (execution payload envelope) is still absent. The existing sync path did not proactively resolve the missing parent envelope before retrying, leaving the block stuck in the download queue.

Changes

Peer sync classification fix (remoteSyncType.ts, sync.ts)

  • Add currentSlot parameter to getPeerSyncType
  • Before applying the close-in-range FullySynced shortcut when remote.finalizedEpoch > local.finalizedEpoch, check whether the local head is stalled behind the wall clock
  • If local is behind the clock and remote has both higher finalized epoch and higher head slot, classify the peer as Advanced so range sync can begin
  • Add regression test for this case
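
The classification rule described in the bullets above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in remoteSyncType.ts: `classifyPeer`, `ChainStatus`, and the `hasBlock` callback are hypothetical stand-ins, and the equal-finalized-epoch branch is simplified. The stall check uses a strict greater-than comparison, matching the boundary behaviour described in the regression tests.

```typescript
// Hypothetical stand-ins for illustration; not the Lodestar types.
enum PeerSyncType {
  FullySynced = "FullySynced",
  Advanced = "Advanced",
  Behind = "Behind",
}

interface ChainStatus {
  finalizedEpoch: number;
  headSlot: number;
  headRoot: string;
}

function classifyPeer(
  local: ChainStatus,
  remote: ChainStatus,
  currentSlot: number,
  slotImportTolerance: number,
  hasBlock: (root: string) => boolean
): PeerSyncType {
  if (remote.finalizedEpoch < local.finalizedEpoch) {
    return PeerSyncType.Behind;
  }

  // "Close in range": remote head within the tolerance window around the local head.
  const inRange = Math.abs(remote.headSlot - local.headSlot) <= slotImportTolerance;

  if (remote.finalizedEpoch > local.finalizedEpoch) {
    // A remote head we already know takes precedence over the stall check.
    if (hasBlock(remote.headRoot)) return PeerSyncType.FullySynced;

    // Strictly greater-than: exactly at the tolerance boundary is not stalled.
    const localStalled = currentSlot > local.headSlot + slotImportTolerance;
    const remoteAhead = remote.headSlot > local.headSlot;
    if (localStalled && remoteAhead) return PeerSyncType.Advanced;

    return inRange ? PeerSyncType.FullySynced : PeerSyncType.Advanced;
  }

  // Equal finalized epochs (simplified for this sketch).
  const farAhead = remote.headSlot > local.headSlot + slotImportTolerance;
  return farAhead && !hasBlock(remote.headRoot)
    ? PeerSyncType.Advanced
    : PeerSyncType.FullySynced;
}
```

With a local head at slot 320, a tolerance of 32, and a wall-clock slot of 400, a peer whose head is at slot 325 with a higher finalized epoch is classified as Advanced in this sketch, whereas without the stall check it would fall into the close-in-range shortcut.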

Missing parent envelope recovery (unknownBlock.ts)

  • On PRESTATE_MISSING error during block import, check whether the parent block's FULL variant is absent
  • If absent, proactively fetch the parent's execution payload envelope via reqresp before retrying
  • Gate the envelope fetch on explicit FULL absence to avoid unnecessary requests when the parent envelope already exists
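
The gating step above can be illustrated with a small sketch. Everything here is an assumption made for illustration: `ForkChoiceLike`, `PayloadVariant`, `parentFullIsAbsent`, and `recoverParentEnvelope` are hypothetical names, the `fetchEnvelope` callback stands in for whatever performs the reqresp fetch, and the null return for an unknown variant is assumed rather than taken from the Lodestar fork-choice API.

```typescript
// Hypothetical stand-ins for illustration; not the Lodestar types.
type RootHex = string;

enum PayloadVariant {
  PENDING = "PENDING",
  FULL = "FULL",
}

interface ForkChoiceLike {
  // Assumed to return null when the requested variant of the block is unknown.
  getBlockHex(root: RootHex, variant: PayloadVariant): object | null;
}

// True only when the parent's FULL variant is explicitly absent.
function parentFullIsAbsent(forkChoice: ForkChoiceLike, parentRoot: RootHex): boolean {
  return forkChoice.getBlockHex(parentRoot, PayloadVariant.FULL) === null;
}

// Fetch the parent's envelope only if FULL is missing.
// Resolves to true when a fetch was attempted, false when it was skipped.
async function recoverParentEnvelope(
  forkChoice: ForkChoiceLike,
  parentRoot: RootHex,
  fetchEnvelope: (root: RootHex) => Promise<void>
): Promise<boolean> {
  if (!parentFullIsAbsent(forkChoice, parentRoot)) {
    return false; // envelope already imported, no request needed
  }
  await fetchEnvelope(parentRoot);
  return true;
}
```

Checking the FULL variant directly, rather than inferring its absence from the default variant, is what keeps the sketch from issuing a request when the parent envelope has already been imported.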

Evidence & Limitations

Live testing on epbs-devnet-1 showed that the classification fix causes Lodestar to correctly enter finalized range sync instead of staying in the fully-synced path. The misclassification was observed with a non-official peer (self-identifying as erigon/caplin) that connected to the network — the official epbs-devnet-1 network only runs Prysm and Lodestar CL clients.

That same session then hit a separate outgoing beacon_blocks_by_range V2 INVALID_REQUEST (SSZ_SNAPPY_ERROR_UNDER_SSZ_MIN_SIZE), which is not addressed by this PR. The direct-host repro window later became unstable (the same host anchor alternated between behind / peerless / refused states), so this PR does not claim a complete end-to-end live-devnet fix.

This PR is limited to the two client-side logic fixes that are currently best supported by the available evidence. The downstream req/resp interop failure with Caplin remains a separate follow-up investigation.

@lodekeeper lodekeeper requested a review from a team as a code owner April 2, 2026 10:49

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request enhances the block synchronization process by proactively fetching missing parent envelopes for Gloas blocks when encountering unknown parent or missing prestate errors. This change aims to resolve issues where head blocks are gossiped before the node has fully synced the required parent data. A review comment suggests optimizing the envelope resolution logic by limiting the number of peers queried sequentially to avoid potential delays in the block processing pipeline.

Comment thread packages/beacon-node/src/sync/unknownBlock.ts

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6618b4ee51

Comment thread packages/beacon-node/src/sync/unknownBlock.ts
@lodekeeper lodekeeper changed the title from "fix: recover checkpoint sync from missing FULL parent in unknownBlock" to "fix: checkpoint sync follow-head fixes for epbs-devnet-1" Apr 2, 2026
@lodekeeper lodekeeper force-pushed the fix/epbs-devnet1-checkpoint-follow-head-min branch 2 times, most recently from a2747d7 to b841f9f Compare April 2, 2026 22:01
@nflaig nflaig marked this pull request as draft April 3, 2026 14:59
…ock sync

After checkpoint sync + range sync, the head block's execution payload
envelope may be missing (already gossiped before we connected). This
leaves the head in PENDING state without a FULL variant.

When new gossip blocks arrive expecting a FULL parent, they fail with
BLOCK_ERROR_PRESTATE_MISSING / REGEN_ERROR_BLOCK_NOT_IN_FORKCHOICE
and the node gets permanently stuck.

Fix: In the PRESTATE_MISSING error handler, detect when the failure is
caused by a missing FULL variant (using the existing Gloas retry context
check) and proactively fetch the parent's envelope via reqresp before
retrying. This reuses the existing resolveEnvelopeForBlock method which
tries gossip cache first, then falls back to ExecutionPayloadEnvelopesByRoot.

Tested: Local node on epbs-devnet-1 with checkpoint sync - fix triggers
correctly, envelope fetched via reqresp, head advances and tracks chain.

Address review feedback: getGloasInvalidStateRootRetryContext reads the
default (PENDING) variant, so wantsFullParent can be true even when the
FULL variant already exists. Gate resolveEnvelopeForBlock on an explicit
getBlockHex(parentRoot, PayloadStatus.FULL) check.

…head

When the local node is stalled behind the wall clock (e.g. after
checkpoint sync), a peer with higher finalizedEpoch and higher
headSlot could be incorrectly classified as FullySynced if its head
fell within the slot-import tolerance range. This prevented finalized
range sync from starting.

Add a currentSlot parameter to getPeerSyncType and check whether
the local head is stalled behind the clock before applying the
close-in-range FullySynced shortcut. When the local node is behind
the clock and the remote has both higher finalized epoch and higher
head slot, classify the peer as Advanced so range sync can begin.

Observed on epbs-devnet-1 where a non-official peer (self-identifying
as erigon/caplin) connected to the network and was being misclassified
as FullySynced despite having higher finalized epoch. Note: the
official epbs-devnet-1 network only runs Prysm and Lodestar CL
clients; the erigon/caplin peer was a third-party node.

…stic

Add three edge case tests:
1. Known head root still returns FullySynced even when local is stalled
2. Exact tolerance boundary (currentSlot === headSlot + tolerance) stays FullySynced
3. Remote head not actually ahead of local stays FullySynced even when stalled

These prove the hasBlock check takes precedence, the boundary is
strict (greater-than, not greater-or-equal), and that remote must
actually be ahead for the Advanced classification to fire.

Move the pendingBlock.status = downloaded assignment before the
resolveEnvelopeForBlock call so that if the envelope import triggers
executionPayloadAvailable -> triggerUnknownBlockSearch, this block is
already in retryable state instead of stuck in processing.
@lodekeeper lodekeeper force-pushed the fix/epbs-devnet1-checkpoint-follow-head-min branch from 9b5ba50 to ed6018a Compare April 3, 2026 18:10