fix: checkpoint sync follow-head fixes for epbs-devnet-1 by lodekeeper · Pull Request #9156 · ChainSafe/lodestar

lodekeeper · 2026-04-02T10:49:30Z

Motivation

After checkpoint sync on epbs-devnet-1, Lodestar can fail to start finalized range sync and follow head due to two independent client-side bugs:

Peer classification: when the local node is stalled behind the wall clock, a peer with higher finalizedEpoch and higher headSlot can be incorrectly classified as FullySynced instead of Advanced if its head falls within the slot-import tolerance range. This prevents finalized range sync from starting.
Missing parent envelope: during unknown-block processing, PRESTATE_MISSING can occur because the parent block's FULL variant (execution payload envelope) is still absent. The existing sync path did not proactively resolve the missing parent envelope before retrying, leaving the block stuck in the download queue.

Changes

Peer sync classification fix (`remoteSyncType.ts`, `sync.ts`)

Add currentSlot parameter to getPeerSyncType
Before applying the close-in-range FullySynced shortcut when remote.finalizedEpoch > local.finalizedEpoch, check whether the local head is stalled behind the wall clock
If local is behind the clock and remote has both higher finalized epoch and higher head slot, classify the peer as Advanced so range sync can begin
Add regression test for this case
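The classification rule described above can be sketched as follows. This is a simplified, hypothetical model of only the `remote.finalizedEpoch > local.finalizedEpoch` branch. The `SyncStatus` shape, the `hasBlock` callback, the function name `classifyPeerWithHigherFinalized`, and the tolerance of 32 slots in the example are illustrative assumptions, not Lodestar's actual `getPeerSyncType` signature or constants. The close-in-range shortcut is approximated, so the exact conditions in the real code may differ. The strict `>` in the stall check mirrors the boundary behavior described in the edge-case test commit.

```typescript
// Hypothetical, simplified model of peer classification; not Lodestar's actual types.
type PeerSyncType = "FullySynced" | "Advanced";

interface SyncStatus {
  finalizedEpoch: number;
  headSlot: number;
  headRoot: string;
}

/**
 * Sketch of the branch where the remote peer has a higher finalized epoch than us.
 * Assumes the caller already checked remote.finalizedEpoch > local.finalizedEpoch.
 */
function classifyPeerWithHigherFinalized(
  local: SyncStatus,
  remote: SyncStatus,
  hasBlock: (root: string) => boolean,
  slotImportTolerance: number,
  currentSlot: number
): PeerSyncType {
  // A remote head we already know about is treated as synced (takes precedence).
  if (hasBlock(remote.headRoot)) return "FullySynced";

  // Local is "stalled" only when the wall clock is strictly past head + tolerance.
  const localStalled = currentSlot > local.headSlot + slotImportTolerance;
  if (localStalled && remote.headSlot > local.headSlot) return "Advanced";

  // Approximation of the close-in-range shortcut: next finalized epoch, head within tolerance.
  const headWithinTolerance = Math.abs(remote.headSlot - local.headSlot) <= slotImportTolerance;
  if (remote.finalizedEpoch === local.finalizedEpoch + 1 && headWithinTolerance) {
    return "FullySynced";
  }
  return "Advanced";
}

// Example: local head at slot 352, finalized epoch 10; tolerance of 32 slots is assumed.
const local: SyncStatus = { finalizedEpoch: 10, headSlot: 352, headRoot: "0xlocal" };
const remote: SyncStatus = { finalizedEpoch: 11, headSlot: 360, headRoot: "0xremote" };
const noneKnown = (_root: string): boolean => false;

console.log(classifyPeerWithHigherFinalized(local, remote, noneKnown, 32, 1000)); // prints Advanced (stalled)
console.log(classifyPeerWithHigherFinalized(local, remote, noneKnown, 32, 384)); // prints FullySynced (boundary)
```

Without the `currentSlot` argument, both calls would take the close-in-range shortcut and return `FullySynced`, which is the misclassification the PR describes.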

Missing parent envelope recovery (`unknownBlock.ts`)

On PRESTATE_MISSING error during block import, check whether the parent block's FULL variant is absent
If absent, proactively fetch the parent's execution payload envelope via reqresp before retrying
Gate the envelope fetch on explicit FULL absence to avoid unnecessary requests when the parent envelope already exists
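A minimal sketch of this recovery step, assuming hypothetical names throughout: `PendingBlock`, `onPrestateMissing`, `hasFullParent` (a stand-in for the explicit FULL-variant fork-choice lookup), and `resolveEnvelope` (a stand-in for `resolveEnvelopeForBlock`) are illustrative, not Lodestar's real interfaces. It shows the two ordering and gating decisions from the commits: the block is made retryable before awaiting the fetch, and the fetch only happens when the FULL variant is explicitly absent.

```typescript
// Hypothetical, simplified model of the PRESTATE_MISSING recovery path; names are illustrative.
type PendingBlockStatus = "pending" | "fetching" | "downloaded" | "processing";

interface PendingBlock {
  blockRoot: string;
  parentRoot: string;
  status: PendingBlockStatus;
}

interface EnvelopeRecoveryDeps {
  // Stand-in for an explicit fork-choice lookup of the parent's FULL variant.
  hasFullParent(parentRoot: string): boolean;
  // Stand-in for resolveEnvelopeForBlock: resolves true if the envelope was obtained and imported.
  resolveEnvelope(parentRoot: string): Promise<boolean>;
}

type RecoveryOutcome = "skipped" | "fetched" | "failed";

async function onPrestateMissing(block: PendingBlock, deps: EnvelopeRecoveryDeps): Promise<RecoveryOutcome> {
  // Put the block back into a retryable state *before* awaiting, so that an envelope import
  // which re-triggers the unknown-block search does not find it stuck in "processing".
  block.status = "downloaded";

  // Only fetch when the FULL variant is explicitly absent; avoids redundant requests.
  if (deps.hasFullParent(block.parentRoot)) return "skipped";

  const imported = await deps.resolveEnvelope(block.parentRoot);
  return imported ? "fetched" : "failed";
}

// Usage: record the block status observed while the (fake) envelope fetch is running.
const block: PendingBlock = { blockRoot: "0xchild", parentRoot: "0xparent", status: "processing" };
const statusSeenDuringFetch: PendingBlockStatus[] = [];
const deps: EnvelopeRecoveryDeps = {
  hasFullParent: () => false,
  resolveEnvelope: async () => {
    statusSeenDuringFetch.push(block.status);
    return true;
  },
};
onPrestateMissing(block, deps).then((outcome) => console.log(outcome, statusSeenDuringFetch));
```

Setting the status first matters because the envelope import can re-enter the unknown-block search while the fetch is still in flight.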

Evidence & Limitations

Live testing on epbs-devnet-1 showed that the classification fix causes Lodestar to correctly enter finalized range sync instead of staying in the fully-synced path. The misclassification was observed with a non-official peer (self-identifying as erigon/caplin) that connected to the network; the official epbs-devnet-1 network only runs Prysm and Lodestar CL clients.

That same session then hit a separate outgoing beacon_blocks_by_range V2 INVALID_REQUEST (SSZ_SNAPPY_ERROR_UNDER_SSZ_MIN_SIZE), which is not addressed by this PR. The direct-host repro window later became unstable (the same host anchor alternated between behind, peerless, and refused states), so this PR does not claim a complete end-to-end live-devnet fix.

This PR is limited to the two client-side logic fixes that are currently best supported by the available evidence. The downstream req/resp interop failure with Caplin remains a separate follow-up investigation.

gemini-code-assist

Code Review

This pull request enhances the block synchronization process by proactively fetching missing parent envelopes for Gloas blocks when encountering unknown parent or missing prestate errors. This change aims to resolve issues where head blocks are gossiped before the node has fully synced the required parent data. A review comment suggests optimizing the envelope resolution logic by limiting the number of peers queried sequentially to avoid potential delays in the block processing pipeline.
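The review's suggestion to bound how many peers are queried sequentially could look roughly like this. The function name `fetchEnvelopeFromPeers`, the `fetchFromPeer` callback (a stand-in for an ExecutionPayloadEnvelopesByRoot request), and the `maxPeers` cap are hypothetical; the review summary does not specify a limit or an API.

```typescript
// Hypothetical sketch: query at most `maxPeers` peers, one after another, for a parent envelope.
// `fetchFromPeer` stands in for a by-root request to a single peer; it is an assumption.
async function fetchEnvelopeFromPeers<T>(
  peers: readonly string[],
  fetchFromPeer: (peerId: string) => Promise<T | null>,
  maxPeers: number
): Promise<{ envelope: T | null; peersTried: number }> {
  let peersTried = 0;
  for (const peerId of peers.slice(0, maxPeers)) {
    peersTried++;
    try {
      const envelope = await fetchFromPeer(peerId);
      if (envelope !== null) return { envelope, peersTried };
    } catch {
      // Treat a failed request like an empty response and move on to the next peer.
    }
  }
  // Give up after the cap so the block-processing pipeline is not held up indefinitely.
  return { envelope: null, peersTried };
}

// Usage: only peerE has the envelope, but the cap of 3 stops the search before reaching it.
const peers = ["peerA", "peerB", "peerC", "peerD", "peerE"];
fetchEnvelopeFromPeers(peers, async (id) => (id === "peerE" ? "envelope" : null), 3).then((r) => console.log(r));
```

The trade-off is latency versus coverage: a small cap keeps a slow or unresponsive peer set from stalling the pipeline, at the cost of sometimes giving up before finding a peer that has the envelope.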

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6618b4ee51


…ock sync

After checkpoint sync + range sync, the head block's execution payload envelope may be missing (it was already gossiped before we connected). This leaves the head in PENDING state without a FULL variant. When new gossip blocks arrive expecting a FULL parent, they fail with BLOCK_ERROR_PRESTATE_MISSING / REGEN_ERROR_BLOCK_NOT_IN_FORKCHOICE and the node gets permanently stuck.

Fix: In the PRESTATE_MISSING error handler, detect when the failure is caused by a missing FULL variant (using the existing Gloas retry context check) and proactively fetch the parent's envelope via reqresp before retrying. This reuses the existing resolveEnvelopeForBlock method, which tries the gossip cache first, then falls back to ExecutionPayloadEnvelopesByRoot.

Tested: Local node on epbs-devnet-1 with checkpoint sync. The fix triggers correctly, the envelope is fetched via reqresp, and the head advances and tracks the chain.
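The lookup order described in this commit (gossip cache first, then a by-root request) can be sketched as below. The `gossipCache` map, the `fetchByRoot` callback, and the function name `resolveEnvelopeCacheFirst` are illustrative stand-ins, not the real `resolveEnvelopeForBlock` signature.

```typescript
// Hypothetical sketch of a cache-first envelope lookup with a network fallback.
// The cache map and `fetchByRoot` callback are illustrative stand-ins, not Lodestar APIs.
type EnvelopeSource = "cache" | "reqresp" | "none";

async function resolveEnvelopeCacheFirst<T>(
  parentRoot: string,
  gossipCache: ReadonlyMap<string, T>,
  fetchByRoot: (root: string) => Promise<T | null>
): Promise<{ envelope: T | null; source: EnvelopeSource }> {
  // 1. The envelope may already have arrived via gossip; no network round trip needed.
  const cached = gossipCache.get(parentRoot);
  if (cached !== undefined) return { envelope: cached, source: "cache" };

  // 2. Otherwise request it by root (stand-in for ExecutionPayloadEnvelopesByRoot).
  const fetched = await fetchByRoot(parentRoot);
  return fetched !== null ? { envelope: fetched, source: "reqresp" } : { envelope: null, source: "none" };
}

// Usage: "0xparent" is not in the cache, so the lookup falls back to the by-root request.
const cache = new Map<string, string>([["0xknown", "cached-envelope"]]);
const fetchByRoot = async (root: string) => (root === "0xparent" ? "fetched-envelope" : null);
resolveEnvelopeCacheFirst("0xparent", cache, fetchByRoot).then((r) => console.log(r.source)); // logs: reqresp
```

In the checkpoint-sync scenario above, the cache lookup misses because the envelope was gossiped before the node connected, so recovery depends on the reqresp fallback.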

Address review feedback: getGloasInvalidStateRootRetryContext reads the default (PENDING) variant, so wantsFullParent can be true even when the FULL variant already exists. Gate resolveEnvelopeForBlock on an explicit getBlockHex(parentRoot, PayloadStatus.FULL) check.

…head

When the local node is stalled behind the wall clock (e.g. after checkpoint sync), a peer with higher finalizedEpoch and higher headSlot could be incorrectly classified as FullySynced if its head fell within the slot-import tolerance range. This prevented finalized range sync from starting.

Add a currentSlot parameter to getPeerSyncType and check whether the local head is stalled behind the clock before applying the close-in-range FullySynced shortcut. When the local node is behind the clock and the remote has both a higher finalized epoch and a higher head slot, classify the peer as Advanced so range sync can begin.

Observed on epbs-devnet-1, where a non-official peer (self-identifying as erigon/caplin) connected to the network and was being misclassified as FullySynced despite having a higher finalized epoch. Note: the official epbs-devnet-1 network only runs Prysm and Lodestar CL clients; the erigon/caplin peer was a third-party node.

…stic

Add three edge case tests:

1. Known head root still returns FullySynced even when local is stalled
2. Exact tolerance boundary (currentSlot === headSlot + tolerance) stays FullySynced
3. Remote head not actually ahead of local stays FullySynced even when stalled

These prove that the hasBlock check takes precedence, that the boundary is strict (greater-than, not greater-or-equal), and that the remote must actually be ahead for the Advanced classification to fire.

Move the pendingBlock.status = downloaded assignment before the resolveEnvelopeForBlock call so that if the envelope import triggers executionPayloadAvailable -> triggerUnknownBlockSearch, this block is already in retryable state instead of stuck in processing.

lodekeeper requested a review from a team as a code owner April 2, 2026 10:49

gemini-code-assist bot reviewed Apr 2, 2026


Comment thread packages/beacon-node/src/sync/unknownBlock.ts

chatgpt-codex-connector bot reviewed Apr 2, 2026


Comment thread packages/beacon-node/src/sync/unknownBlock.ts

lodekeeper changed the title ~~fix: recover checkpoint sync from missing FULL parent in unknownBlock~~ fix: checkpoint sync follow-head fixes for epbs-devnet-1 Apr 2, 2026

lodekeeper force-pushed the fix/epbs-devnet1-checkpoint-follow-head-min branch 2 times, most recently from a2747d7 to b841f9f April 2, 2026 22:01

lodekeeper mentioned this pull request Apr 3, 2026

fix: checkpoint sync follow-head fixes for epbs-devnet-1 #9148

Closed

nflaig marked this pull request as draft April 3, 2026 14:59

lodekeeper added 5 commits April 3, 2026 18:09

lodekeeper force-pushed the fix/epbs-devnet1-checkpoint-follow-head-min branch from 9b5ba50 to ed6018a April 3, 2026 18:10

lodekeeper wants to merge 5 commits into ChainSafe:epbs-devnet-1 from lodekeeper:fix/epbs-devnet1-checkpoint-follow-head-min
