tests/eclient: rotate controller cert twice in ctrl_cert_change #1153

Draft
eriknordmark wants to merge 1 commit into lf-edge:master from
eriknordmark:ctrl-cert-change-twice

Conversation

@eriknordmark
Contributor

Summary

Extend ctrl_cert_change to perform two consecutive controller signing
certificate rotations between the initial eclient1 deploy and the
eclient2 deploy, so that both states of /persist/checkpoint/controllercerts.bak
on the device are covered:

  1. First rotation runs while .bak does not yet exist. EVE's
    auth-verify fallback in VerifyAuthContainerHeader
    (pkg/pillar/controllerconn/authen.go) cannot find the backup
    file and must surface SenderStatusCertMiss so that
    zedagent triggers a /certs fetch. Saving the new chain via
    MaybeSaveControllerCerts populates .bak as a side effect of
    WriteRenameWithBackup.
  2. Second rotation runs after .bak has been populated by the
    first rotation. The backup-fallback path actually loads a cert
    (the previous-but-one cert), determines that its hash still does
    not match the message, and only after that fails does EVE trigger
    /certs again. MaybeSaveControllerCerts also has to refresh an
    existing .bak in place.
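
The two .bak states can be illustrated with a minimal sketch. It assumes a write-then-rename helper that keeps the previous file as `<path>.bak`; `writeRenameWithBackup` and `backupAfterRotations` here are hypothetical stand-ins written for this illustration, not the pillar implementation of WriteRenameWithBackup, and the `cert0`/`cert1`/`cert2` contents are placeholders.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeRenameWithBackup is a hypothetical stand-in: it writes the new
// contents to a temp file, moves the current file (if any) to <path>.bak,
// then renames the temp file into place.
func writeRenameWithBackup(path string, data []byte) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	// A missing current file is fine; there is simply nothing to back up.
	if err := os.Rename(path, path+".bak"); err != nil && !os.IsNotExist(err) {
		return err
	}
	return os.Rename(tmp, path)
}

// backupAfterRotations starts with a cert file and no .bak, saves two new
// chains, and returns the .bak contents observed after each save.
func backupAfterRotations() (first, second string, err error) {
	dir, err := os.MkdirTemp("", "certs")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "controllercerts")
	if err = os.WriteFile(path, []byte("cert0"), 0o600); err != nil {
		return "", "", err
	}
	var bak []byte
	for i, next := range []string{"cert1", "cert2"} {
		if err = writeRenameWithBackup(path, []byte(next)); err != nil {
			return "", "", err
		}
		if bak, err = os.ReadFile(path + ".bak"); err != nil {
			return "", "", err
		}
		if i == 0 {
			first = string(bak)
		} else {
			second = string(bak)
		}
	}
	return first, second, nil
}

func main() {
	first, second, err := backupAfterRotations()
	if err != nil {
		panic(err)
	}
	fmt.Println(first, second)
}
```

Under these assumptions the first save creates .bak and the second save replaces it, which is why a message signed with the second new cert is checked against the previous-but-one cert in the backup path.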

Both TestLog invocations match the same message,
"Rebuilding intended global config, reasons: updating app connectivity".
Because the test bus subscribes with elog.LogNew
(pkg/testcontext/testProc.go:197), each invocation only sees logs that
arrive after it starts. The second invocation therefore only sees logs
that arrive after the second rotation, so the two events do not alias.
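
Why a "new logs only" subscription keeps the two matches apart can be sketched as follows. This is a toy model, not eden's elog or test bus code: `logBus`, `watchNew`, and `matched` are hypothetical names, and the only thing taken from the PR is the log message text.

```go
package main

import (
	"fmt"
	"strings"
)

// logBus is a toy in-memory log stream.
type logBus struct {
	entries []string
}

func (b *logBus) emit(msg string) { b.entries = append(b.entries, msg) }

// newOnlyWatcher remembers where the stream ended when it was created,
// so it never looks at entries that were already present.
type newOnlyWatcher struct {
	bus   *logBus
	start int
}

func (b *logBus) watchNew() *newOnlyWatcher {
	return &newOnlyWatcher{bus: b, start: len(b.entries)}
}

// matched reports whether any entry emitted after the watcher was created
// contains substr.
func (w *newOnlyWatcher) matched(substr string) bool {
	for _, e := range w.bus.entries[w.start:] {
		if strings.Contains(e, substr) {
			return true
		}
	}
	return false
}

const rebuildLog = "Rebuilding intended global config, reasons: updating app connectivity"

func main() {
	bus := &logBus{}

	first := bus.watchNew()
	bus.emit(rebuildLog) // rotation 1 completes
	fmt.Println(first.matched(rebuildLog)) // true

	second := bus.watchNew()
	fmt.Println(second.matched(rebuildLog)) // false: rotation 1's log is old
	bus.emit(rebuildLog) // rotation 2 completes
	fmt.Println(second.matched(rebuildLog)) // true
}
```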

Motivation

The single-rotation form of this test was masking a real EVE bug:
when .bak does not exist, the backup attempt in
VerifyAuthContainerHeader returns SenderStatusNone from
LoadSavedServerSigningCert (file-not-found), which clobbers the
original SenderStatusCertMiss produced by the primary attempt.
Callers in zedagent then fall through to the default case instead
of case types.SenderStatusCertMiss:, so triggerControllerCertEvent
never fires, EVE never fetches the rotated cert, and the test ends up
timing out waiting for the rebuild log. See
run 25392481670
for an example failure.
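
The clobbering pattern described above can be reduced to a small model. This is a sketch only: the `SenderStatus` type, the `attempt` callbacks, and both `verify*` functions are hypothetical and do not reproduce authen.go. The "preserving" variant shows one possible shape of a fix, not necessarily the one in flight on the EVE side.

```go
package main

import (
	"errors"
	"fmt"
)

// SenderStatus is a stand-in for the EVE status type; only the two values
// named in this PR are modeled.
type SenderStatus int

const (
	SenderStatusNone SenderStatus = iota
	SenderStatusCertMiss
)

// attempt models one verification attempt (primary cert or .bak cert).
type attempt func() (SenderStatus, error)

// verifyClobbering has the buggy shape: the backup attempt's status
// replaces the primary attempt's status even when the backup fails.
func verifyClobbering(primary, backup attempt) (SenderStatus, error) {
	status, err := primary()
	if err == nil {
		return status, nil
	}
	status, err = backup() // overwrites SenderStatusCertMiss
	return status, err
}

// verifyPreserving keeps the primary status unless the backup succeeds.
func verifyPreserving(primary, backup attempt) (SenderStatus, error) {
	status, err := primary()
	if err == nil {
		return status, nil
	}
	if _, backupErr := backup(); backupErr == nil {
		return SenderStatusNone, nil
	}
	return status, err
}

// certMiss models the primary attempt failing on a rotated cert.
func certMiss() (SenderStatus, error) {
	return SenderStatusCertMiss, errors.New("cert hash mismatch")
}

// bakMissing models the backup attempt when .bak does not exist.
func bakMissing() (SenderStatus, error) {
	return SenderStatusNone, errors.New("backup file not found")
}

func main() {
	buggy, _ := verifyClobbering(certMiss, bakMissing)
	fixed, _ := verifyPreserving(certMiss, bakMissing)
	fmt.Println(buggy == SenderStatusCertMiss) // false: fetch never triggered
	fmt.Println(fixed == SenderStatusCertMiss) // true: caller can fetch /certs
}
```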

The fix is on the EVE side and is in flight separately. With both
rotations in place, this test will:

  • exercise the cert-fetch trigger path (rotation 1) and
  • exercise the backup-cert-load path (rotation 2)

so that any future regression that breaks either branch fails the
test rather than silently passing.

Test plan

  • Smoke (ext4, true) on EVE master with the authen.go
    SenderStatusCertMiss-preservation fix applied — both rotations
    should complete inside the 15m windows.
  • Smoke (ext4, true) on EVE master without the fix — rotation 1
    should still time out (regression confirmation).
  • Smoke (zfs, true) and Smoke (*, false) for parity with the
    existing matrix entries.
  • Confirm in the captured device logs that the message
    "MaybeSaveControllerCerts updated the certs" appears after each
    rotation and that "controllercerts.bak worked" appears on the
    second rotation (or, equivalently, that the controllercerts.bak
    file size is non-zero after the run).

Notes for reviewers

  • gen-signing-cert always reads signing.pem/signing-key.pem from
    edenHome and only varies NotBefore/NotAfter
    (pkg/utils/x509.go:GenServerCertFromPrevCertAndKey). Two
    back-to-back invocations produce two distinct certs because the
    timestamps differ, so the second change-signing-cert is a real
    rotation, not a no-op.
  • check_sign_cert.sh is unchanged. It compares EVE's installed
    cert against /tmp/signing-new.pem, which after the second
    rotation holds the second new cert (overwritten by the second
    gen-signing-cert -o /tmp/signing-new.pem), so the existing check
    still validates the correct end state.
  • Total test runtime grows by up to one extra 15m TestLog window in
    the worst case. The window can be tightened later if warranted.
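
The first note above, that two certs issued from the same key differ when only the validity window differs, can be checked with a minimal sketch using Go's standard crypto/x509. It does not call eden's GenServerCertFromPrevCertAndKey; the self-signed template, the fixed serial number, and the `issue`/`rotationsDiffer` helpers are assumptions made for this illustration. It compares the to-be-signed bytes rather than the full DER, because ECDSA signatures are randomized and would differ in any case.

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// issue creates a self-signed certificate whose only varying input is the
// validity window (hypothetical helper for this sketch).
func issue(key *ecdsa.PrivateKey, notBefore time.Time) (*x509.Certificate, error) {
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example signing cert"},
		NotBefore:    notBefore,
		NotAfter:     notBefore.Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// rotationsDiffer issues two certs from one key, one second apart, and
// reports whether their to-be-signed portions differ.
func rotationsDiffer() (bool, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return false, err
	}
	t0 := time.Now().UTC().Truncate(time.Second)
	first, err := issue(key, t0)
	if err != nil {
		return false, err
	}
	second, err := issue(key, t0.Add(time.Second))
	if err != nil {
		return false, err
	}
	return !bytes.Equal(first.RawTBSCertificate, second.RawTBSCertificate), nil
}

func main() {
	differ, err := rotationsDiffer()
	if err != nil {
		panic(err)
	}
	fmt.Println("certs differ:", differ)
}
```

Since X.509 validity timestamps have one-second granularity, the sketch spaces the two issuances a full second apart.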

🤖 Generated with Claude Code

Extend the ctrl_cert_change test to perform two consecutive controller
signing certificate rotations between the initial eclient1 deploy and
the eclient2 deploy. The first rotation runs while the device has no
/persist/checkpoint/controllercerts.bak, so EVE's auth-verify fallback
in VerifyAuthContainerHeader cannot use the backup file and must reach
SenderStatusCertMiss to trigger a /certs fetch. The second rotation
runs after MaybeSaveControllerCerts has populated .bak, so the
backup-load path is actually exercised before the fetch is triggered,
and MaybeSaveControllerCerts also has to refresh an existing .bak in
place.

This catches a regression class where the cert-miss status returned by
the first verify is silently overwritten by the .bak fallback's status,
which caused the test to time out waiting for the "updating app
connectivity" log when .bak did not exist.

Each rotation waits for the rebuild log produced by zedrouter's
UpdateAppConn so that a second TestLog invocation only matches a log
emitted after the second rotation (LogChecker is started in LogNew
mode by the test bus).

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>