Skip to content

feat(metrics): add operation counters, deletion duration, resume type label, and abnormal state detection#336

Open
KeyOfSpectator wants to merge 5 commits intoopenkruise:masterfrom
KeyOfSpectator:features/refine-prometheus-metric-20260506
Open

feat(metrics): add operation counters, deletion duration, resume type label, and abnormal state detection#336
KeyOfSpectator wants to merge 5 commits intoopenkruise:masterfrom
KeyOfSpectator:features/refine-prometheus-metric-20260506

Conversation

@KeyOfSpectator
Copy link
Copy Markdown
Contributor

@KeyOfSpectator KeyOfSpectator commented May 6, 2026

What this PR does

Enhance Prometheus metrics across Controller and Manager layers with operation counters, deletion latency tracking, phase/condition mismatch detection, and cardinality-optimized label design to prevent time series explosion in production.

Changes

Label Cardinality Strategy

All Histogram and Counter metrics use only namespace as the dynamic label (no per-sandbox name) to prevent cardinality explosion. Only Gauge metrics that require per-sandbox granularity retain the name label.

Controller Layer — New Metrics

Metric Type Labels Purpose
sandbox_creation_total Counter namespace, result Track creation success/failure rate
sandbox_pause_total Counter namespace, result Track pause operation completions
sandbox_resume_total Counter namespace, result Track resume operation completions
sandbox_deletion_duration_seconds Histogram namespace Measure deletion latency distribution
sandbox_status_abnormal Gauge namespace, name, type Detect phase/condition stuck states

Controller Layer — Existing Metrics Enhanced

Metric Change
sandbox_creation_duration_seconds Histogram → HistogramVec (namespace only)
sandbox_inplace_update_duration_seconds Histogram → HistogramVec (namespace only)
sandbox_pause_duration_seconds Histogram → HistogramVec (namespace only)
sandbox_resume_duration_seconds Histogram → HistogramVec (namespace only)

SandboxSet Layer

Metric Type Labels Purpose
sandboxset_sandboxes_created_total Counter namespace Track sandbox creation by set
sandboxset_sandboxes_claimed_total Counter namespace Track consumption speed

SandboxClaim Layer

Metric Type Labels Purpose
sandboxset_claims_total Counter namespace, result Track claim operations
sandboxclaim_expired_total Counter namespace Track TTL expiration deletions

Manager Layer

All metrics upgraded from plain Histogram/Counter to Vec with namespace label:

  • Duration histograms: sandbox_resume/delete/claim/clone_duration_seconds, sandbox_claim_retries
  • Counters: sandbox_claim_creation_responses, sandbox_claim_total, sandbox_clone_total, sandbox_pause/resume/delete_responses
  • Route sync: sandbox_route_sync_duration_seconds, sandbox_route_sync_total

E2B Layer

  • sandbox_snapshot_duration_seconds — Histogram with namespace label
  • sandbox_snapshot_total — Counter with namespace, result labels

Key Design Decisions

  • Cardinality optimization: Histogram and Counter use namespace-only to avoid O(n) series per sandbox. Gauge retains name for per-instance stuck detection.
  • Operation counters co-located with duration observations via extended recordConditionDuration() (optional counter parameter)
  • Deletion duration observed at cleanup time using time.Since(DeletionTimestamp)
  • Abnormal state detection covers pause_incomplete and resume_incomplete types
  • ConstLabel source distinguishes controller (k8s) vs manager (e2b) origin

How to verify

go test ./pkg/controller/sandbox/ ./pkg/controller/sandboxset/ ./pkg/controller/sandboxclaim/... ./pkg/sandbox-manager/ -count=1 -timeout 120s

@kruise-bot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign furykerry for approval by writing /assign @furykerry in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 98.86364% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 75.43%. Comparing base (edaad73) to head (6010f41).
⚠️ Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
pkg/sandbox-manager/api.go 95.65% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #336      +/-   ##
==========================================
+ Coverage   74.65%   75.43%   +0.77%     
==========================================
  Files         141      142       +1     
  Lines        9836     9907      +71     
==========================================
+ Hits         7343     7473     +130     
+ Misses       2183     2127      -56     
+ Partials      310      307       -3     
Flag Coverage Δ
unittests 75.43% <98.86%> (+0.77%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread pkg/controller/sandbox/metrics.go Outdated
…ation_seconds

Remove high-cardinality labels (resume_reason, name) from the
sandbox_resume_duration_seconds histogram to prevent label explosion.
Only the namespace dimension is retained alongside the source=k8s
const label.

- Update metric definition to use []string{"namespace"} only
- Simplify recording logic by removing resumeReason extraction
- Remove deleteSandboxMetrics cleanup for this histogram since
  namespace-level metrics should not be deleted per-sandbox
- Remove TestSandboxResumeDurationWithReason and related helpers
- Update TestSandboxResumeDuration to match new label signature
…prevent cardinality explosion

Remove the high-cardinality 'name' label from all Histogram and Counter
metrics across controller, manager, and e2b layers. Only namespace
dimension is retained for these metric types to prevent time series
explosion in production environments.

Affected metrics:
- Controller: sandbox_creation/pause/resume/deletion_duration_seconds,
  sandbox_creation/pause/resume_total
- SandboxSet: sandboxset_sandboxes_created/claimed_total
- SandboxClaim: sandboxset_claims_total, sandboxclaim_expired_total
- Manager: all duration histograms, response/total counters,
  route sync metrics
- E2B: sandbox_snapshot_duration_seconds, sandbox_snapshot_total

Preserved with name label:
- sandbox_status_abnormal (Gauge) - needs per-sandbox granularity
  for stuck state detection

Also fix test isolation by using unique namespaces per test case
to prevent cross-test interference in namespace-level histograms.
// Requirement: Track failure in API layer
sandboxClaimCreationResponses.WithLabelValues("failure").Inc()
sandboxClaimTotal.WithLabelValues("failure", "unknown").Inc()
sandboxClaimCreationResponses.WithLabelValues("", "failure").Inc()
Copy link
Copy Markdown
Member

@furykerry furykerry May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can get namespace from the template or Checkpoint, consider refactor HasTemplate/HasCheckpoint to GetTemplate/GetCheckpoint and return the template/checkpoint object. Consider modify CreateSandbox to save the namespace in the context

ConstLabels: prometheus.Labels{"source": "k8s"},
Buckets: prometheus.ExponentialBuckets(0.02, 2, 12), // 20ms -> ~41s
},
[]string{"namespace", "name"},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz also remove name label

}

// DeleteSandboxClaimCounterMetrics is a no-op since counter metrics are at namespace level.
func DeleteSandboxClaimCounterMetrics(namespace, name string) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove no-ops func?

}
}

sandboxStatusAbnormal.WithLabelValues(namespace, name, "pause_incomplete").Set(pauseAbnormal)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we delete sandboxStatusAbnormal metrics if sandbox becomes healthy again. i think we should only keep abnormal metrics

Help: "Cumulative number of Sandboxes claimed from the SandboxSet",
ConstLabels: prometheus.Labels{"source": "k8s"},
},
[]string{"namespace"},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for sandboxSetSandboxesCreatedTotal and sandboxSetSandboxesClaimedTotal, we can add sandboxset name as label, the number of sandboxset is much less than sandbox and sandboxclaims


var (
// sandboxSetClaimsTotal tracks the cumulative number of times a SandboxSet has been claimed.
sandboxSetClaimsTotal = prometheus.NewCounterVec(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to merge this file to pkg/sandboxlcaim/metrics.go?

…point

Refactor Infrastructure interface to return full CR objects instead of
booleans, enabling proper namespace extraction for Prometheus metric
labels.

Changes:
- HasTemplate(ctx, name) bool -> GetTemplate(ctx, name) (*v1alpha1.SandboxSet, error)
- HasCheckpoint(ctx, name) bool -> GetCheckpoint(ctx, name) (*v1alpha1.Checkpoint, error)
- ClaimSandbox: use template.GetNamespace() for metric namespace label
- CloneSandbox: attempt GetCheckpoint to extract namespace on failure
- Update E2B CreateSandbox and all related tests/mocks accordingly
@kruise-bot
Copy link
Copy Markdown

@KeyOfSpectator: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants