feat(metrics): add operation counters, deletion duration, resume type label, and abnormal state detection#336
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #336 +/- ##
==========================================
+ Coverage 74.65% 75.43% +0.77%
==========================================
Files 141 142 +1
Lines 9836 9907 +71
==========================================
+ Hits 7343 7473 +130
+ Misses 2183 2127 -56
+ Partials 310 307 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…ation_seconds
Remove high-cardinality labels (resume_reason, name) from the
sandbox_resume_duration_seconds histogram to prevent label explosion.
Only the namespace dimension is retained alongside the source=k8s
const label.
- Update metric definition to use []string{"namespace"} only
- Simplify recording logic by removing resumeReason extraction
- Remove deleteSandboxMetrics cleanup for this histogram since
namespace-level metrics should not be deleted per-sandbox
- Remove TestSandboxResumeDurationWithReason and related helpers
- Update TestSandboxResumeDuration to match new label signature
…prevent cardinality explosion Remove the high-cardinality 'name' label from all Histogram and Counter metrics across controller, manager, and e2b layers. Only namespace dimension is retained for these metric types to prevent time series explosion in production environments. Affected metrics: - Controller: sandbox_creation/pause/resume/deletion_duration_seconds, sandbox_creation/pause/resume_total - SandboxSet: sandboxset_sandboxes_created/claimed_total - SandboxClaim: sandboxset_claims_total, sandboxclaim_expired_total - Manager: all duration histograms, response/total counters, route sync metrics - E2B: sandbox_snapshot_duration_seconds, sandbox_snapshot_total Preserved with name label: - sandbox_status_abnormal (Gauge) - needs per-sandbox granularity for stuck state detection Also fix test isolation by using unique namespaces per test case to prevent cross-test interference in namespace-level histograms.
| // Requirement: Track failure in API layer | ||
| sandboxClaimCreationResponses.WithLabelValues("failure").Inc() | ||
| sandboxClaimTotal.WithLabelValues("failure", "unknown").Inc() | ||
| sandboxClaimCreationResponses.WithLabelValues("", "failure").Inc() |
There was a problem hiding this comment.
we can get namespace from the template or Checkpoint, consider refactor HasTemplate/HasCheckpoint to GetTemplate/GetCheckpoint and return the template/checkpoint object. Consider modify CreateSandbox to save the namespace in the context
| ConstLabels: prometheus.Labels{"source": "k8s"}, | ||
| Buckets: prometheus.ExponentialBuckets(0.02, 2, 12), // 20ms -> ~41s | ||
| }, | ||
| []string{"namespace", "name"}, |
| } | ||
|
|
||
| // DeleteSandboxClaimCounterMetrics is a no-op since counter metrics are at namespace level. | ||
| func DeleteSandboxClaimCounterMetrics(namespace, name string) { |
| } | ||
| } | ||
|
|
||
| sandboxStatusAbnormal.WithLabelValues(namespace, name, "pause_incomplete").Set(pauseAbnormal) |
There was a problem hiding this comment.
can we delete sandboxStatusAbnormal metrics if sandbox becomes healthy again. i think we should only keep abnormal metrics
| Help: "Cumulative number of Sandboxes claimed from the SandboxSet", | ||
| ConstLabels: prometheus.Labels{"source": "k8s"}, | ||
| }, | ||
| []string{"namespace"}, |
There was a problem hiding this comment.
for sandboxSetSandboxesCreatedTotal and sandboxSetSandboxesClaimedTotal, we can add sandboxset name as label, the number of sandboxset is much less than sandbox and sandboxclaims
|
|
||
| var ( | ||
| // sandboxSetClaimsTotal tracks the cumulative number of times a SandboxSet has been claimed. | ||
| sandboxSetClaimsTotal = prometheus.NewCounterVec( |
There was a problem hiding this comment.
is it possible to merge this file to pkg/sandboxlcaim/metrics.go?
…point Refactor Infrastructure interface to return full CR objects instead of booleans, enabling proper namespace extraction for Prometheus metric labels. Changes: - HasTemplate(ctx, name) bool -> GetTemplate(ctx, name) (*v1alpha1.SandboxSet, error) - HasCheckpoint(ctx, name) bool -> GetCheckpoint(ctx, name) (*v1alpha1.Checkpoint, error) - ClaimSandbox: use template.GetNamespace() for metric namespace label - CloneSandbox: attempt GetCheckpoint to extract namespace on failure - Update E2B CreateSandbox and all related tests/mocks accordingly
|
@KeyOfSpectator: PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What this PR does
Enhance Prometheus metrics across Controller and Manager layers with operation counters, deletion latency tracking, phase/condition mismatch detection, and cardinality-optimized label design to prevent time series explosion in production.
Changes
Label Cardinality Strategy
All Histogram and Counter metrics use only
namespaceas the dynamic label (no per-sandboxname) to prevent cardinality explosion. Only Gauge metrics that require per-sandbox granularity retain thenamelabel.Controller Layer — New Metrics
sandbox_creation_totalsandbox_pause_totalsandbox_resume_totalsandbox_deletion_duration_secondssandbox_status_abnormalController Layer — Existing Metrics Enhanced
sandbox_creation_duration_secondssandbox_inplace_update_duration_secondssandbox_pause_duration_secondssandbox_resume_duration_secondsSandboxSet Layer
sandboxset_sandboxes_created_totalsandboxset_sandboxes_claimed_totalSandboxClaim Layer
sandboxset_claims_totalsandboxclaim_expired_totalManager Layer
All metrics upgraded from plain Histogram/Counter to Vec with
namespacelabel:sandbox_resume/delete/claim/clone_duration_seconds,sandbox_claim_retriessandbox_claim_creation_responses,sandbox_claim_total,sandbox_clone_total,sandbox_pause/resume/delete_responsessandbox_route_sync_duration_seconds,sandbox_route_sync_totalE2B Layer
sandbox_snapshot_duration_seconds— Histogram with namespace labelsandbox_snapshot_total— Counter with namespace, result labelsKey Design Decisions
recordConditionDuration()(optionalcounterparameter)time.Since(DeletionTimestamp)pause_incompleteandresume_incompletetypessourcedistinguishes controller (k8s) vs manager (e2b) originHow to verify
go test ./pkg/controller/sandbox/ ./pkg/controller/sandboxset/ ./pkg/controller/sandboxclaim/... ./pkg/sandbox-manager/ -count=1 -timeout 120s