fix(cli): retry upload-complete and cancel orphaned server deploys #15371
abhiaiyer91 merged 9 commits into main
The upload-complete step (Step 3) in server deploy had no retry logic. Transient failures (network errors, 5xx, 401 token expiry) would leave deploys permanently stuck in 'queued' status with no cleanup path.
- Retry upload-complete up to 3 times with exponential backoff (1s, 2s, 4s)
- Refresh auth token on 401 before retrying
- Cancel orphaned deploys when upload or confirmation fails after retries
- Add user-visible log messages during retries and cleanup
Co-Authored-By: Mastra Code (anthropic/claude-opus-4-6) <noreply@mastra.ai>
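The backoff schedule named in this commit (1s, 2s, 4s) can be sketched as a doubling delay. This is a minimal illustration, not the CLI's code: `backoffDelayMs` and `backoffSchedule` are hypothetical names, and the 1000 ms base is inferred from the delays listed above.

```typescript
// Hypothetical helper: delay before retry number `attempt` (0-based).
// With baseMs = 1000 this gives 1000, 2000, 4000 ms for attempts 0, 1, 2.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt;
}

// Hypothetical helper: the full list of delays for a given retry budget.
function backoffSchedule(maxRetries: number, baseMs = 1000): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => backoffDelayMs(attempt, baseMs));
}

console.log(backoffSchedule(3)); // [ 1000, 2000, 4000 ]
```

Doubling keeps the total wait short (7 seconds across three retries) while still giving a briefly unavailable server time to recover.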
🦋 Changeset detected
Latest commit: cbb4304
The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages.
Walkthrough
Adds a changeset and updates CLI deploy flows to retry transient upload/confirmation failures (including token refresh on 401), best-effort cancel created deploys on failures, moves to API-client POST calls for studio deploys, and extends OpenAPI typings (cancel endpoint,
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Use isRetryablePollingError from shared polling utils to catch network failures (ECONNRESET, ETIMEDOUT, fetch failed, etc.) that throw exceptions rather than returning HTTP error responses. Previously these would bypass the retry loop entirely. Co-Authored-By: Mastra Code (anthropic/claude-opus-4-6) <noreply@mastra.ai>
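To illustrate why thrown network errors need their own check, here is a hedged sketch of a classifier for exceptions. `looksLikeTransientNetworkError` and `attempt` are hypothetical stand-ins; the real `isRetryablePollingError` in the shared polling utils is not shown in this PR page and may use different rules.

```typescript
// Hypothetical stand-in for a retryability check on thrown errors.
// The real isRetryablePollingError may recognize more (or different) cases.
const RETRYABLE_CODES = new Set(["ECONNRESET", "ETIMEDOUT"]);

function looksLikeTransientNetworkError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  const code = (err as Error & { code?: unknown }).code;
  if (typeof code === "string" && RETRYABLE_CODES.has(code)) return true;
  // Node's built-in fetch rejects with a TypeError whose message is "fetch failed".
  return /fetch failed/i.test(err.message);
}

// One decision point for both outcomes: a resolved request and a thrown error.
type Outcome<T> =
  | { kind: "ok"; value: T }
  | { kind: "retry"; reason: string }
  | { kind: "fatal"; error: unknown };

async function attempt<T>(request: () => Promise<T>): Promise<Outcome<T>> {
  try {
    return { kind: "ok", value: await request() };
  } catch (err) {
    // Without this catch, a thrown network error would skip the retry loop entirely.
    return looksLikeTransientNetworkError(err)
      ? { kind: "retry", reason: (err as Error).message }
      : { kind: "fatal", error: err };
  }
}
```

A retry loop built on `attempt` can then treat `kind: "retry"` the same way it treats a 5xx response.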
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.changeset/whole-places-beam.md:
- Line 5: The changeset message is too implementation-heavy; edit the
.changeset/whole-places-beam.md entry to a one- or two-sentence, user-facing
release note that highlights the outcome (e.g., "Fixes server deploys getting
stuck in 'queued' status by improving retry and cleanup behavior so failed
uploads no longer block deployments") and remove technical details about status
codes, retry counts, exponential backoff, and internal cleanup mechanics so it
reads like a concise changelog entry.
In `@packages/cli/src/commands/server/platform-api.ts`:
- Around line 98-106: The cancelDeploy helper currently assumes
deployClient.POST throws on failure but it returns an object; update
cancelDeploy (function cancelDeploy) to inspect the returned result from
deployClient.POST (the { error, response } shape), and treat non-2xx or a
present error as a failure: log a warning including the error/response details
and either throw or return a failure signal so callers know cancellation didn’t
succeed; keep the existing try/catch but after awaiting deployClient.POST check
the returned error or response.status and handle/log appropriately instead of
assuming success.
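The fix requested above can be sketched as follows. This is an illustration under stated assumptions: `PostResult` and `DeployClient` are simplified local types mirroring the `{ error, response }` shape the review describes, and `CANCEL_PATH` is a hypothetical route, since the real cancel endpoint path does not appear in this conversation.

```typescript
// Assumed result shape, mirroring the `{ error, response }` object described in the review.
interface PostResult {
  error?: unknown;
  response: { ok: boolean; status: number };
}

// Simplified client interface for this sketch; the real deployClient is a typed API client.
interface DeployClient {
  POST(path: string, init: { params: { path: { id: string } } }): Promise<PostResult>;
}

// Hypothetical route, used only for illustration.
const CANCEL_PATH = "/v1/server/deploys/{id}/cancel";

// Returns true only when the server confirmed the cancellation.
async function cancelDeploy(
  client: DeployClient,
  deployId: string,
  warn: (msg: string) => void = console.warn,
): Promise<boolean> {
  try {
    const { error, response } = await client.POST(CANCEL_PATH, { params: { path: { id: deployId } } });
    // The client resolves even on non-2xx, so failure must be read from the result.
    if (error !== undefined || !response.ok) {
      warn(`Could not cancel deploy ${deployId} (HTTP ${response.status})`);
      return false;
    }
    return true;
  } catch (err) {
    // Network-level failure: the request never produced a response.
    warn(`Could not cancel deploy ${deployId}: ${err instanceof Error ? err.message : String(err)}`);
    return false;
  }
}
```

Returning a boolean rather than throwing keeps cancellation best-effort while still letting callers report that cleanup did not succeed.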
📒 Files selected for processing (2)
.changeset/whole-places-beam.md
packages/cli/src/commands/server/platform-api.ts
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/cli/src/commands/server/platform-api.ts`:
- Around line 171-174: The token refresh (getToken) can reject and currently
runs outside the request try/catch, so wrap the refresh+recreate logic in its
own try/catch inside the existing request retry path: call getToken() and
createApiClient(...) inside a try block; on failure catch the error and treat it
as a terminal retry outcome by invoking the same cleanup/cancel routine used for
other terminal failures (the cancellation logic used later in the function) or
rethrow so the outer try/catch executes cleanup. Refer to getToken,
createApiClient, and currentClient to locate the refresh logic and ensure the
cleanup/cancel path always runs when token refresh fails.
📒 Files selected for processing (1)
packages/cli/src/commands/server/platform-api.ts
- Check cancelDeploy return value instead of assuming POST throws on error
- Wrap token refresh (getToken) in try-catch so failures trigger deploy cleanup
- Simplify changeset to user-facing release note
Co-Authored-By: Mastra Code (anthropic/claude-opus-4-6) <noreply@mastra.ai>
Co-Authored-By: Mastra Code (anthropic/claude-opus-4-6) <noreply@mastra.ai>
Actionable comments posted: 2
🧹 Nitpick comments (1)
packages/cli/src/commands/platform-api.d.ts (1)
2933-2935: Narrow successful cancel responses to `cancelled`.
A 200 from either cancel endpoint should not need to advertise arbitrary deploy states. Keeping `status` as the full union / plain `string` weakens exhaustiveness checks and forces downstream code to handle states that should be impossible after a successful cancel. Please tighten the OpenAPI schema so the success payload returns `status: 'cancelled'` or the smallest accurate union.
As per coding guidelines, `**/*.{ts,tsx}`: All packages use TypeScript with strict type checking.
Also applies to: 3920-3921
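The effect of this nitpick on downstream code can be shown with a small type sketch. The `DeployStatus` members here are assumptions chosen for illustration; the generated union in `platform-api.d.ts` may contain different states.

```typescript
// Hypothetical status union; the real generated union may have other members.
type DeployStatus = "queued" | "running" | "failed" | "cancelled";

// Wide response type: the compiler believes any state can follow a successful cancel.
interface WideCancelResponse { id: string; status: DeployStatus }

// Narrowed response type, as the review suggests.
interface CancelResponse { id: string; status: "cancelled" }

// With the wide type, strict exhaustiveness forces branches for impossible states.
function describeWide(res: WideCancelResponse): string {
  switch (res.status) {
    case "cancelled":
      return `Deploy ${res.id} cancelled`;
    case "queued":
    case "running":
    case "failed":
      return `Deploy ${res.id} unexpectedly ${res.status}`;
    default: {
      const unreachable: never = res.status; // compile error if a member is missed
      return unreachable;
    }
  }
}

// With the literal type, there is only one state left to handle.
function describeNarrow(res: CancelResponse): string {
  return `Deploy ${res.id} ${res.status}`;
}

console.log(describeNarrow({ id: "d1", status: "cancelled" })); // Deploy d1 cancelled
```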
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/cli/src/commands/platform-api.d.ts`:
- Line 332: Update the OpenAPI source schema to include the smaller time
granularities (1m, 5m, 15m) for the usage endpoint, then regenerate the
TypeScript client so the declaration for getV1GatewayProjectsByIdUsageActivity
(in packages/cli/src/commands/platform-api.d.ts) exposes granularity values '1m'
| '5m' | '15m' | '1h' | 'daily' | 'weekly' | 'monthly' and corresponding query
params; specifically, modify the schema's parameter enum for granularity to
include the three missing values, run the codegen that produces
platform-api.d.ts, and verify the generated function signature and query param
types match the docstring.
- Line 3362: The generated type declaration incorrectly sets "requestBody?:
never;" for the env-update operations, removing the payload type and breaking
strict-TS callers; revert this by restoring the requestBody schema for those
operations in the OpenAPI spec (do not edit the .d.ts directly), regenerate the
client so the generated declarations include the proper request body types, and
ensure the OpenAPI operation(s) that currently produce "requestBody?: never;"
are updated to reference the correct env-update payload schema (search for
occurrences of "requestBody?: never;" around the reported spots and fix the
source OpenAPI operation definitions before regenerating).
---
Nitpick comments:
In `@packages/cli/src/commands/platform-api.d.ts`:
- Around line 2933-2935: The response schema currently types `status` as a wide
union/string; change it to the exact cancelled literal so successful cancel
endpoints only return status: 'cancelled' (not the full union or plain string).
Locate the schema block that contains the lines with `id: string;` and `status:`
(and the duplicate at the other occurrence around the noted second location) and
replace the union with the single string literal `'cancelled'`, and ensure any
generated types/clients are regenerated or adjusted to satisfy TypeScript strict
checking.
📒 Files selected for processing (1)
packages/cli/src/commands/platform-api.d.ts
/**
 * Get usage activity
- * @description Aggregated usage metrics for a project. Supports multiple granularities (1h, daily, weekly, monthly) and optional group-by dimensions (model, provider, source_category, api_key_id).
+ * @description Aggregated usage metrics for a project. Supports multiple granularities (1m, 5m, 15m, 1h, daily, weekly, monthly) and optional group-by dimensions (model, provider, source_category, api_key_id).
Sync the documented granularities with the generated contract.
This description now promises 1m, 5m, and 15m, but the generated type for getV1GatewayProjectsByIdUsageActivity later in this file still narrows granularity to 'daily' | 'weekly' | 'monthly' | '1h' and does not expose matching query params. That leaves the CLI contract self-contradictory for strict TS consumers. Please fix the source OpenAPI schema and regenerate this file.
As per coding guidelines, packages/cli/**: Preserve stable CLI behavior in the packages/cli/ directory, and **/*.{ts,tsx}: All packages use TypeScript with strict type checking.
Co-Authored-By: Mastra Code (anthropic/claude-opus-4-6) <noreply@mastra.ai>
…ncel
- Replace platformFetch with createApiClient for create deploy and upload-complete
- Add retry logic (3 retries, exponential backoff) for upload-complete step
- Add cancel-on-failure cleanup for both artifact upload and confirmation failures
- Handle 401 with token refresh, network errors with isRetryablePollingError
- Regenerate platform-api.d.ts with fixed studio + server env requestBody types
- Update tests to mock typed client instead of raw fetch
Co-Authored-By: Mastra Code (anthropic/claude-opus-4-6) <noreply@mastra.ai>
🧹 Nitpick comments (2)
packages/cli/src/commands/studio/platform-api.test.ts (1)
120-198: Consider adding test coverage for upload-complete retry scenarios.
The implementation includes retry logic for upload-complete failures (5xx, 401 with token refresh, network errors), but the test suite doesn't cover these paths. While not blocking, adding tests for:
- Successful retry after transient 5xx failure
- Token refresh on 401 before retry
- Cancel deploy after retry exhaustion
would improve confidence in the retry behavior.
packages/cli/src/commands/studio/platform-api.ts (1)
127-127: Minor: Redundant Buffer conversion.
`zipBuffer` is already a `Buffer`, so `Buffer.from(zipBuffer)` creates an unnecessary copy. `writeFile` accepts `Buffer` directly.
🔧 Suggested fix
- await writeFile(fileURLToPath(uploadUrl), Buffer.from(zipBuffer));
+ await writeFile(fileURLToPath(uploadUrl), zipBuffer);
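The copy that the Buffer nitpick refers to can be observed directly in Node.js. In this small demonstration the byte contents are placeholder data, not a real archive.

```typescript
import { Buffer } from "node:buffer";

// Placeholder bytes standing in for the real zip contents.
const zipBuffer = Buffer.from("PK-demo-bytes");

// Buffer.from(buffer) allocates new memory and copies every byte.
const copy = Buffer.from(zipBuffer);

copy[0] = 0; // mutate the copy only

console.log(zipBuffer[0] === 0x50); // true: the original still starts with "P" (0x50)
console.log(copy.equals(zipBuffer)); // false: the first byte now differs
```

For a large deploy archive the extra allocation doubles peak memory for no benefit, which is why passing `zipBuffer` straight through is preferable.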
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/cli/src/commands/studio/platform-api.test.ts`:
- Around line 120-198: Add unit tests for uploadDeploy to cover the
upload-complete retry paths: write one test where the initial POST to
'/v1/studio/deploys/{id}/upload-complete' returns a transient 5xx and the
subsequent retry succeeds (assert two POST calls and final success), another
where the first upload-complete returns 401 and the code performs token refresh
before retrying (stub the refresh flow and assert refresh was called and
upload-complete retried), and a test where upload-complete keeps failing until
retries exhaust and uploadDeploy then calls the cancel-deploy POST (assert
cancel POST was invoked and the function throws); target the uploadDeploy
function and reuse/mock createApiClient().POST (mockPOST) and global fetch as in
existing tests to simulate responses.
In `@packages/cli/src/commands/studio/platform-api.ts`:
- Line 127: The code writes an already-buffered payload using Buffer.from which
makes an unnecessary copy; update the call that writes the file
(writeFile(fileURLToPath(uploadUrl), ...)) to pass zipBuffer directly instead of
Buffer.from(zipBuffer) so writeFile receives the original Buffer (use zipBuffer)
and remove the redundant Buffer.from allocation.
📒 Files selected for processing (3)
packages/cli/src/commands/platform-api.d.ts
packages/cli/src/commands/studio/platform-api.test.ts
packages/cli/src/commands/studio/platform-api.ts
…io poll 401
- Extract confirmUploadWithRetry and bestEffortCancel to utils/deploy-upload.ts
- Refactor uploadServerDeploy and uploadDeploy to use shared helpers
- Fix pollDeploy (studio) missing 401 token refresh; it now matches pollServerDeploy
- Callers keep endpoint-specific logic (create deploy, response extraction)
- Shared code handles retry loop, backoff, 401 refresh, cancel-on-exhaustion
Co-Authored-By: Mastra Code (anthropic/claude-opus-4-6) <noreply@mastra.ai>
Cover the root failure modes:
- bestEffortCancel swallows API and network errors
- confirmUploadWithRetry retries 5xx and network errors
- 401 triggers token refresh with fresh client
- Non-retryable 4xx fails immediately with cancel
- Retry exhaustion cancels orphaned deploy
- Token refresh failure cancels and throws
Co-Authored-By: Mastra Code (anthropic/claude-opus-4-6) <noreply@mastra.ai>
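The failure modes listed in this commit can be tied together in one sketch of the retry loop. The names `confirmUploadWithRetry` and `bestEffortCancel` come from the commits above, but every signature here is an assumption: the real helpers in `utils/deploy-upload.ts` work with an API client rather than the injected callbacks used below, and whether a 401 counts against the retry budget is also assumed.

```typescript
// Sketch only: these signatures are assumptions, not the real utils/deploy-upload.ts API.
interface ConfirmResult { ok: boolean; status: number }

interface RetryDeps {
  confirm: (token: string) => Promise<ConfirmResult>; // performs the upload-complete call
  refreshToken: () => Promise<string>;                // obtains a fresh auth token
  cancel: () => Promise<void>;                        // cancels the created deploy
  sleep?: (ms: number) => Promise<void>;              // injectable so tests need not wait
  isRetryableError?: (err: unknown) => boolean;       // classifies thrown errors
}

// Cleanup must never mask the original failure, so errors are swallowed here.
async function bestEffortCancel(cancel: () => Promise<void>): Promise<void> {
  try { await cancel(); } catch { /* ignore: cancellation is best-effort */ }
}

async function confirmUploadWithRetry(token: string, deps: RetryDeps, maxRetries = 3): Promise<void> {
  const sleep = deps.sleep ?? ((ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms)));
  const isRetryableError = deps.isRetryableError ?? (() => true);
  let currentToken = token;
  let lastFailure = "unknown failure";

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (attempt > 0) await sleep(1000 * 2 ** (attempt - 1)); // 1s, 2s, 4s
    try {
      const res = await deps.confirm(currentToken);
      if (res.ok) return;
      lastFailure = `HTTP ${res.status}`;
      if (res.status === 401) {
        try {
          currentToken = await deps.refreshToken();
        } catch {
          lastFailure = "token refresh failed";
          break; // terminal: fall through to cleanup
        }
        continue; // retry with the fresh token
      }
      if (res.status < 500) break; // non-retryable 4xx fails immediately
    } catch (err) {
      lastFailure = err instanceof Error ? err.message : String(err);
      if (!isRetryableError(err)) break;
    }
  }

  // Every failure path ends here, so the deploy is never left orphaned in 'queued'.
  await bestEffortCancel(deps.cancel);
  throw new Error(`upload-complete failed: ${lastFailure}`);
}
```

Funnelling every terminal outcome through a single cleanup point after the loop is what guarantees the cancel runs for exhaustion, 4xx, and refresh failures alike.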
…n studio uploadDeploy Co-Authored-By: Mastra Code (anthropic/claude-opus-4-6) <noreply@mastra.ai>
89ecacb to cbb4304
This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and publish to npm yourself or [setup this action to publish automatically](https://github.com/changesets/action#with-publishing). If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.
⚠️⚠️⚠️⚠️⚠️⚠️
`main` is currently in **pre mode** so this branch has prereleases rather than normal releases. If you want to exit prereleases, run `changeset pre exit` on `main`.
⚠️⚠️⚠️⚠️⚠️⚠️
# Releases
## mastracode@0.14.0-alpha.5
### Patch Changes
- Updated dependencies \[[`5cf84a3`](5cf84a3)]:
  - @mastra/mcp@1.5.0-alpha.1
## @mastra/arize@1.0.17-alpha.3
### Patch Changes
- dependencies updates: ([#15282](#15282))
  - Updated dependency [`@arizeai/openinference-genai@0.1.7` ↗︎](https://www.npmjs.com/package/@arizeai/openinference-genai/v/0.1.7) (from `0.1.6`, in `dependencies`)
## @mastra/arthur@0.2.3-alpha.3
### Patch Changes
- dependencies updates: ([#15282](#15282))
  - Updated dependency [`@arizeai/openinference-genai@0.1.7` ↗︎](https://www.npmjs.com/package/@arizeai/openinference-genai/v/0.1.7) (from `0.1.6`, in `dependencies`)
## mastra@1.6.0-alpha.5
### Patch Changes
- Fixed `mastra server deploy` getting stuck in `queued` after transient upload confirmation failures. The CLI now retries these failures and cleans up failed deploys so they no longer remain orphaned. ([#15371](#15371))
## @mastra/mcp@1.5.0-alpha.1
### Patch Changes
- Preserve forwarded MCP client elicitation capabilities so client-supported URL and form elicitations work correctly. ([#15233](#15233))
## create-mastra@1.6.0-alpha.5
## @internal/playground@1.6.0-alpha.5
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Server deploys could get permanently stuck in 'queued' status if the upload-complete call (Step 3 of uploadServerDeploy) hit a transient failure (network error, 5xx, or expired token). There was no retry and no cleanup, so the deploy just sat there forever. A customer hit this twice in a row.
Now the CLI retries upload-complete up to 3 times with exponential backoff (1s, 2s, 4s), refreshes the auth token on 401, and cancels the orphaned deploy if retries are exhausted or if artifact upload fails. Also added console.warn messages so the user can actually see what's happening during retries and cleanup instead of staring at a silent spinner.
ELI5
If the final step of uploading a deploy fails (network hiccup or expired login), the deployment could stay stuck. This PR makes the CLI retry the confirmation step, refresh the login if needed, and cancel failed/orphaned deploys so nothing remains stuck.
Changes
Changeset
`queued` when upload confirmation fails. Notes retries, user-visible retry/cleanup logs, and automatic cancellation of orphaned deploys.
CLI Platform API (server & studio)
Types / API contract updates
No exported/public function signatures were intentionally changed.