This guide documents common deployment issues and how to debug them using the Control Plane MCP (Model Context Protocol) tools.
Symptoms:
- Site returning 503 errors
- Container in crash loop (134+ restarts)
- Exit code 1 with no obvious error message in Control Plane UI
Two issues were identified:
Problem: Capacity AI reduced memory from 512Mi to 32Mi, which is insufficient for a Rails application to boot.
How it happened:
- Container crashed during startup (due to the missing SECRET_KEY_BASE described below)
- Capacity AI observed low memory usage before crash
- Capacity AI reduced memory allocation
- Smaller allocation caused faster crashes
- Feedback loop continued until memory was reduced to 32Mi
Fix:
# Disable Capacity AI and reset memory via MCP tools:
mcp__cpln__update_workload(
  gvc="react-webpack-rails-tutorial-staging",
  name="rails",
  capacityAI=false,
  memory="512Mi",
  cpu="300m"
)

Problem: Rails 8.1+ requires SECRET_KEY_BASE at runtime. Previous versions were more lenient.
Error in logs:
Missing `secret_key_base` for 'production' environment, set this string with `bin/rails credentials:edit` (ArgumentError)
Fix:
# Add SECRET_KEY_BASE to the GVC environment:
# Generate key: openssl rand -hex 64
mcp__cpln__cpln_resource_operation(
  kind="gvc",
  operation="patch",
  name="react-webpack-rails-tutorial-staging",
  body={
    "spec": {
      "env": [
        # ... existing env vars ...
        {"name": "SECRET_KEY_BASE", "value": "<generated-key>"}
      ]
    }
  }
)

The Control Plane MCP provides tools for investigating deployment issues without needing direct CLI access.
mcp__cpln__set_context(
  org="shakacode-open-source-examples-staging",
  defaultGvc="react-webpack-rails-tutorial-staging"
)

# List all workloads
mcp__cpln__list_workloads(gvc="react-webpack-rails-tutorial-staging")
# Get deployment status for a specific workload
mcp__cpln__get_workload_deployments(
  gvc="react-webpack-rails-tutorial-staging",
  name="rails"
)

Key fields to check:
- status.ready - Is the deployment healthy?
- status.versions[].containers.rails.restarts - How many restarts?
- status.versions[].containers.rails.resources.memory - Current memory allocation
- status.message - Error messages
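The key fields can be pulled out of a deployment record programmatically. The following is a minimal sketch, assuming the tool output has been parsed into a Python dict whose layout follows the field paths listed above; the function name `summarize_deployment` and the sample payload are hypothetical and not part of the MCP tooling.

```python
from typing import Any


def summarize_deployment(deployment: dict[str, Any], container: str = "rails") -> dict[str, Any]:
    """Collect readiness, message, and per-version restart/memory figures."""
    status = deployment.get("status", {})
    per_version = []
    for version in status.get("versions", []):
        info = version.get("containers", {}).get(container, {})
        per_version.append({
            "restarts": info.get("restarts"),
            "memory": info.get("resources", {}).get("memory"),
        })
    return {
        "ready": status.get("ready"),
        "message": status.get("message"),
        "versions": per_version,
    }


# Hypothetical payload shaped after the field paths above (not real tool output).
sample = {
    "status": {
        "ready": False,
        "message": "example error message",
        "versions": [
            {"containers": {"rails": {"restarts": 134, "resources": {"memory": "32Mi"}}}}
        ],
    }
}
print(summarize_deployment(sample))
```

A high restart count combined with a very small memory value is the pattern described in the Capacity AI issue earlier in this guide.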
mcp__cpln__get_workload_events(
  gvc="react-webpack-rails-tutorial-staging",
  name="rails"
)

Events show Capacity AI changes, errors, and other significant occurrences.
Use the cpflow CLI to get actual container logs:
cpflow logs -a react-webpack-rails-tutorial-staging

This reveals application-level errors like missing environment variables.
mcp__cpln__cpln_resource_operation(
  kind="gvc",
  operation="get",
  name="react-webpack-rails-tutorial-staging"
)

Verify all required environment variables are present in spec.env.
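The spec.env check can be automated once the GVC has been fetched. This is a small sketch, assuming the GVC is available as a parsed dict with `spec.env` entries of the form `{"name": ..., "value": ...}` as in the patch example earlier; the helper name `missing_env_vars` and the required-variable list are illustrative assumptions.

```python
def missing_env_vars(gvc: dict, required: list[str]) -> list[str]:
    """Return the required variable names that are absent from spec.env."""
    entries = gvc.get("spec", {}).get("env") or []
    present = {entry.get("name") for entry in entries}
    return [name for name in required if name not in present]


# Hypothetical GVC document; values are placeholders.
gvc = {"spec": {"env": [{"name": "DATABASE_URL", "value": "<redacted>"}]}}
print(missing_env_vars(gvc, ["DATABASE_URL", "SECRET_KEY_BASE"]))  # prints ['SECRET_KEY_BASE']
```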
Verify postgres and redis are healthy:
mcp__cpln__get_workload_deployments(gvc="...", name="postgres")
mcp__cpln__get_workload_deployments(gvc="...", name="redis")

Symptoms: Crash loop, Capacity AI events showing decreasing memory/CPU
Fix:
mcp__cpln__update_workload(
  gvc="...",
  name="rails",
  capacityAI=false,  # Disable Capacity AI
  memory="512Mi",    # Set appropriate memory
  cpu="300m"         # Set appropriate CPU
)

Symptoms: Exit code 1, application errors in logs
Fix: Add missing variables to GVC env via cpln_resource_operation patch.
Symptoms: Connection refused errors in logs
Check:
- Postgres workload is healthy
- DATABASE_URL is correct in GVC env
- Firewall rules allow internal traffic
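A quick structural check of DATABASE_URL can rule out obvious typos before you investigate the network. The sketch below uses only the Python standard library and assumes a Postgres-style URL; the function name `database_url_problems` and the example host are hypothetical, and the check validates the URL's shape only, not whether the database is reachable.

```python
from urllib.parse import urlsplit


def database_url_problems(url: str) -> list[str]:
    """Return a list of structural problems found in a Postgres connection URL."""
    try:
        parts = urlsplit(url)
        _ = parts.port  # accessing .port raises ValueError for a non-numeric port
    except ValueError as exc:
        return [f"unparseable URL: {exc}"]

    problems = []
    if parts.scheme not in ("postgres", "postgresql"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        problems.append("missing host")
    if not parts.path.strip("/"):
        problems.append("missing database name")
    return problems


# Hypothetical URL; the host name is a placeholder.
print(database_url_problems("postgres://user:pw@db.internal:5432/app"))  # prints []
```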
Symptoms: ImagePullBackOff status
Check:
mcp__cpln__list_images(org="...")
mcp__cpln__get_image(org="...", name="<image-name>")

- Set minimum resource limits - Don't let Capacity AI reduce below safe thresholds
- Test environment variables - Ensure all required env vars are present before deploying
- Monitor deployments - Use MCP tools to check deployment health after releases
- Keep logs accessible - Use cpflow logs to diagnose application errors
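The first point, keeping memory above a safe threshold, can be turned into a simple guard. This is a sketch only: it assumes memory values are strings with binary suffixes such as "32Mi" or "512Mi" as shown in this guide, and it takes 512Mi as the floor because that is the value restored in the fix above. The helper names are hypothetical.

```python
# Binary unit suffixes as used in values like "512Mi" (an assumption about the format).
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def memory_bytes(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to bytes; bare integers are taken as bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def below_floor(current: str, floor: str = "512Mi") -> bool:
    """True when the current allocation is smaller than the chosen floor."""
    return memory_bytes(current) < memory_bytes(floor)


print(below_floor("32Mi"))   # prints True
print(below_floor("512Mi"))  # prints False
```

Running a check like this against the memory value reported for each deployment gives an early warning before a workload shrinks into a crash loop.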
| Tool | Purpose |
|---|---|
| set_context | Set default org/GVC for session |
| list_workloads | List all workloads in a GVC |
| get_workload | Get workload configuration |
| get_workload_deployments | Get deployment status and health |
| get_workload_events | Get event log for debugging |
| update_workload | Update workload settings |
| list_secrets | List available secrets |
| cpln_resource_operation | Generic CRUD for any resource |
CRITICAL: Before promoting staging to production, ensure both environments have matching configurations.
- Environment Variables Match - All required env vars in staging GVC must also exist in production GVC
- Secrets Configured - Any secrets referenced by workloads exist in production org
- Resource Limits Set - Production workloads have appropriate CPU/memory (not relying on Capacity AI)
- Database Migrations Safe - Release script migrations are backwards-compatible
Incident (Feb 2026): Staging was fixed by adding SECRET_KEY_BASE to the staging GVC. When the image was promoted to production, the image itself worked, but the production GVC was missing SECRET_KEY_BASE, causing an immediate crash.
Prevention:
- When adding env vars to staging, immediately add to production (even before promotion)
- Use a checklist or automation to sync GVC env vars between environments
- Consider storing shared secrets in a central location referenced by both GVCs
Recovery:
# Generate production secret (use different value than staging!)
export PROD_SECRET_KEY=$(openssl rand -hex 64)
# Add to production GVC
cpln gvc patch react-webpack-rails-tutorial-production \
  --org shakacode-open-source-examples-production \
  --set "spec.env[+].name=SECRET_KEY_BASE" \
  --set "spec.env[-1].value=$PROD_SECRET_KEY"

- Fix staging - Diagnose and fix the issue
- Document the fix - Note any configuration changes made
- Apply to production FIRST - Add any new env vars/secrets to production before promotion
- Then promote - Run the promotion workflow
- Verify production - Check deployment health after promotion
The promotion workflow includes automatic rollback protection. If a deployment fails health checks, it will automatically restore the previous working image.
1. Capture current production image → Save for potential rollback
2. Copy new image from staging → Get the new version
3. Deploy to production → Run release phase (migrations)
4. Health check (12 retries) → Verify deployment is responding
5a. If healthy → Create GitHub release
5b. If unhealthy → Rollback to previous image
Automatic rollback occurs when:
- Health check fails after 12 attempts (2 minutes total)
- HTTP endpoint returns non-2xx/3xx status
- Deployment times out
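The health-check and rollback logic above can be sketched as a retry loop. This is an illustration under stated assumptions, not the workflow's actual implementation: the 10-second interval is inferred from "12 attempts (2 minutes total)", and the names `is_healthy`, `wait_until_healthy`, and `promote` are hypothetical. The `sleep` parameter exists only so the loop can be exercised without waiting.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """True when the endpoint answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # HTTP 4xx/5xx, refused connections, and timeouts all count as unhealthy.
        return False


def wait_until_healthy(check: Callable[[], bool], retries: int = 12,
                       interval: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run the check up to `retries` times, pausing `interval` seconds between attempts."""
    for attempt in range(1, retries + 1):
        if check():
            return True
        if attempt < retries:
            sleep(interval)
    return False


def promote(check: Callable[[], bool], rollback: Callable[[], None],
            retries: int = 12, interval: float = 10.0,
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Return 'released' when healthy; otherwise call rollback and return 'rolled back'."""
    if wait_until_healthy(check, retries, interval, sleep):
        return "released"
    rollback()
    return "rolled back"
```

In use, `check` would wrap `is_healthy` with the production URL, and `rollback` would redeploy the image captured in step 1.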
If you need to roll back manually:
# Get the previous image from GitHub Actions logs or:
cpln workload get rails \
  --gvc react-webpack-rails-tutorial-production \
  --org shakacode-open-source-examples-production \
  -o json | jq '.version'
# List available images
cpln image list --org shakacode-open-source-examples-production
# Roll back to a specific image (the --set argument is quoted so the shell does not glob-expand [0])
cpln workload update rails \
  --gvc react-webpack-rails-tutorial-production \
  --org shakacode-open-source-examples-production \
  --set 'spec.containers[0].image=/org/shakacode-open-source-examples-production/image/react-webpack-rails-tutorial-production:<tag>'

# Get current workload to see image history
mcp__cpln__get_workload(gvc="...", name="rails")
# Update to previous image
mcp__cpln__update_workload(
  gvc="react-webpack-rails-tutorial-production",
  name="rails",
  image="/org/.../image/react-webpack-rails-tutorial-production:<previous-tag>"
)

Consider adding a pre-promotion check to the GitHub Action:
- name: Verify Production Environment
  run: |
    # Compare staging and production GVC env vars
    STAGING_VARS=$(cpln gvc get react-webpack-rails-tutorial-staging \
      --org shakacode-open-source-examples-staging -o json | jq -r '.spec.env[].name' | sort)
    PROD_VARS=$(cpln gvc get react-webpack-rails-tutorial-production \
      --org shakacode-open-source-examples-production -o json | jq -r '.spec.env[].name' | sort)
    MISSING=$(comm -23 <(echo "$STAGING_VARS") <(echo "$PROD_VARS"))
    if [ -n "$MISSING" ]; then
      echo "ERROR: Production is missing these env vars from staging:"
      echo "$MISSING"
      exit 1
    fi

By default, the Control Plane MCP uses a service account that may only have access to one organization. To enable AI-assisted debugging across both staging and production, you need to grant cross-org permissions.
The MCP authenticates using a service account token. For example:
Service Account: /org/shakacode-open-source-examples-staging/serviceaccount/claude
This service account has full access to staging but needs explicit permissions for production.
Run these commands to grant the staging service account access to production:
# Step 1: Create a policy in the PRODUCTION org that grants access to the staging service account
cpln policy create mcp-claude-access \
  --org shakacode-open-source-examples-production \
  --description "Grants MCP claude service account access for AI-assisted debugging" \
  --target-kind gvc \
  --target-links "//gvc/react-webpack-rails-tutorial-production" \
  --bindings "/org/shakacode-open-source-examples-staging/serviceaccount/claude=edit"
# Step 2: Also grant access to workloads within the GVC
cpln policy create mcp-claude-workload-access \
  --org shakacode-open-source-examples-production \
  --description "Grants MCP claude service account workload access" \
  --target-kind workload \
  --target-all \
  --gvc react-webpack-rails-tutorial-production \
  --bindings "/org/shakacode-open-source-examples-staging/serviceaccount/claude=edit"
# Step 3: Grant access to view/manage secrets if needed
cpln policy create mcp-claude-secret-access \
  --org shakacode-open-source-examples-production \
  --description "Grants MCP claude service account secret access" \
  --target-kind secret \
  --target-all \
  --bindings "/org/shakacode-open-source-examples-staging/serviceaccount/claude=view"

- Go to the shakacode-open-source-examples-production org
- Navigate to Policies
- Create a new policy:
  - Name: mcp-claude-access
  - Target Kind: gvc
  - Target: react-webpack-rails-tutorial-production
  - Bindings: Add /org/shakacode-open-source-examples-staging/serviceaccount/claude with edit permission
After setting up permissions, test with MCP:
# Set context to production
mcp__cpln__set_context(
  org="shakacode-open-source-examples-production",
  defaultGvc="react-webpack-rails-tutorial-production"
)
# Try to list workloads - should work now
mcp__cpln__list_workloads(gvc="react-webpack-rails-tutorial-production")

- Least privilege: Only grant the permissions needed (view vs edit vs manage)
- Audit trail: All MCP actions are logged in Control Plane audit logs
- Separate tokens: For stricter security, create separate service accounts per environment
- Time-limited access: Consider creating temporary policies for incident response
If you prefer separate credentials per environment, configure multiple MCP servers:
{
  "mcpServers": {
    "cpln-staging": {
      "command": "controlplane-mcp",
      "env": {
        "CPLN_TOKEN": "<staging-service-account-token>",
        "CPLN_ORG": "shakacode-open-source-examples-staging"
      }
    },
    "cpln-production": {
      "command": "controlplane-mcp",
      "env": {
        "CPLN_TOKEN": "<production-service-account-token>",
        "CPLN_ORG": "shakacode-open-source-examples-production"
      }
    }
  }
}

This gives you separate tool prefixes: mcp__cpln-staging__* and mcp__cpln-production__*.
- Control Plane MCP Guide
- cpflow Documentation
- Control Plane Official Docs