K3s for Agile Teams: Feature Branches, QA, Staging, and Production on One Cluster
How to run a full agile environment lifecycle on K3s: dynamic feature branch previews, QA gates, staging mirrors, production RBAC, GitOps promotion pipelines, and safe database migrations.
A sprint-based team working on a shared codebase needs more than a single cluster with one running copy of the app. It needs environments: isolated, reproducible, and controlled. One per feature branch for developer testing. One for QA to validate without developers overwriting their work. One that mirrors production exactly for final sign-off. And production itself, locked down so that only the release pipeline can touch it.
Getting this right on a budget is where K3s earns its place. A single well-configured K3s cluster can host all of these environments simultaneously, with hard isolation between them, automated lifecycle management, and the same GitOps discipline that scales to multi-cluster setups when the team grows.
This post builds on the K3s fundamentals and focuses entirely on the agile team workflow: how features move from a branch to production, what happens at each gate, who can touch what, and how the system recovers when something goes wrong.
The Environment Model
The foundational decision is whether to use one cluster with namespace isolation or multiple clusters. For most agile teams under 20 engineers, one cluster with well-separated namespaces is the right starting point. The operational overhead of maintaining multiple clusters is significant and rarely justified until the team size or compliance requirements demand it.
The namespace model for a typical agile team:
cluster: k3s-team
├── ns: feature-pr-142   # ephemeral: created on PR open, deleted on merge
├── ns: feature-pr-167   # ephemeral: created on PR open, deleted on merge
├── ns: qa               # stable: updated on merge to main
├── ns: staging          # stable: mirrors production config, updated on release branch
├── ns: production       # stable: updated only via the approved pipeline
├── ns: monitoring       # Prometheus, Grafana, Loki: observes all environments
└── ns: flux-system      # GitOps controller
Each namespace gets its own:
- Deployments and services (isolated workloads)
- Secrets (environment-specific credentials)
- ResourceQuota (feature namespaces are limited, production has more headroom)
- NetworkPolicy (feature branches cannot reach production databases)
- Traefik IngressRoute (unique subdomain per environment)
The databases run as disposable in-cluster instances inside each feature branch namespace (ephemeral data) and as external managed databases for staging and production (persistent, shared, requiring migration management).
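The NetworkPolicy isolation mentioned above is worth making concrete, since K3s ships with an embedded network policy controller that enforces these rules out of the box. A sketch of a default-deny policy applied to each feature namespace (label selectors and the kube-dns label are assumptions; verify them on your cluster):

```yaml
# Sketch: applied to every feature namespace at creation time.
# Allows same-namespace traffic, ingress from Traefik, and DNS;
# blocks egress to staging/production services and databases.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-feature
  namespace: feature-pr-142 # set per namespace by CI
spec:
  podSelector: {} # all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {} # same-namespace traffic
    - from: # the Traefik ingress controller (kube-system in a default K3s install)
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
  egress:
    - to:
        - podSelector: {} # same-namespace traffic (app -> ephemeral postgres)
    - to: # DNS
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

With this in place, a feature pod that tries to open a connection to postgres.production.svc.cluster.local simply times out.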
Repository Structure
Before the cluster configuration, the repository layout determines how clean the environment separation can be. A monorepo with Kustomize overlays is the most maintainable structure for this pattern:
repo/
├── app/                          # application source code
│   ├── cmd/
│   ├── internal/
│   └── Dockerfile
├── k8s/
│   ├── base/                     # shared manifests (Deployment, Service, etc.)
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── kustomization.yaml
│   │   └── migrations-job.yaml
│   └── overlays/
│       ├── feature/              # template for feature branch environments
│       │   ├── kustomization.yaml
│       │   ├── namespace.yaml
│       │   ├── ingress.yaml
│       │   ├── postgres.yaml     # ephemeral DB for features
│       │   └── resource-quota.yaml
│       ├── qa/
│       │   ├── kustomization.yaml
│       │   ├── namespace.yaml
│       │   ├── ingress.yaml
│       │   └── hpa.yaml
│       ├── staging/
│       │   ├── kustomization.yaml
│       │   ├── namespace.yaml
│       │   ├── ingress.yaml
│       │   └── hpa.yaml
│       └── production/
│           ├── kustomization.yaml
│           ├── namespace.yaml
│           ├── ingress.yaml
│           ├── hpa.yaml
│           └── pdb.yaml          # PodDisruptionBudget
└── .github/
    └── workflows/
        ├── feature-deploy.yml
        ├── qa-deploy.yml
        ├── staging-deploy.yml
        └── production-deploy.yml
The base/ directory contains manifests that are valid across all environments. Overlays patch them with environment-specific values: image tags, replica counts, resource limits, ingress hostnames, and environment variables.
Feature Branch Environments
A feature branch environment is created automatically when a pull request is opened and destroyed when the PR is merged or closed. Each environment gets a unique subdomain derived from the PR number, an ephemeral database seeded with anonymized test data, and its own resource quota to prevent it from consuming the entire cluster.
The Base Kustomization
# k8s/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
- migrations-job.yaml
# k8s/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      initContainers:
        - name: wait-for-db
          image: busybox:1.36
          command: ['sh', '-c', 'until nc -z $DB_HOST 5432; do sleep 2; done']
          env:
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: api-config
                  key: db-host
      containers:
        - name: api
          image: ghcr.io/yourorg/myapp/api:latest # patched by overlay
          ports:
            - containerPort: 8080
          envFrom:
            - configMapRef:
                name: api-config
            - secretRef:
                name: api-secret
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
Feature Overlay
# k8s/overlays/feature/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: feature-pr-PRNUMBER # replaced by CI
resources:
  - ../../base
  - namespace.yaml
  - postgres.yaml
  - ingress.yaml
  - resource-quota.yaml
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 1
    target:
      kind: Deployment
      name: api
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources
        value:
          requests:
            memory: "64Mi"
            cpu: "50m"
          limits:
            memory: "128Mi"
            cpu: "200m"
    target:
      kind: Deployment
      name: api
images:
  - name: ghcr.io/yourorg/myapp/api
    newTag: "GITSHA" # replaced by CI
# k8s/overlays/feature/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: feature-quota
spec:
  hard:
    requests.cpu: "500m"
    requests.memory: 512Mi
    limits.cpu: "1"
    limits.memory: 1Gi
    pods: "10"
# k8s/overlays/feature/postgres.yaml
# ephemeral PostgreSQL: data is disposable
apiVersion: apps/v1
kind: Deployment # Deployment, not StatefulSet: ephemeral is fine for features
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:17-alpine
          env:
            - name: POSTGRES_DB
              value: myapp
            - name: POSTGRES_USER
              value: myapp
            - name: POSTGRES_PASSWORD
              value: feature-local-password # not real creds, feature env only
          ports:
            - containerPort: 5432
      # no persistent volume: data lives only as long as the pod
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
    - port: 5432
# k8s/overlays/feature/ingress.yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: api-feature
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`pr-PRNUMBER.preview.yourteam.dev`)
      kind: Rule
      services:
        - name: api
          port: 80
  tls:
    certResolver: letsencrypt
GitHub Actions: PR Open → Deploy Feature Environment
# .github/workflows/feature-deploy.yml
name: Feature Environment

on:
  pull_request:
    types: [opened, synchronize, reopened]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}/api
  PR_NAMESPACE: feature-pr-${{ github.event.number }}
  PR_HOST: pr-${{ github.event.number }}.preview.yourteam.dev

jobs:
  deploy-feature:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      pull-requests: write # to post a comment with the preview URL
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Generate manifests
        run: |
          cd k8s/overlays/feature
          # replace placeholders
          sed -i "s/feature-pr-PRNUMBER/${{ env.PR_NAMESPACE }}/g" kustomization.yaml namespace.yaml
          sed -i "s/pr-PRNUMBER/pr-${{ github.event.number }}/g" ingress.yaml
          sed -i "s/GITSHA/${{ github.sha }}/g" kustomization.yaml
          # build final manifests
          kustomize build . > /tmp/feature-manifests.yaml

      - name: Deploy to K3s
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          kubectl apply -f /tmp/feature-manifests.yaml
          # wait for the rollout to finish
          kubectl rollout status deployment/api \
            --namespace ${{ env.PR_NAMESPACE }} \
            --timeout=3m

      - name: Seed test data
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          # run a seed job in the namespace
          JOB_NAME=seed-$(date +%s)
          kubectl create job "$JOB_NAME" \
            --namespace ${{ env.PR_NAMESPACE }} \
            --image=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            -- /app/seed --env=feature
          kubectl wait job/"$JOB_NAME" \
            --namespace ${{ env.PR_NAMESPACE }} \
            --for=condition=complete \
            --timeout=2m || true

      - name: Comment preview URL on PR
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `**Preview environment deployed.**\n\n` +
                `URL: https://${{ env.PR_HOST }}\n` +
                `Namespace: \`${{ env.PR_NAMESPACE }}\`\n` +
                `Image: \`${{ github.sha }}\`\n\n` +
                `This environment will be destroyed when the PR is merged or closed.`
            })
GitHub Actions: PR Closed → Destroy Feature Environment
# .github/workflows/feature-cleanup.yml
name: Cleanup Feature Environment

on:
  pull_request:
    types: [closed]

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Delete namespace
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          NAMESPACE=feature-pr-${{ github.event.number }}
          if kubectl get namespace "$NAMESPACE" &>/dev/null; then
            kubectl delete namespace "$NAMESPACE" --timeout=2m
            echo "Deleted namespace: $NAMESPACE"
          else
            echo "Namespace $NAMESPACE not found, skipping"
          fi
Deleting the namespace cascades: all deployments, services, pods, PVCs, secrets, and configmaps inside it are deleted automatically.
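Cleanup can still be missed: the workflow can fail, or a PR can close while the runner is down. A small scheduled sweep that matches the naming convention catches orphans. This is a sketch; the kubectl and gh calls are left as comments because they need cluster and GitHub access, and the script name is hypothetical:

```shell
#!/usr/bin/env bash
# sweep-feature-namespaces.sh: delete feature namespaces whose PR is closed.
set -euo pipefail

# Matches the feature-pr-<number> convention used by the workflows above.
is_feature_ns() {
  [[ "$1" =~ ^feature-pr-[0-9]+$ ]]
}

# Extracts the PR number from a feature namespace name.
pr_number() {
  echo "${1#feature-pr-}"
}

# Cluster/GitHub-dependent part, sketched as comments:
# for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
#   is_feature_ns "$ns" || continue
#   state=$(gh pr view "$(pr_number "$ns")" --json state -q .state)
#   [ "$state" != "OPEN" ] && kubectl delete namespace "$ns" --timeout=2m
# done
```

Run it from a nightly CronJob or a scheduled workflow; because it keys off the namespace naming convention, it never touches qa, staging, or production.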
QA Environment
The QA namespace is stable: it does not get recreated per PR. It is updated automatically when code merges to the main branch. QA engineers own this environment: they run regression suites and exploratory testing, and sign off on features before they go to staging.
Key differences from feature environments:
- External database (persistent PostgreSQL, not ephemeral), shared by QA testers
- Stable test data maintained by QA (not overwritten on every deploy)
- Two replicas for realistic load testing
- Auto-scaling disabled (so load tests are predictable)
- Test results must pass before promotion to staging is allowed
# k8s/overlays/qa/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: qa
resources:
  - ../../base
  - namespace.yaml
  - ingress.yaml
  - hpa.yaml
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 2
    target:
      kind: Deployment
      name: api
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources
        value:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "500m"
    target:
      kind: Deployment
      name: api
images:
  - name: ghcr.io/yourorg/myapp/api
    newTag: "GITSHA"
configMapGenerator:
  - name: api-config
    literals:
      - db-host=postgres.qa.svc.cluster.local # Service fronting the persistent QA database
      - app-env=qa
      - log-level=debug
generatorOptions:
  disableNameSuffixHash: true # CI-created Jobs reference the literal name api-config
# k8s/overlays/qa/ingress.yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: api-qa
  namespace: qa
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`qa.yourteam.dev`)
      kind: Rule
      middlewares:
        - name: basic-auth # the QA endpoint is not public
          namespace: qa
      services:
        - name: api
          port: 80
  tls:
    certResolver: letsencrypt
---
# basic auth: prevent public access to QA
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: basic-auth
  namespace: qa
spec:
  basicAuth:
    secret: qa-basic-auth-secret
Deploy to QA on Merge to Main
# .github/workflows/qa-deploy.yml
name: QA Deploy

on:
  push:
    branches: [main]

jobs:
  deploy-qa:
    runs-on: ubuntu-latest
    environment: qa # GitHub environment with required reviewers if needed
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/api:${{ github.sha }}
            ghcr.io/${{ github.repository }}/api:qa-latest

      - name: Run database migrations (QA)
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          # run migrations as a Job before updating the Deployment
          cat << EOF | kubectl apply -f -
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: migrate-${{ github.sha }}
            namespace: qa
          spec:
            ttlSecondsAfterFinished: 300
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: migrate
                    image: ghcr.io/${{ github.repository }}/api:${{ github.sha }}
                    command: ["/app/migrate", "up"]
                    envFrom:
                      - secretRef:
                          name: api-secret
                      - configMapRef:
                          name: api-config
          EOF
          # wait for the migration to complete
          kubectl wait job/migrate-${{ github.sha }} \
            --namespace qa \
            --for=condition=complete \
            --timeout=5m

      - name: Deploy to QA
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          kubectl set image deployment/api \
            api=ghcr.io/${{ github.repository }}/api:${{ github.sha }} \
            --namespace qa
          kubectl rollout status deployment/api \
            --namespace qa \
            --timeout=5m

      - name: Run smoke tests against QA
        run: |
          # give the deployment time to stabilize
          sleep 15
          # QA sits behind the basic-auth middleware, so authenticate;
          # QA_BASIC_AUTH is a user:password secret matching qa-basic-auth-secret
          curl -sf -u "${{ secrets.QA_BASIC_AUTH }}" https://qa.yourteam.dev/health
          curl -sf -u "${{ secrets.QA_BASIC_AUTH }}" https://qa.yourteam.dev/api/v1/status
          echo "QA smoke tests passed"

      - name: Notify QA team
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "QA deployed: `${{ github.sha }}` - ${{ github.event.head_commit.message }}\nhttps://qa.yourteam.dev"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_QA_WEBHOOK }}
QA Gates
The QA gate is the checkpoint between QA and staging. It must be explicit, not automatic. The process:
1. A QA engineer runs the full regression suite against qa.yourteam.dev
2. The QA engineer approves the GitHub environment (or leaves a comment on the PR used to track the release)
3. The staging deploy workflow requires that approval to proceed
In GitHub, configure the staging environment to require a named reviewer: the QA lead or a QA engineer. This creates a pause in the pipeline that cannot be bypassed by pushing code.
# in the staging deploy workflow
jobs:
  deploy-staging:
    environment:
      name: staging
      url: https://staging.yourteam.dev
    # GitHub will pause here until someone with access to the 'staging'
    # environment approves the deployment in the Actions UI
Staging Environment
Staging is a production mirror. It runs the same replica count, the same resource limits, the same database engine version, the same secrets structure, and the same ingress configuration as production. The only difference is the domain name and the data (anonymized copies of production data, refreshed on a schedule).
This fidelity is the point. A bug that only appears under production-like conditions β a query that is slow on large data sets, a race condition that requires multiple replicas, a timeout that only triggers under real traffic patterns β must surface in staging, not production.
# k8s/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: staging
resources:
  - ../../base
  - namespace.yaml
  - ingress.yaml
  - hpa.yaml
  - pdb.yaml
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 3 # same as production
    target:
      kind: Deployment
      name: api
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources
        value:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
    target:
      kind: Deployment
      name: api
images:
  - name: ghcr.io/yourorg/myapp/api
    newTag: "GITSHA"
configMapGenerator:
  - name: api-config
    literals:
      - db-host=postgres.staging.svc.cluster.local
      - app-env=staging
      - log-level=info # same as production: info, not debug
generatorOptions:
  disableNameSuffixHash: true # CI-created Jobs reference the literal name api-config
# k8s/overlays/staging/pdb.yaml
# PodDisruptionBudget: at least 2 pods must stay running during node maintenance
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: staging
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
# k8s/overlays/staging/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: staging
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Staging Data Refresh
Staging with stale or empty data misses whole categories of bugs. Set up a weekly job that copies and anonymizes production data into staging:
# k8s/overlays/staging/data-refresh-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: staging-data-refresh
  namespace: staging
spec:
  schedule: "0 2 * * 0" # every Sunday at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: staging-data-refresh
          containers:
            - name: refresh
              image: ghcr.io/yourorg/myapp/tools:latest
              command:
                - /app/tools
                - refresh-staging-data
                - --source=production
                - --target=staging
                - --anonymize-pii
              envFrom:
                - secretRef:
                    name: data-refresh-secret
The --anonymize-pii flag is not optional. Copying real user data to a less-controlled environment without anonymization is a GDPR/CCPA violation. The refresh tool must replace emails, names, phone numbers, and any other PII with generated equivalents before writing to staging.
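What "generated equivalents" means in practice: deterministic, clearly fake values derived from the row's primary key, so refreshes are reproducible and foreign keys stay intact. A sketch of the idea; the function names and the psql invocation are illustrative, not the actual /app/tools implementation:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Deterministic fake values derived from a stable id: no PII survives,
# and the same id always maps to the same replacement across refreshes.
anon_email() { echo "user-$1@staging.invalid"; }
anon_phone() { printf '+1555%07d\n' "$(( $1 % 10000000 ))"; }

# Applied in bulk inside the refresh job, e.g. (assumed DSN variable):
# psql "$STAGING_DSN" <<'SQL'
# UPDATE users SET
#   email     = 'user-' || id || '@staging.invalid',
#   full_name = 'User '  || id,
#   phone     = '+1555' || lpad((id % 10000000)::text, 7, '0');
# SQL
```

Using the reserved .invalid TLD guarantees that a staging bug can never email a real user.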
Production Environment
Production has the strictest controls. No developer can deploy to it directly. The only path to production is through the pipeline, and the pipeline requires QA sign-off on staging.
# k8s/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base
  - namespace.yaml
  - ingress.yaml
  - hpa.yaml
  - pdb.yaml
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
    target:
      kind: Deployment
      name: api
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources
        value:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
    target:
      kind: Deployment
      name: api
  - patch: |-
      - op: add
        path: /spec/strategy
        value:
          type: RollingUpdate
          rollingUpdate:
            maxSurge: 1
            maxUnavailable: 0 # zero downtime: all 3 replicas stay available
    target:
      kind: Deployment
      name: api
images:
  - name: ghcr.io/yourorg/myapp/api
    newTag: "GITSHA"
# k8s/overlays/production/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2 # even during a node drain, 2 pods must stay running
  selector:
    matchLabels:
      app: api
Production Deploy Workflow
# .github/workflows/production-deploy.yml
name: Production Deploy

on:
  workflow_dispatch: # manual trigger only: no automatic production deploys
    inputs:
      image_sha:
        description: 'Git SHA of the image to deploy (must be tested in staging)'
        required: true
        type: string

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Log in to GHCR # manifest inspect needs auth for private images
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Verify image exists in registry
        run: |
          docker manifest inspect ghcr.io/${{ github.repository }}/api:${{ inputs.image_sha }}

      - name: Verify image was deployed to staging
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          STAGING_SHA=$(kubectl get deployment api \
            --namespace staging \
            -o jsonpath='{.spec.template.spec.containers[0].image}' \
            | cut -d: -f2)
          if [ "$STAGING_SHA" != "${{ inputs.image_sha }}" ]; then
            echo "ERROR: Image ${{ inputs.image_sha }} was not the last staging deploy."
            echo "Staging is running: $STAGING_SHA"
            echo "Deploy to staging first, validate, then promote to production."
            exit 1
          fi
          echo "Staging validation passed. Image matches."

  deploy-production:
    needs: validate
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://api.yourteam.dev
    # the production environment requires 2 named approvers in GitHub settings
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.image_sha }}

      - name: Run database migrations (production)
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          # migrations run before the rolling update starts
          cat << EOF | kubectl apply -f -
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: migrate-${{ inputs.image_sha }}
            namespace: production
          spec:
            ttlSecondsAfterFinished: 3600
            backoffLimit: 0 # no retries: migration failures must be investigated
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: migrate
                    image: ghcr.io/${{ github.repository }}/api:${{ inputs.image_sha }}
                    command: ["/app/migrate", "up"]
                    envFrom:
                      - secretRef:
                          name: api-secret
                      - configMapRef:
                          name: api-config
          EOF
          kubectl wait job/migrate-${{ inputs.image_sha }} \
            --namespace production \
            --for=condition=complete \
            --timeout=10m
          echo "Migrations complete"

      - name: Deploy to production (rolling update)
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          kubectl set image deployment/api \
            api=ghcr.io/${{ github.repository }}/api:${{ inputs.image_sha }} \
            --namespace production
          kubectl rollout status deployment/api \
            --namespace production \
            --timeout=10m

      - name: Production smoke test
        run: |
          sleep 10
          curl -sf https://api.yourteam.dev/health
          echo "Production smoke test passed"

      - name: Notify on success
        if: success()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployed: `${{ inputs.image_sha }}`\nhttps://api.yourteam.dev"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_PROD_WEBHOOK }}

      - name: Rollback on failure
        if: failure()
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          kubectl rollout undo deployment/api --namespace production
          echo "Rolled back production deployment"
The workflow_dispatch trigger with a required image_sha input means production deployments are always intentional and always reference a specific, validated artifact. Nobody accidentally triggers a production deploy by pushing to a branch.
RBAC: Who Can Touch What
Kubernetes RBAC maps to your team structure. The principle is least privilege: developers can see everything but only modify their own feature namespaces. QA can modify the QA namespace. Only the CI service account can modify staging and production.
# rbac/developer-role.yaml
# Developers: read across all namespaces; exec access is granted separately,
# per feature namespace, so it never applies to staging or production
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: developer
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "services", "jobs", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
---
# Bind the developer ClusterRole for read access across all namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: developers-read
subjects:
  - kind: Group
    name: developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: developer
  apiGroup: rbac.authorization.k8s.io
---
# Exec for debugging: a namespaced Role, created alongside each feature
# namespace, so a cluster-wide binding never grants exec into production
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer-debug
  namespace: feature-pr-PRNUMBER # replaced by CI
rules:
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-debug-binding
  namespace: feature-pr-PRNUMBER # replaced by CI
subjects:
  - kind: Group
    name: developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer-debug
  apiGroup: rbac.authorization.k8s.io
# rbac/qa-role.yaml
# QA: full access to the qa namespace, read elsewhere
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: qa-admin
  namespace: qa
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["*"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: qa-team-binding
  namespace: qa
subjects:
  - kind: Group
    name: qa-engineers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: qa-admin
  apiGroup: rbac.authorization.k8s.io
# rbac/ci-production-role.yaml
# CI service account: can update deployments in staging and production
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-deploy
  namespace: production
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "patch", "list"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deploy-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: github-ci
    namespace: kube-system
roleRef:
  kind: Role
  name: ci-deploy
  apiGroup: rbac.authorization.k8s.io
No developer has write access to the staging or production namespaces. This is enforced at the API level; it cannot be bypassed by running kubectl apply manually.
Database Migrations in the Pipeline
Migrations are the most dangerous part of any deployment. An irreversible migration can leave you unable to roll back if the new code breaks. The safe pattern: every migration must be backward compatible with the previous version of the application code.
The rules:
- Never drop a column or table in the same migration that replaces it. The first deploy adds the new structure and keeps the old; a later deploy removes the old.
- Never add a NOT NULL column without a default. Add it with a default → deploy → backfill → make it non-null in a third migration.
- Never rename a column directly. Add the new column → deploy code that reads from both → migrate the data → remove the old column in a later release.
This allows a safe rollback pattern: if the new deployment breaks, roll back the code but do not roll back the migration. The previous version of the code still works with the new schema because the migration was backward compatible.
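For example, the rename rule plays out across three releases. Illustrative SQL; the table and column names are made up:

```sql
-- Release 1 (expand): old code ignores the new column; new code writes both.
ALTER TABLE users ADD COLUMN full_name text;

-- Release 2 (backfill): runs while both columns are being maintained.
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- Release 3 (contract): only once no deployed code reads the old column.
ALTER TABLE users DROP COLUMN name;
```

At every point in this sequence, both the currently deployed code and the previous release work against the live schema, which is exactly what makes rollback safe.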
# k8s/base/migrations-job.yaml
# applied as a separate Job before kubectl set image
# (the pattern used by the CI workflows above)
apiVersion: batch/v1
kind: Job
metadata:
  name: migrate # name is patched per deploy by kustomize or CI
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 300
  template:
    spec:
      restartPolicy: OnFailure
      initContainers:
        - name: wait-for-db
          image: busybox:1.36
          command: ['sh', '-c', 'until nc -z $DB_HOST 5432; do sleep 2; done']
          env:
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: api-config
                  key: db-host
      containers:
        - name: migrate
          image: ghcr.io/yourorg/myapp/api:latest # patched by CI
          command: ["/app/migrate", "up"]
          envFrom:
            - secretRef:
                name: api-secret
            - configMapRef:
                name: api-config
In the CI pipeline, the sequence is always:
1. Apply the migration Job → wait for completion
2. Only if migrations succeed → kubectl set image (rolling update)
3. If the rolling update fails → kubectl rollout undo (no migration rollback needed; the schema is backward compatible)
GitOps Promotion with Flux
For teams that want Git as the single source of truth for every environment β not just production β Flux can manage the full promotion chain. Each environment has its own Kustomization resource that watches a different path or branch.
Git repository structure for GitOps:
infrastructure/
└── clusters/
    └── team-cluster/
        ├── flux-system/      # Flux bootstrap manifests
        ├── qa.yaml           # Kustomization watching overlays/qa
        ├── staging.yaml      # Kustomization watching overlays/staging
        └── production.yaml   # Kustomization watching overlays/production
# infrastructure/clusters/team-cluster/qa.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: myapp-qa
  namespace: flux-system
spec:
  interval: 2m
  path: ./k8s/overlays/qa
  prune: true
  sourceRef:
    kind: GitRepository
    name: myapp
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: api
      namespace: qa
  postBuild:
    substitute:
      ENVIRONMENT: qa
# infrastructure/clusters/team-cluster/production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: myapp-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./k8s/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: myapp
  suspend: false # set to true to pause auto-sync during incidents
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: api
      namespace: production
  timeout: 10m
With this setup, promoting from staging to production becomes a Git operation: update the image tag in k8s/overlays/production/kustomization.yaml via a PR, get it reviewed and merged, and Flux applies it. The audit trail is the Git history.
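That Git operation is easy to wrap in a helper so nobody edits YAML by hand. A sketch; the script name, branch convention, and gh usage are assumptions, and `kustomize edit set image` is the standard way to bump the tag:

```shell
#!/usr/bin/env bash
# promote.sh: open a promotion PR for a staging-validated SHA (hypothetical helper).
set -euo pipefail

# Branch name for promoting a given SHA; the short SHA keeps it readable.
promote_branch() {
  echo "promote-production-${1:0:7}"
}

# Repo-dependent steps, sketched as comments:
# SHA="$1"; BRANCH="$(promote_branch "$SHA")"
# git checkout -b "$BRANCH"
# (cd k8s/overlays/production && kustomize edit set image "ghcr.io/yourorg/myapp/api:${SHA}")
# git commit -am "Promote ${SHA:0:7} to production"
# git push -u origin "$BRANCH"
# gh pr create --title "Promote ${SHA:0:7} to production" --body "Validated in staging."
```

The PR review on this one-line diff is the production approval; Flux does the rest once it merges.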
Observability Per Environment
Each environment needs metrics and logs, but the alerting rules differ. Feature environments generate noise if you alert on them. Production alerts must wake someone up.
# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
spec:
  groups:
    - name: api.production
      interval: 30s
      rules:
        # this alert only fires for the production namespace
        - alert: ProductionHighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="production", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="production"}[5m])) > 0.05
          for: 2m
          labels:
            severity: critical
            environment: production
          annotations:
            summary: "Production error rate above 5%"
            runbook: "https://wiki.yourteam.dev/runbooks/high-error-rate"
        - alert: ProductionPodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace="production"}[15m]) > 3
          for: 5m
          labels:
            severity: critical
            environment: production
    - name: api.staging
      rules:
        # staging alerts: warning only, no paging
        - alert: StagingHighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="staging", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="staging"}[5m])) > 0.10
          for: 5m
          labels:
            severity: warning
            environment: staging
In Grafana, create separate dashboards per environment using namespace as a variable:
Dashboard variable: namespace
Values: production, staging, qa
Default: production
Panel query:
rate(http_requests_total{namespace="$namespace"}[5m])
One dashboard, every environment, one dropdown to switch between them. QA and developers can check their environment's health without touching production dashboards.
The Full Feature Lifecycle
Putting it all together, the path a feature takes from code to production:
1. Developer creates a feature branch from main
   → PR opened
   → GitHub Actions builds the image, creates namespace feature-pr-142
   → Ephemeral PostgreSQL deployed, test data seeded
   → Preview URL posted as a PR comment: https://pr-142.preview.yourteam.dev
   → Developer iterates, pushes commits, CI rebuilds and redeploys
2. PR review
   → Code review by peers
   → Developer can share the preview URL with stakeholders for early feedback
   → Tests run in CI (unit, integration) and must pass before merge
3. PR merged to main
   → Namespace feature-pr-142 deleted (all resources cascade)
   → CI builds the final image for the commit SHA
   → Image pushed to GHCR with the SHA tag
   → QA deploy workflow triggers
   → Migrations run against the QA database
   → QA namespace updated to the new image
   → Slack notification to the QA channel
4. QA validation
   → QA engineer tests qa.yourteam.dev
   → Runs the regression suite
   → Files bugs as issues if found (back to step 1 for fixes)
   → When satisfied, approves the GitHub 'staging' environment
5. Staging deploy (requires QA approval)
   → Pipeline resumes after approval
   → Migrations run against the staging database
   → Staging namespace updated
   → Automated smoke tests run
   → Slack notification to the team channel
   → Release manager validates staging.yourteam.dev
6. Production deploy (manual, requires an image SHA)
   → Release manager triggers the production workflow manually
   → Specifies the image SHA validated in staging
   → Pipeline validates that the SHA matches the staging deployment
   → GitHub 'production' environment requires 2 approvers
   → After approval: migrations run, rolling update begins
   → Zero downtime: maxUnavailable=0 keeps the full replica count available
   → Smoke tests verify the production endpoint
   → Team watches error rates for 30 minutes post-deploy
   → If the error rate spikes: roll back with kubectl rollout undo
This model (ephemeral feature environments, stable QA, production-mirror staging, locked-down production) is not about process for its own sake. It exists because the cost of a production bug is orders of magnitude higher than the cost of finding it in QA or staging. Every gate is a risk-reduction mechanism. Every piece of automation removes a manual step that would otherwise be skipped under deadline pressure.
The best time to find a bug is before it reaches production. The environment model exists to make that possible every time, not just when someone remembers to test.