K3s for Agile Teams: Feature Branches, QA, Staging, and Production on One Cluster
How to run a full agile environment lifecycle on K3s: dynamic feature branch previews, QA gates, staging mirrors, production RBAC, GitOps promotion pipelines, and safe database migrations.
A sprint-based team working on a shared codebase needs more than a single cluster with one running copy of the app. It needs environments: isolated, reproducible, and controlled. One per feature branch for developer testing. One for QA to validate without developers overwriting their work. One that mirrors production exactly for final sign-off. And production itself, locked down so that only the release pipeline can touch it.
Getting this right on a budget is where K3s earns its place. A single well-configured K3s cluster can host all of these environments simultaneously, with hard isolation between them, automated lifecycle management, and the same GitOps discipline that scales to multi-cluster setups when the team grows.
This post builds on the K3s fundamentals and focuses entirely on the agile team workflow: how features move from a branch to production, what happens at each gate, who can touch what, and how the system recovers when something goes wrong.
The Environment Model
The foundational decision is whether to use one cluster with namespace isolation or multiple clusters. For most agile teams under 20 engineers, one cluster with well-separated namespaces is the right starting point. The operational overhead of maintaining multiple clusters is significant and rarely justified until the team size or compliance requirements demand it.
The namespace model for a typical agile team:
cluster: k3s-team
├── ns: feature-pr-142   # ephemeral: created on PR open, deleted on merge
├── ns: feature-pr-167   # ephemeral: created on PR open, deleted on merge
├── ns: qa               # stable: updated on merge to main
├── ns: staging          # stable: mirrors production config, updated on release branch
├── ns: production       # stable: updated only via the approved pipeline
├── ns: monitoring       # Prometheus, Grafana, Loki: observes all environments
└── ns: flux-system      # GitOps controller
Each namespace gets its own:
- Deployments and services (isolated workloads)
- Secrets (environment-specific credentials)
- ResourceQuota (feature namespaces are limited, production has more headroom)
- NetworkPolicy (feature branches cannot reach production databases)
- Traefik IngressRoute (unique subdomain per environment)
The databases run as disposable in-cluster instances inside each feature branch namespace (ephemeral data) and as external managed databases for staging and production (persistent, shared, requiring migration management).
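The NetworkPolicy isolation mentioned above is worth making concrete, since K3s ships with an embedded network policy controller that enforces these rules out of the box. A sketch of a default-deny policy applied to each feature namespace (label selectors and the kube-dns label are assumptions; verify them on your cluster):

```yaml
# Sketch: applied to every feature namespace at creation time.
# Allows same-namespace traffic, ingress from Traefik, and DNS;
# blocks egress to staging/production services and databases.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-feature
  namespace: feature-pr-142 # set per namespace by CI
spec:
  podSelector: {} # all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {} # same-namespace traffic
    - from: # the Traefik ingress controller (kube-system in a default K3s install)
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
  egress:
    - to:
        - podSelector: {} # same-namespace traffic (app -> ephemeral postgres)
    - to: # DNS
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

With this in place, a feature pod that tries to open a connection to postgres.production.svc.cluster.local simply times out.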
Repository Structure
Before the cluster configuration, the repository layout determines how clean the environment separation can be. A monorepo with Kustomize overlays is the most maintainable structure for this pattern:
repo/
├── app/                          # application source code
│   ├── cmd/
│   ├── internal/
│   └── Dockerfile
├── k8s/
│   ├── base/                     # shared manifests (Deployment, Service, etc.)
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── kustomization.yaml
│   │   └── migrations-job.yaml
│   └── overlays/
│       ├── feature/              # template for feature branch environments
│       │   ├── kustomization.yaml
│       │   ├── namespace.yaml
│       │   ├── ingress.yaml
│       │   ├── postgres.yaml     # ephemeral DB for features
│       │   └── resource-quota.yaml
│       ├── qa/
│       │   ├── kustomization.yaml
│       │   ├── namespace.yaml
│       │   ├── ingress.yaml
│       │   └── hpa.yaml
│       ├── staging/
│       │   ├── kustomization.yaml
│       │   ├── namespace.yaml
│       │   ├── ingress.yaml
│       │   └── hpa.yaml
│       └── production/
│           ├── kustomization.yaml
│           ├── namespace.yaml
│           ├── ingress.yaml
│           ├── hpa.yaml
│           └── pdb.yaml          # PodDisruptionBudget
└── .github/
    └── workflows/
        ├── feature-deploy.yml
        ├── qa-deploy.yml
        ├── staging-deploy.yml
        └── production-deploy.yml
The base/ directory contains manifests that are valid across all environments. Overlays patch them with environment-specific values: image tags, replica counts, resource limits, ingress hostnames, and environment variables.
Feature Branch Environments
A feature branch environment is created automatically when a pull request is opened and destroyed when the PR is merged or closed. Each environment gets a unique subdomain derived from the PR number, an ephemeral database seeded with anonymized test data, and its own resource quota to prevent it from consuming the entire cluster.
The Base Kustomization
# k8s/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
- migrations-job.yaml
# k8s/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      initContainers:
        - name: wait-for-db
          image: busybox:1.36
          command: ['sh', '-c', 'until nc -z $DB_HOST 5432; do sleep 2; done']
          env:
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: api-config
                  key: db-host
      containers:
        - name: api
          image: ghcr.io/yourorg/myapp/api:latest # patched by overlay
          ports:
            - containerPort: 8080
          envFrom:
            - configMapRef:
                name: api-config
            - secretRef:
                name: api-secret
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
Feature Overlay
# k8s/overlays/feature/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: feature-pr-PRNUMBER # replaced by CI
resources:
  - ../../base
  - namespace.yaml
  - postgres.yaml
  - ingress.yaml
  - resource-quota.yaml
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 1
    target:
      kind: Deployment
      name: api
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources
        value:
          requests:
            memory: "64Mi"
            cpu: "50m"
          limits:
            memory: "128Mi"
            cpu: "200m"
    target:
      kind: Deployment
      name: api
images:
  - name: ghcr.io/yourorg/myapp/api
    newTag: "GITSHA" # replaced by CI
# k8s/overlays/feature/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: feature-quota
spec:
  hard:
    requests.cpu: "500m"
    requests.memory: 512Mi
    limits.cpu: "1"
    limits.memory: 1Gi
    pods: "10"
# k8s/overlays/feature/postgres.yaml
# ephemeral PostgreSQL: data is disposable
apiVersion: apps/v1
kind: Deployment # Deployment, not StatefulSet: ephemeral is fine for features
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:17-alpine
          env:
            - name: POSTGRES_DB
              value: myapp
            - name: POSTGRES_USER
              value: myapp
            - name: POSTGRES_PASSWORD
              value: feature-local-password # not real creds, feature env only
          ports:
            - containerPort: 5432
      # no persistent volume: data lives only as long as the pod
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
    - port: 5432
# k8s/overlays/feature/ingress.yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: api-feature
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`pr-PRNUMBER.preview.yourteam.dev`)
      kind: Rule
      services:
        - name: api
          port: 80
  tls:
    certResolver: letsencrypt
GitHub Actions: PR Open → Deploy Feature Environment
# .github/workflows/feature-deploy.yml
name: Feature Environment

on:
  pull_request:
    types: [opened, synchronize, reopened]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}/api
  PR_NAMESPACE: feature-pr-${{ github.event.number }}
  PR_HOST: pr-${{ github.event.number }}.preview.yourteam.dev

jobs:
  deploy-feature:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      pull-requests: write # to post a comment with the preview URL
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Generate manifests
        run: |
          cd k8s/overlays/feature
          # replace placeholders
          sed -i "s/feature-pr-PRNUMBER/${{ env.PR_NAMESPACE }}/g" kustomization.yaml namespace.yaml
          sed -i "s/pr-PRNUMBER/pr-${{ github.event.number }}/g" ingress.yaml
          sed -i "s/GITSHA/${{ github.sha }}/g" kustomization.yaml
          # build final manifests
          kustomize build . > /tmp/feature-manifests.yaml

      - name: Deploy to K3s
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          kubectl apply -f /tmp/feature-manifests.yaml
          # wait for the rollout to finish
          kubectl rollout status deployment/api \
            --namespace ${{ env.PR_NAMESPACE }} \
            --timeout=3m

      - name: Seed test data
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          # run a seed job in the namespace
          JOB_NAME=seed-$(date +%s)
          kubectl create job "$JOB_NAME" \
            --namespace ${{ env.PR_NAMESPACE }} \
            --image=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            -- /app/seed --env=feature
          kubectl wait job/"$JOB_NAME" \
            --namespace ${{ env.PR_NAMESPACE }} \
            --for=condition=complete \
            --timeout=2m || true

      - name: Comment preview URL on PR
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `**Preview environment deployed.**\n\n` +
                `URL: https://${{ env.PR_HOST }}\n` +
                `Namespace: \`${{ env.PR_NAMESPACE }}\`\n` +
                `Image: \`${{ github.sha }}\`\n\n` +
                `This environment will be destroyed when the PR is merged or closed.`
            })
GitHub Actions: PR Closed → Destroy Feature Environment
# .github/workflows/feature-cleanup.yml
name: Cleanup Feature Environment

on:
  pull_request:
    types: [closed]

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Delete namespace
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          NAMESPACE=feature-pr-${{ github.event.number }}
          if kubectl get namespace "$NAMESPACE" &>/dev/null; then
            kubectl delete namespace "$NAMESPACE" --timeout=2m
            echo "Deleted namespace: $NAMESPACE"
          else
            echo "Namespace $NAMESPACE not found, skipping"
          fi
Deleting the namespace cascades: all deployments, services, pods, PVCs, secrets, and configmaps inside it are deleted automatically.
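Cleanup can still be missed: the workflow can fail, or a PR can close while the runner is down. A small scheduled sweep that matches the naming convention catches orphans. This is a sketch; the kubectl and gh calls are left as comments because they need cluster and GitHub access, and the script name is hypothetical:

```shell
#!/usr/bin/env bash
# sweep-feature-namespaces.sh: delete feature namespaces whose PR is closed.
set -euo pipefail

# Matches the feature-pr-<number> convention used by the workflows above.
is_feature_ns() {
  [[ "$1" =~ ^feature-pr-[0-9]+$ ]]
}

# Extracts the PR number from a feature namespace name.
pr_number() {
  echo "${1#feature-pr-}"
}

# Cluster/GitHub-dependent part, sketched as comments:
# for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
#   is_feature_ns "$ns" || continue
#   state=$(gh pr view "$(pr_number "$ns")" --json state -q .state)
#   [ "$state" != "OPEN" ] && kubectl delete namespace "$ns" --timeout=2m
# done
```

Run it from a nightly CronJob or a scheduled workflow; because it keys off the namespace naming convention, it never touches qa, staging, or production.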
QA Environment
The QA namespace is stable: it does not get recreated per PR. It is updated automatically when code merges to the main branch. QA engineers own this environment: they run regression suites and exploratory testing, and sign off on features before they go to staging.
Key differences from feature environments:
- External database (persistent PostgreSQL, not ephemeral), shared by QA testers
- Stable test data maintained by QA (not overwritten on every deploy)
- Two replicas for realistic load testing
- Auto-scaling disabled (so load tests are predictable)
- Test results must pass before promotion to staging is allowed
# k8s/overlays/qa/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: qa
resources:
  - ../../base
  - namespace.yaml
  - ingress.yaml
  - hpa.yaml
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 2
    target:
      kind: Deployment
      name: api
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources
        value:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "500m"
    target:
      kind: Deployment
      name: api
images:
  - name: ghcr.io/yourorg/myapp/api
    newTag: "GITSHA"
configMapGenerator:
  - name: api-config
    literals:
      - db-host=postgres.qa.svc.cluster.local # Service fronting the persistent QA database
      - app-env=qa
      - log-level=debug
generatorOptions:
  disableNameSuffixHash: true # CI-created Jobs reference the literal name api-config
# k8s/overlays/qa/ingress.yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: api-qa
  namespace: qa
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`qa.yourteam.dev`)
      kind: Rule
      middlewares:
        - name: basic-auth # the QA endpoint is not public
          namespace: qa
      services:
        - name: api
          port: 80
  tls:
    certResolver: letsencrypt
---
# basic auth: prevent public access to QA
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: basic-auth
  namespace: qa
spec:
  basicAuth:
    secret: qa-basic-auth-secret
Deploy to QA on Merge to Main
# .github/workflows/qa-deploy.yml
name: QA Deploy

on:
  push:
    branches: [main]

jobs:
  deploy-qa:
    runs-on: ubuntu-latest
    environment: qa # GitHub environment with required reviewers if needed
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/api:${{ github.sha }}
            ghcr.io/${{ github.repository }}/api:qa-latest

      - name: Run database migrations (QA)
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          # run migrations as a Job before updating the Deployment
          cat << EOF | kubectl apply -f -
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: migrate-${{ github.sha }}
            namespace: qa
          spec:
            ttlSecondsAfterFinished: 300
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: migrate
                    image: ghcr.io/${{ github.repository }}/api:${{ github.sha }}
                    command: ["/app/migrate", "up"]
                    envFrom:
                      - secretRef:
                          name: api-secret
                      - configMapRef:
                          name: api-config
          EOF
          # wait for the migration to complete
          kubectl wait job/migrate-${{ github.sha }} \
            --namespace qa \
            --for=condition=complete \
            --timeout=5m

      - name: Deploy to QA
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          kubectl set image deployment/api \
            api=ghcr.io/${{ github.repository }}/api:${{ github.sha }} \
            --namespace qa
          kubectl rollout status deployment/api \
            --namespace qa \
            --timeout=5m

      - name: Run smoke tests against QA
        run: |
          # give the deployment time to stabilize
          sleep 15
          # QA sits behind the basic-auth middleware, so authenticate;
          # QA_BASIC_AUTH is a user:password secret matching qa-basic-auth-secret
          curl -sf -u "${{ secrets.QA_BASIC_AUTH }}" https://qa.yourteam.dev/health
          curl -sf -u "${{ secrets.QA_BASIC_AUTH }}" https://qa.yourteam.dev/api/v1/status
          echo "QA smoke tests passed"

      - name: Notify QA team
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "QA deployed: `${{ github.sha }}` - ${{ github.event.head_commit.message }}\nhttps://qa.yourteam.dev"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_QA_WEBHOOK }}
QA Gates
The QA gate is the checkpoint between QA and staging. It must be explicit, not automatic. The process:
1. A QA engineer runs the full regression suite against qa.yourteam.dev
2. The QA engineer approves the GitHub environment (or leaves a comment on the PR used to track the release)
3. The staging deploy workflow requires that approval to proceed
In GitHub, configure the staging environment to require a named reviewer: the QA lead or a QA engineer. This creates a pause in the pipeline that cannot be bypassed by pushing code.
# in the staging deploy workflow
jobs:
  deploy-staging:
    environment:
      name: staging
      url: https://staging.yourteam.dev
    # GitHub will pause here until someone with access to the 'staging'
    # environment approves the deployment in the Actions UI
Staging Environment
Staging is a production mirror. It runs the same replica count, the same resource limits, the same database engine version, the same secrets structure, and the same ingress configuration as production. The only difference is the domain name and the data (anonymized copies of production data, refreshed on a schedule).
This fidelity is the point. A bug that only appears under production-like conditions β a query that is slow on large data sets, a race condition that requires multiple replicas, a timeout that only triggers under real traffic patterns β must surface in staging, not production.
# k8s/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: staging
resources:
  - ../../base
  - namespace.yaml
  - ingress.yaml
  - hpa.yaml
  - pdb.yaml
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 3 # same as production
    target:
      kind: Deployment
      name: api
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources
        value:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
    target:
      kind: Deployment
      name: api
images:
  - name: ghcr.io/yourorg/myapp/api
    newTag: "GITSHA"
configMapGenerator:
  - name: api-config
    literals:
      - db-host=postgres.staging.svc.cluster.local
      - app-env=staging
      - log-level=info # same as production: info, not debug
generatorOptions:
  disableNameSuffixHash: true # CI-created Jobs reference the literal name api-config
# k8s/overlays/staging/pdb.yaml
# PodDisruptionBudget: at least 2 pods must stay running during node maintenance
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: staging
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
# k8s/overlays/staging/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: staging
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Staging Data Refresh
Staging with stale or empty data misses whole categories of bugs. Set up a weekly job that copies and anonymizes production data into staging:
# k8s/overlays/staging/data-refresh-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: staging-data-refresh
  namespace: staging
spec:
  schedule: "0 2 * * 0" # every Sunday at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: staging-data-refresh
          containers:
            - name: refresh
              image: ghcr.io/yourorg/myapp/tools:latest
              command:
                - /app/tools
                - refresh-staging-data
                - --source=production
                - --target=staging
                - --anonymize-pii
              envFrom:
                - secretRef:
                    name: data-refresh-secret
The --anonymize-pii flag is not optional. Copying real user data to a less-controlled environment without anonymization is a GDPR/CCPA violation. The refresh tool must replace emails, names, phone numbers, and any other PII with generated equivalents before writing to staging.
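What "generated equivalents" means in practice: deterministic, clearly fake values derived from the row's primary key, so refreshes are reproducible and foreign keys stay intact. A sketch of the idea; the function names and the psql invocation are illustrative, not the actual /app/tools implementation:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Deterministic fake values derived from a stable id: no PII survives,
# and the same id always maps to the same replacement across refreshes.
anon_email() { echo "user-$1@staging.invalid"; }
anon_phone() { printf '+1555%07d\n' "$(( $1 % 10000000 ))"; }

# Applied in bulk inside the refresh job, e.g. (assumed DSN variable):
# psql "$STAGING_DSN" <<'SQL'
# UPDATE users SET
#   email     = 'user-' || id || '@staging.invalid',
#   full_name = 'User '  || id,
#   phone     = '+1555' || lpad((id % 10000000)::text, 7, '0');
# SQL
```

Using the reserved .invalid TLD guarantees that a staging bug can never email a real user.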
Production Environment
Production has the strictest controls. No developer can deploy to it directly. The only path to production is through the pipeline, and the pipeline requires QA sign-off on staging.
# k8s/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base
  - namespace.yaml
  - ingress.yaml
  - hpa.yaml
  - pdb.yaml
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
    target:
      kind: Deployment
      name: api
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources
        value:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
    target:
      kind: Deployment
      name: api
  - patch: |-
      - op: add
        path: /spec/strategy
        value:
          type: RollingUpdate
          rollingUpdate:
            maxSurge: 1
            maxUnavailable: 0 # zero downtime: all 3 replicas stay available
    target:
      kind: Deployment
      name: api
images:
  - name: ghcr.io/yourorg/myapp/api
    newTag: "GITSHA"
# k8s/overlays/production/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2 # even during a node drain, 2 pods must stay running
  selector:
    matchLabels:
      app: api
Production Deploy Workflow
# .github/workflows/production-deploy.yml
name: Production Deploy

on:
  workflow_dispatch: # manual trigger only: no automatic production deploys
    inputs:
      image_sha:
        description: 'Git SHA of the image to deploy (must be tested in staging)'
        required: true
        type: string

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Log in to GHCR # manifest inspect needs auth for private images
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Verify image exists in registry
        run: |
          docker manifest inspect ghcr.io/${{ github.repository }}/api:${{ inputs.image_sha }}

      - name: Verify image was deployed to staging
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          STAGING_SHA=$(kubectl get deployment api \
            --namespace staging \
            -o jsonpath='{.spec.template.spec.containers[0].image}' \
            | cut -d: -f2)
          if [ "$STAGING_SHA" != "${{ inputs.image_sha }}" ]; then
            echo "ERROR: Image ${{ inputs.image_sha }} was not the last staging deploy."
            echo "Staging is running: $STAGING_SHA"
            echo "Deploy to staging first, validate, then promote to production."
            exit 1
          fi
          echo "Staging validation passed. Image matches."

  deploy-production:
    needs: validate
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://api.yourteam.dev
    # the production environment requires 2 named approvers in GitHub settings
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.image_sha }}

      - name: Run database migrations (production)
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          # migrations run before the rolling update starts
          cat << EOF | kubectl apply -f -
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: migrate-${{ inputs.image_sha }}
            namespace: production
          spec:
            ttlSecondsAfterFinished: 3600
            backoffLimit: 0 # no retries: migration failures must be investigated
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: migrate
                    image: ghcr.io/${{ github.repository }}/api:${{ inputs.image_sha }}
                    command: ["/app/migrate", "up"]
                    envFrom:
                      - secretRef:
                          name: api-secret
                      - configMapRef:
                          name: api-config
          EOF
          kubectl wait job/migrate-${{ inputs.image_sha }} \
            --namespace production \
            --for=condition=complete \
            --timeout=10m
          echo "Migrations complete"

      - name: Deploy to production (rolling update)
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          kubectl set image deployment/api \
            api=ghcr.io/${{ github.repository }}/api:${{ inputs.image_sha }} \
            --namespace production
          kubectl rollout status deployment/api \
            --namespace production \
            --timeout=10m

      - name: Production smoke test
        run: |
          sleep 10
          curl -sf https://api.yourteam.dev/health
          echo "Production smoke test passed"

      - name: Notify on success
        if: success()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployed: `${{ inputs.image_sha }}`\nhttps://api.yourteam.dev"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_PROD_WEBHOOK }}

      - name: Rollback on failure
        if: failure()
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_B64 }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          kubectl rollout undo deployment/api --namespace production
          echo "Rolled back production deployment"
The workflow_dispatch trigger with a required image_sha input means production deployments are always intentional and always reference a specific, validated artifact. Nobody accidentally triggers a production deploy by pushing to a branch.
RBAC: Who Can Touch What
Kubernetes RBAC maps to your team structure. The principle is least privilege: developers can see everything but only modify their own feature namespaces. QA can modify the QA namespace. Only the CI service account can modify staging and production.
# rbac/developer-role.yaml
# Developers: read across all namespaces; exec access is granted separately,
# per feature namespace, so it never applies to staging or production
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: developer
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "services", "jobs", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
---
# Bind the developer ClusterRole for read access across all namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: developers-read
subjects:
  - kind: Group
    name: developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: developer
  apiGroup: rbac.authorization.k8s.io
---
# Exec for debugging: a namespaced Role, created alongside each feature
# namespace, so a cluster-wide binding never grants exec into production
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer-debug
  namespace: feature-pr-PRNUMBER # replaced by CI
rules:
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-debug-binding
  namespace: feature-pr-PRNUMBER # replaced by CI
subjects:
  - kind: Group
    name: developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer-debug
  apiGroup: rbac.authorization.k8s.io
# rbac/qa-role.yaml
# QA: full access to the qa namespace, read elsewhere
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: qa-admin
  namespace: qa
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["*"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: qa-team-binding
  namespace: qa
subjects:
  - kind: Group
    name: qa-engineers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: qa-admin
  apiGroup: rbac.authorization.k8s.io
# rbac/ci-production-role.yaml
# CI service account: can update deployments in staging and production
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-deploy
  namespace: production
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "patch", "list"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deploy-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: github-ci
    namespace: kube-system
roleRef:
  kind: Role
  name: ci-deploy
  apiGroup: rbac.authorization.k8s.io
No developer has write access to the staging or production namespaces. This is enforced at the API level; it cannot be bypassed by running kubectl apply manually.
Database Migrations in the Pipeline
Migrations are the most dangerous part of any deployment. An irreversible migration can leave you unable to roll back if the new code breaks. The safe pattern: every migration must be backward compatible with the previous version of the application code.
The rules:
- Never drop a column or table in the same migration that replaces it. The first deploy adds the new structure and keeps the old; a later deploy removes the old.
- Never add a NOT NULL column without a default. Add it with a default → deploy → backfill → make it non-null in a third migration.
- Never rename a column directly. Add the new column → deploy code that reads from both → migrate the data → remove the old column in a later release.
This allows a safe rollback pattern: if the new deployment breaks, roll back the code but do not roll back the migration. The previous version of the code still works with the new schema because the migration was backward compatible.
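For example, the rename rule plays out across three releases. Illustrative SQL; the table and column names are made up:

```sql
-- Release 1 (expand): old code ignores the new column; new code writes both.
ALTER TABLE users ADD COLUMN full_name text;

-- Release 2 (backfill): runs while both columns are being maintained.
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- Release 3 (contract): only once no deployed code reads the old column.
ALTER TABLE users DROP COLUMN name;
```

At every point in this sequence, both the currently deployed code and the previous release work against the live schema, which is exactly what makes rollback safe.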
# k8s/base/migrations-job.yaml
# applied as a separate Job before kubectl set image
# (the pattern used by the CI workflows above)
apiVersion: batch/v1
kind: Job
metadata:
  name: migrate # name is patched per deploy by kustomize or CI
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 300
  template:
    spec:
      restartPolicy: OnFailure
      initContainers:
        - name: wait-for-db
          image: busybox:1.36
          command: ['sh', '-c', 'until nc -z $DB_HOST 5432; do sleep 2; done']
          env:
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: api-config
                  key: db-host
      containers:
        - name: migrate
          image: ghcr.io/yourorg/myapp/api:latest # patched by CI
          command: ["/app/migrate", "up"]
          envFrom:
            - secretRef:
                name: api-secret
            - configMapRef:
                name: api-config
In the CI pipeline, the sequence is always:
1. Apply the migration Job → wait for completion
2. Only if migrations succeed → kubectl set image (rolling update)
3. If the rolling update fails → kubectl rollout undo (no migration rollback needed; the schema is backward compatible)
GitOps Promotion with Flux
For teams that want Git as the single source of truth for every environment β not just production β Flux can manage the full promotion chain. Each environment has its own Kustomization resource that watches a different path or branch.
Git repository structure for GitOps:
infrastructure/
└── clusters/
    └── team-cluster/
        ├── flux-system/      # Flux bootstrap manifests
        ├── qa.yaml           # Kustomization watching overlays/qa
        ├── staging.yaml      # Kustomization watching overlays/staging
        └── production.yaml   # Kustomization watching overlays/production
# infrastructure/clusters/team-cluster/qa.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: myapp-qa
  namespace: flux-system
spec:
  interval: 2m
  path: ./k8s/overlays/qa
  prune: true
  sourceRef:
    kind: GitRepository
    name: myapp
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: api
      namespace: qa
  postBuild:
    substitute:
      ENVIRONMENT: qa
# infrastructure/clusters/team-cluster/production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: myapp-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./k8s/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: myapp
  suspend: false # set to true to pause auto-sync during incidents
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: api
      namespace: production
  timeout: 10m
With this setup, promoting from staging to production becomes a Git operation: update the image tag in k8s/overlays/production/kustomization.yaml via a PR, get it reviewed and merged, and Flux applies it. The audit trail is the Git history.
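That Git operation is easy to wrap in a helper so nobody edits YAML by hand. A sketch; the script name, branch convention, and gh usage are assumptions, and `kustomize edit set image` is the standard way to bump the tag:

```shell
#!/usr/bin/env bash
# promote.sh: open a promotion PR for a staging-validated SHA (hypothetical helper).
set -euo pipefail

# Branch name for promoting a given SHA; the short SHA keeps it readable.
promote_branch() {
  echo "promote-production-${1:0:7}"
}

# Repo-dependent steps, sketched as comments:
# SHA="$1"; BRANCH="$(promote_branch "$SHA")"
# git checkout -b "$BRANCH"
# (cd k8s/overlays/production && kustomize edit set image "ghcr.io/yourorg/myapp/api:${SHA}")
# git commit -am "Promote ${SHA:0:7} to production"
# git push -u origin "$BRANCH"
# gh pr create --title "Promote ${SHA:0:7} to production" --body "Validated in staging."
```

The PR review on this one-line diff is the production approval; Flux does the rest once it merges.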
Observability Per Environment
Each environment needs metrics and logs, but the alerting rules differ. Feature environments generate noise if you alert on them. Production alerts must wake someone up.
# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
spec:
  groups:
    - name: api.production
      interval: 30s
      rules:
        # this alert only fires for the production namespace
        - alert: ProductionHighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="production", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="production"}[5m])) > 0.05
          for: 2m
          labels:
            severity: critical
            environment: production
          annotations:
            summary: "Production error rate above 5%"
            runbook: "https://wiki.yourteam.dev/runbooks/high-error-rate"
        - alert: ProductionPodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace="production"}[15m]) > 3
          for: 5m
          labels:
            severity: critical
            environment: production
    - name: api.staging
      rules:
        # staging alerts: warning only, no paging
        - alert: StagingHighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="staging", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="staging"}[5m])) > 0.10
          for: 5m
          labels:
            severity: warning
            environment: staging
In Grafana, create separate dashboards per environment using namespace as a variable:
Dashboard variable: namespace
Values: production, staging, qa
Default: production
Panel query:
rate(http_requests_total{namespace="$namespace"}[5m])
One dashboard, every environment, one dropdown to switch between them. QA and developers can check their environment's health without touching production dashboards.
The Full Feature Lifecycle
Putting it all together, the path a feature takes from code to production:
1. Developer creates a feature branch from main
   → PR opened
   → GitHub Actions builds the image, creates namespace feature-pr-142
   → Ephemeral PostgreSQL deployed, test data seeded
   → Preview URL posted as a PR comment: https://pr-142.preview.yourteam.dev
   → Developer iterates, pushes commits, CI rebuilds and redeploys
2. PR review
   → Code review by peers
   → Developer can share the preview URL with stakeholders for early feedback
   → Tests run in CI (unit, integration) and must pass before merge
3. PR merged to main
   → Namespace feature-pr-142 deleted (all resources cascade)
   → CI builds the final image for the commit SHA
   → Image pushed to GHCR with the SHA tag
   → QA deploy workflow triggers
   → Migrations run against the QA database
   → QA namespace updated to the new image
   → Slack notification to the QA channel
4. QA validation
   → QA engineer tests qa.yourteam.dev
   → Runs the regression suite
   → Files bugs as issues if found (back to step 1 for fixes)
   → When satisfied, approves the GitHub 'staging' environment
5. Staging deploy (requires QA approval)
   → Pipeline resumes after approval
   → Migrations run against the staging database
   → Staging namespace updated
   → Automated smoke tests run
   → Slack notification to the team channel
   → Release manager validates staging.yourteam.dev
6. Production deploy (manual, requires an image SHA)
   → Release manager triggers the production workflow manually
   → Specifies the image SHA validated in staging
   → Pipeline validates that the SHA matches the staging deployment
   → GitHub 'production' environment requires 2 approvers
   → After approval: migrations run, rolling update begins
   → Zero downtime: maxUnavailable=0 keeps the full replica count available
   → Smoke tests verify the production endpoint
   → Team watches error rates for 30 minutes post-deploy
   → If the error rate spikes: roll back with kubectl rollout undo
This model (ephemeral feature environments, stable QA, production-mirror staging, locked-down production) is not about process for its own sake. It exists because the cost of a production bug is orders of magnitude higher than the cost of finding it in QA or staging. Every gate is a risk-reduction mechanism. Every piece of automation removes a manual step that would otherwise be skipped under deadline pressure.
The best time to find a bug is before it reaches production. The environment model exists to make that possible every time, not just when someone remembers to test.