K3s DevOps: IaC, Secrets, SLOs, Security, and Disaster Recovery

Deep dive into K3s from the DevOps operator perspective: Terraform provisioning, secrets management, SLO/error budgets, CIS hardening, and DR with real RTO/RPO targets.

By Omar Flores

Infrastructure is a product. It has users (your engineering team), uptime requirements (your SLAs), a security posture (your compliance obligations), and a lifecycle (your upgrade and DR plan). Most Kubernetes guides stop at deploying a workload. This one starts where they stop.

The previous posts in this series covered K3s fundamentals and agile team environment lifecycles. This post focuses on what the DevOps engineer actually owns after the cluster is running: provisioning it repeatably, managing secrets safely, defining and defending service levels, hardening the security posture, and making sure a failure does not become a disaster.

Every section is written for the person who has to answer to an incident postmortem, not just the person who wants to get something running.


Infrastructure as Code: Provisioning K3s with Terraform

A cluster you provisioned by hand is a cluster you cannot reproduce under pressure. When the VPS burns at 2 AM on a Saturday, the question is not "how do I install K3s" but "how fast can I get the identical cluster back." The answer is Terraform.

The pattern is a layered stack: Terraform provisions the machines and network, a cloud-init script installs K3s, and Ansible handles post-installation configuration (kubeconfig download, certificate rotation, firewall rules). Each layer is independently testable and replaceable.
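The post-installation layer can start very small. As a sketch of the kubeconfig-download step (the helper function and the example IP are hypothetical; k3s writes its kubeconfig with the server address 127.0.0.1, so a downloaded copy must be rewritten before it is usable remotely):

```shell
# rewrite a downloaded k3s kubeconfig to point at the node's public address
rewrite_kubeconfig() {
  sed -i "s/127\.0\.0\.1/$2/" "$1"   # GNU sed, as on the Ubuntu hosts above
}

# typical flow from a workstation or CI runner:
#   scp root@203.0.113.10:/etc/rancher/k3s/k3s.yaml ~/.kube/k3s-production.yaml
#   rewrite_kubeconfig ~/.kube/k3s-production.yaml 203.0.113.10
#   KUBECONFIG=~/.kube/k3s-production.yaml kubectl get nodes
```

Ansible can wrap the same steps in a playbook; the mechanics are identical.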

Terraform Module Structure

# main.tf -- provisions a single-node K3s server on Hetzner Cloud
terraform {
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "~> 1.45"
    }
  }
  backend "s3" {
    bucket = "your-tfstate-bucket"
    key    = "k3s/production/terraform.tfstate"
    region = "eu-central-1"
    # use a real backend -- never commit state to git
  }
}

resource "hcloud_server" "k3s_server" {
  name         = "k3s-${var.environment}"
  server_type  = var.server_type   # cx22 for staging, cx32 for production
  image        = "ubuntu-24.04"
  location     = var.location
  ssh_keys     = [hcloud_ssh_key.deploy.id]
  firewall_ids = [hcloud_firewall.k3s.id]
  user_data    = templatefile("${path.module}/cloud-init.yaml.tpl", {
    k3s_version = var.k3s_version
    k3s_token   = random_password.k3s_token.result
    tls_san     = var.tls_san    # address clients will use for the API
    environment = var.environment
    extra_args  = var.k3s_extra_args
  })

  lifecycle {
    # prevent_destroy must be a literal -- Terraform rejects expressions in
    # lifecycle meta-arguments, so flip it per environment in the root module
    prevent_destroy = true
  }
}

resource "hcloud_firewall" "k3s" {
  name = "k3s-${var.environment}"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "6443"
    source_ips = var.allowed_cidr_blocks   # only your CI runners and VPN
  }
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "80"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "443"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
}

resource "random_password" "k3s_token" {
  length  = 64
  special = false
}

output "k3s_token" {
  value     = random_password.k3s_token.result
  sensitive = true
}

output "server_ipv4" {
  value = hcloud_server.k3s_server.ipv4_address
}

# cloud-init.yaml.tpl -- runs once on first boot
#cloud-config
package_update: true
packages:
  - curl
  - jq
  - fail2ban
  - ufw

write_files:
  - path: /etc/rancher/k3s/config.yaml
    content: |
      token: "${k3s_token}"
      tls-san:
        - "${tls_san}"   # passed from Terraform: the address clients will use
      disable:
        - traefik       # we install it separately via Helm for version control
      kube-apiserver-arg:
        - "audit-log-path=/var/log/k3s-audit.log"
        - "audit-log-maxage=30"
        - "audit-log-maxbackup=3"
        - "audit-log-maxsize=100"
        - "audit-policy-file=/etc/rancher/k3s/audit-policy.yaml"
      kubelet-arg:
        - "protect-kernel-defaults=true"
        - "event-qps=0"

runcmd:
  - ufw default deny incoming
  - ufw default allow outgoing
  - ufw allow 22/tcp
  - ufw allow 80/tcp
  - ufw allow 443/tcp
  - ufw allow 6443/tcp    # the Hetzner firewall above still limits 6443 to allowed CIDRs
  - ufw --force enable
  - curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="${k3s_version}" sh -
  - systemctl enable k3s

Variables and Environments

# variables.tf
variable "environment" {
  type        = string
  description = "staging or production"
  validation {
    condition     = contains(["staging", "production"], var.environment)
    error_message = "Environment must be staging or production."
  }
}

variable "k3s_version" {
  type    = string
  default = "v1.29.3+k3s1"
  # pin the version β€” never use 'latest' in production
}

variable "k3s_extra_args" {
  type    = list(string)
  default = []
}

variable "server_type" {
  type    = string
  default = "cx22"
}

variable "allowed_cidr_blocks" {
  type      = list(string)
  sensitive = true
}

variable "tls_san" {
  type        = string
  description = "Public IP or DNS name added to the K3s API server certificate"
}

variable "location" {
  type    = string
  default = "fsn1"   # any Hetzner location works; fsn1 is an example
}

The prevent_destroy lifecycle rule forces Terraform to error if someone tries to destroy and recreate the production server. Note that Terraform only accepts a literal value there, not an expression, so the flag is set per environment rather than derived from a variable. Rebuilding production requires removing it explicitly: an intentional friction that prevents accidents.


Secrets Management

Secrets in Kubernetes have a fundamental problem: a Secret resource is base64-encoded, not encrypted. Anyone with read access to the namespace can decode it in one command. Committing the raw YAML to git is equivalent to committing plain-text passwords.
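To make the point concrete: base64 is an encoding, reversible with one command, not encryption.

```shell
# what the Secret manifest stores, and how trivially it reverses
printf %s supersecret | base64             # prints c3VwZXJzZWNyZXQ=
printf %s c3VwZXJzZWNyZXQ= | base64 -d     # prints supersecret
```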

There are two production-grade solutions. The right choice depends on whether you control your secret store or delegate it to a cloud provider.

Option 1: Sealed Secrets (self-hosted)

Sealed Secrets encrypts your secret with a public key that only the controller in your cluster can decrypt. You commit the encrypted SealedSecret to git safely.

# install the controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets \
  --namespace kube-system \
  --set fullnameOverride=sealed-secrets-controller

# install kubeseal CLI
brew install kubeseal  # or download the binary

# fetch the cluster public key
kubeseal --fetch-cert \
  --controller-name=sealed-secrets-controller \
  --controller-namespace=kube-system \
  > pub-sealed-secrets.pem

# seal a secret -- the output is safe to commit
kubectl create secret generic db-credentials \
  --from-literal=password=supersecret \
  --dry-run=client \
  -o yaml \
  | kubeseal \
    --cert pub-sealed-secrets.pem \
    --format yaml \
  > k8s/base/db-credentials-sealed.yaml

The resulting SealedSecret manifest looks like this:

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  encryptedData:
    password: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
  template:
    metadata:
      name: db-credentials
      namespace: production
    type: Opaque

When the controller sees this resource, it decrypts it and creates the corresponding Secret in the cluster. The plaintext never touches git.

Key rotation: when you need to rotate the sealing key itself (e.g., after a security incident), generate a new key pair, re-encrypt all secrets, and replace the controller key. The old key can be kept for decryption of previously-sealed secrets during the transition.

# rotate the sealing key -- the key secret has a generated name, so select
# it by label rather than by name
kubectl -n kube-system delete secret \
  -l sealedsecrets.bitnami.com/sealed-secrets-key=active
# the controller generates a new key pair on restart
kubectl -n kube-system rollout restart deployment sealed-secrets-controller
# fetch the new public key and re-seal all secrets

Option 2: External Secrets Operator (ESO)

ESO pulls secrets from an external store (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault, GCP Secret Manager) and creates Kubernetes Secret objects in the cluster. The external store is the source of truth. No secrets ever live in git.

helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets \
  --create-namespace

# SecretStore -- connects ESO to AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: production
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        secretRef:
          accessKeyIDSecretRef:
            name: aws-credentials
            key: access-key-id
          secretAccessKeySecretRef:
            name: aws-credentials
            key: secret-access-key
---
# ExternalSecret -- pulls a specific secret from AWS and creates a K8s Secret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: db-credentials
    creationPolicy: Owner
  data:
    - secretKey: password
      remoteRef:
        key: production/db-credentials
        property: password

ESO re-syncs the secret every refreshInterval. If the value changes in the external store, the Kubernetes Secret is updated automatically. Pods that mount secrets as volumes see the update within the kubelet's sync period (default 60s). Pods that use environment variables need a rollout.

When to choose which:

  • Sealed Secrets: small team, no cloud provider dependency, simple rotation cadence
  • ESO: regulated environment, existing Vault/Secrets Manager investment, automated secret rotation required

SLO, SLI, and Error Budgets

An SLO (Service Level Objective) is a commitment: "99.5% of HTTP requests return 2xx within 500ms, measured over a 30-day rolling window." It is not a target you set once and forget; it is the number your team uses to decide whether to ship a risky change or spend the next sprint on reliability.

The SLI (Service Level Indicator) is how you measure it. The error budget is what you have left to spend before you breach the SLO.
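The arithmetic is worth doing once by hand. For a 99.5% SLO over 30 days, the budget expressed as hours of full outage is:

```shell
# error budget = (1 - SLO) * window, here 0.5% of 30 days
awk 'BEGIN { slo = 0.995; printf "%.1f hours\n", (1 - slo) * 30 * 24 }'
# prints 3.6 hours
```

That 3.6 hours is the number the rest of this section budgets against.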

Defining SLIs in Prometheus

# PrometheusRule -- defines the recording rules and alerting rules for your SLO
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-slo
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo.api.availability
      interval: 30s
      rules:
        # SLI: ratio of successful requests over total requests
        - record: job:http_requests_total:rate5m
          expr: rate(http_requests_total[5m])

        - record: job:http_request_errors:rate5m
          expr: rate(http_requests_total{status=~"5.."}[5m])

        - record: job:http_availability:ratio5m
          expr: |
            1 - (
              sum(job:http_request_errors:rate5m)
              /
              sum(job:http_requests_total:rate5m)
            )

        # longer windows, needed by the burn-rate alerts below
        # (avg_over_time is a time-weighted approximation; close enough
        # when the request rate is roughly steady)
        - record: job:http_availability:ratio30m
          expr: avg_over_time(job:http_availability:ratio5m[30m])

        - record: job:http_availability:ratio1h
          expr: avg_over_time(job:http_availability:ratio5m[1h])

        - record: job:http_availability:ratio6h
          expr: avg_over_time(job:http_availability:ratio5m[6h])

        # 30-day availability ratio (used for error budget calculation)
        - record: job:http_availability:ratio30d
          expr: avg_over_time(job:http_availability:ratio5m[30d])

    - name: slo.api.latency
      interval: 30s
      rules:
        # SLI: ratio of requests completing under 500ms
        - record: job:http_latency_fast:ratio5m
          expr: |
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
            /
            sum(rate(http_request_duration_seconds_count[5m]))

    - name: slo.api.alerts
      rules:
        # Multi-window burn rate alert -- fires when error budget burns too fast
        - alert: SLOErrorBudgetBurnRateHigh
          expr: |
            (
              job:http_availability:ratio5m < (1 - 14.4 * (1 - 0.995))
              and
              job:http_availability:ratio1h < (1 - 14.4 * (1 - 0.995))
            )
          for: 2m
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "API error budget burning at 14.4x rate -- a 30-day budget gone in ~50 hours at this pace"
            runbook: "https://wiki.internal/runbooks/api-slo-burn"

        - alert: SLOErrorBudgetBurnRateMedium
          expr: |
            (
              job:http_availability:ratio30m < (1 - 6 * (1 - 0.995))
              and
              job:http_availability:ratio6h < (1 - 6 * (1 - 0.995))
            )
          for: 15m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "API error budget burning at 6x rate -- investigate before the weekend"

The burn rate multipliers (14.4x and 6x) come from the Google SRE Workbook. A 14.4x burn rate means the full monthly error budget will be exhausted in about 50 hours (30 days divided by 14.4) if it continues. This is the threshold for a critical page: wake someone up.
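The multiplier converts directly into time-to-exhaustion, so the 14.4x figure is easy to sanity-check:

```shell
# a burn rate of B exhausts a 30-day budget in 30*24/B hours
awk 'BEGIN { printf "%.0f hours\n", 30 * 24 / 14.4 }'
# prints 50 hours
```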

Error Budget Policy

The error budget is not just a metric. It is a decision-making framework. Document it explicitly:

## API Service Error Budget Policy

SLO: 99.5% availability over 30 days
Error budget: 0.5% of requests = ~3.6 hours of full outage per month

### When budget is > 50% remaining
- Normal feature development and deployments permitted
- Risky infrastructure changes permitted with review

### When budget is 25–50% remaining
- Feature freeze on changes that affect request path
- All deployments require two approvers
- On-call rotation increased to 30-minute response SLA

### When budget is < 25% remaining
- Full feature freeze
- Engineering focus shifts to reliability exclusively
- No infrastructure changes without incident commander approval

### When budget is exhausted
- Production deployments suspended until budget recovers
- Post-incident review required before resuming normal operations

Grafana Dashboard for Error Budget

{
  "title": "SLO Error Budget",
  "panels": [
    {
      "title": "30-day Availability",
      "type": "stat",
      "targets": [
        {
          "expr": "job:http_availability:ratio30d * 100",
          "legendFormat": "Availability %"
        }
      ],
      "thresholds": {
        "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 99},
          {"color": "green", "value": 99.5}
        ]
      }
    },
    {
      "title": "Error Budget Remaining",
      "type": "gauge",
      "targets": [
        {
          "expr": "(job:http_availability:ratio30d - 0.995) / (1 - 0.995) * 100",
          "legendFormat": "Budget %"
        }
      ]
    }
  ]
}
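The gauge expression can be sanity-checked with made-up numbers: 99.8% measured availability against a 99.5% SLO should leave 60% of the budget.

```shell
# (measured - slo) / (1 - slo) * 100, same formula as the gauge panel
awk 'BEGIN { slo = 0.995; measured = 0.998; printf "%.0f%%\n", (measured - slo) / (1 - slo) * 100 }'
# prints 60%
```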

Security Hardening

A default K3s installation is functional but not hardened. The distance between "works" and "secure" is exactly where breaches happen. Security hardening is not a one-time task; it is a set of controls you implement, test, and maintain.

Pod Security Standards

Kubernetes 1.25+ replaced PodSecurityPolicies with Pod Security Standards. Enforce the restricted profile on production namespaces and baseline on everything else.

# Apply labels to enforce Pod Security Standards per namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest

What restricted enforces:

  • No privileged containers
  • No privilege escalation (allowPrivilegeEscalation: false)
  • Containers must run as non-root
  • All capabilities dropped; only NET_BIND_SERVICE may be added back
  • Seccomp profile must be RuntimeDefault or Localhost
  • Only a restricted set of volume types (configMap, secret, emptyDir, projected, PVC, and a few others)

A read-only root filesystem is not actually required by the profile, but it is cheap hardening and the deployment below sets it anyway.

A deployment that passes restricted validation looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: api
          image: ghcr.io/your-org/api:sha-abc123
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: tmp
              mountPath: /tmp      # writable scratch dir (root filesystem is read-only)
            - name: cache
              mountPath: /app/cache
      volumes:
        - name: tmp
          emptyDir: {}
        - name: cache
          emptyDir: {}

Network Policies

By default, every pod in a Kubernetes cluster can reach every other pod across all namespaces. This is wrong for production. Define an explicit allow-list using NetworkPolicy.

# Default deny all ingress and egress in production
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow ingress to the API only from Traefik
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-from-traefik
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              app.kubernetes.io/name: traefik
      ports:
        - port: 8080
---
# Allow the API to reach PostgreSQL
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-postgres
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - port: 5432
---
# Allow DNS resolution for all pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
---
# Allow the API to reach external services (HTTPS only)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-egress-https
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Egress
  egress:
    - ports:
        - port: 443

Audit Logging

The audit policy in the cloud-init config above is incomplete without the actual policy file. Define it explicitly; it is the record you need after a breach.

# /etc/rancher/k3s/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log all requests to secrets -- full body on creation/update
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["secrets"]
    verbs: ["create", "update", "patch", "delete"]

  # Log exec and port-forward operations -- high privilege, high risk
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec", "pods/portforward", "pods/attach"]

  # Log RBAC changes
  - level: RequestResponse
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]

  # Log requests made without credentials (they arrive as system:anonymous)
  - level: Metadata
    omitStages:
      - RequestReceived
    users: ["system:anonymous"]

  # Reduce noise from read-only operations on common resources
  - level: None
    resources:
      - group: ""
        resources: ["configmaps", "endpoints", "services"]
    verbs: ["get", "list", "watch"]

  # Default: log metadata for everything else
  - level: Metadata
    omitStages:
      - RequestReceived
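Because each audit event is a single JSON line, the log stays grep-able even without jq. A sketch against a captured sample line (the helper is hypothetical; the field names follow the audit event schema):

```shell
# filter audit events by resource name
audit_grep() { grep "\"resource\":\"$1\"" "$2"; }

# a sample event of the shape the policy above produces
printf '%s\n' '{"verb":"create","objectRef":{"resource":"secrets","namespace":"production"}}' > /tmp/audit-sample.log
audit_grep secrets /tmp/audit-sample.log
# prints the matching line
```

For real triage, ship the audit log into the same Loki pipeline as application logs.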

Image Security

Never pull :latest. Pin to a digest or a SHA-tagged image. Use a policy engine to enforce this.

# Kyverno ClusterPolicy -- requires an immutable sha- tag in production and staging
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-immutable-image-tag
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: check-image-tag
      match:
        any:
          - resources:
              kinds: ["Pod"]
              namespaces: ["production", "staging"]
      validate:
        message: "Production images must use an immutable sha- tag, not :latest or another mutable tag."
        pattern:
          spec:
            containers:
              - image: "*:sha-*"

Install Kyverno via Helm:

helm repo add kyverno https://kyverno.github.io/kyverno/
helm install kyverno kyverno/kyverno \
  --namespace kyverno \
  --create-namespace \
  --set replicaCount=1   # single replica for K3s; use 3 for HA

CIS Benchmark Validation

Run kube-bench against the cluster to measure CIS compliance. This is the tool auditors expect to see results from.

kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl wait --for=condition=complete job/kube-bench --timeout=120s
kubectl logs job/kube-bench

Key checks that K3s fails by default and how to fix them:

| Check | Default State | Fix |
|---|---|---|
| API server audit logging | Disabled | Add audit-log-path to k3s config |
| Anonymous authentication | Enabled | Add --anonymous-auth=false kubelet arg |
| Read-only port | 10255 open | Add --read-only-port=0 kubelet arg |
| Protect kernel defaults | Not set | Add --protect-kernel-defaults=true kubelet arg |
| Event rate limiting | Not set | Add --event-qps=0 kubelet arg |

Resource Optimization

Kubernetes gives you the tools to describe resource requirements. Most teams skip this step. The result is either a cluster where pods starve each other during peak load, or a cluster where you pay for three times the capacity you actually need.

Requests vs Limits

The critical distinction: requests determines scheduling (where the pod lands). limits determines enforcement (what happens when it exceeds the threshold). A pod with limits.cpu: 500m and requests.cpu: 100m can burst to 500m on a node with spare capacity, but is guaranteed 100m.

resources:
  requests:
    cpu: "100m"      # guaranteed allocation for scheduling
    memory: "128Mi"  # guaranteed allocation for scheduling
  limits:
    cpu: "500m"      # max burst -- throttled if exceeded (not killed)
    memory: "256Mi"  # hard limit -- OOMKilled if exceeded

Never set limits.memory below requests.memory; the API server rejects it. Never omit requests: without them the scheduler has no data and bin-packs pods blindly.
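The limit-not-below-request invariant is easy to enforce in CI before a manifest ever reaches the cluster. A minimal sketch (the helper is an assumption, and it only handles binary suffixes; Kubernetes also accepts decimal suffixes like M and G):

```shell
# convert binary-suffix memory quantities to bytes so they can be compared
mem_to_bytes() {
  case $1 in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *)   echo "$1" ;;                # plain bytes
  esac
}

req=$(mem_to_bytes 128Mi)
lim=$(mem_to_bytes 256Mi)
[ "$lim" -ge "$req" ] && echo "ok: limit >= request"
# prints ok: limit >= request
```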

VPA for Automatic Tuning

The Vertical Pod Autoscaler observes actual resource usage and recommends (or applies) better requests values. Use it in Off mode first to get recommendations without side effects.

# install VPA -- the project ships an install script, not a single manifest
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommend only -- do not mutate pods automatically
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "50m"
          memory: "64Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "1Gi"

After a week of observations, check recommendations:

kubectl describe vpa api-vpa -n production
# Look for "Recommendation:" section
# Lower bound = minimum safe, Target = recommended, Upper bound = spike headroom

Apply the recommended values to your Deployment manifest. Re-run the VPA in Off mode after each significant traffic change.

LimitRange and ResourceQuota per Namespace

Define defaults so that any pod without explicit requests still gets reasonable values:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:
        cpu: "200m"
        memory: "256Mi"
      defaultRequest:
        cpu: "50m"
        memory: "64Mi"
      max:
        cpu: "2000m"
        memory: "2Gi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "8"
    requests.memory: "16Gi"
    limits.cpu: "16"
    limits.memory: "32Gi"
    pods: "50"
    services: "20"
    persistentvolumeclaims: "10"

Disaster Recovery

Disaster recovery is the practice of defining what you can survive and building the systems that let you survive it. The two numbers that matter are RTO (Recovery Time Objective: how long you can be down) and RPO (Recovery Point Objective: how much data you can lose).

Define these before building the DR system, not after. "We'll figure it out" is not a DR plan.

State Inventory

Before you can recover, you need to know what state exists and where it lives:

| State | Location | Backup Strategy | RPO |
|---|---|---|---|
| Cluster config (CRDs, RBAC, deployments) | etcd / SQLite | etcdctl snapshot or SQLite backup | 1 hour |
| Application database | PostgreSQL StatefulSet or external | pg_dump + WAL streaming | 5 minutes |
| Uploaded files / object store | S3 or Longhorn PVC | Cross-region replication | 15 minutes |
| Secrets | Sealed Secrets in git | Git history | Immediate |
| Container images | ghcr.io / Docker Hub | Pulled fresh from registry | Immediate |

K3s Datastore Backup

For single-node K3s (SQLite), the datastore is a file at /var/lib/rancher/k3s/server/db/state.db. Back it up with a systemd timer:

# /etc/systemd/system/k3s-backup.service
[Unit]
Description=K3s SQLite Backup
After=k3s.service

[Service]
Type=oneshot
User=root
ExecStart=/usr/local/bin/k3s-backup.sh

# /usr/local/bin/k3s-backup.sh
#!/bin/bash
set -euo pipefail

BACKUP_DIR="/opt/k3s-backups"
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="$BACKUP_DIR/k3s-state-$DATE.db"

mkdir -p "$BACKUP_DIR"

# take a consistent snapshot via sqlite's online backup API -- no need to
# stop k3s every 30 minutes (requires the sqlite3 CLI: apt-get install -y sqlite3)
sqlite3 /var/lib/rancher/k3s/server/db/state.db ".backup '$BACKUP_FILE'"

# compress
gzip "$BACKUP_FILE"

# upload to S3 (requires aws CLI configured)
aws s3 cp "$BACKUP_FILE.gz" "s3://your-backup-bucket/k3s/$(hostname)/$DATE.db.gz"

# keep only last 7 days locally
find "$BACKUP_DIR" -name "*.gz" -mtime +7 -delete

# /etc/systemd/system/k3s-backup.timer
[Unit]
Description=K3s SQLite Backup Timer

[Timer]
OnCalendar=*:0/30     # every 30 minutes
Persistent=true

[Install]
WantedBy=timers.target

For HA K3s with embedded etcd, use the built-in snapshot:

# manual snapshot
k3s etcd-snapshot save --name pre-upgrade-$(date +%Y%m%d)

# list snapshots
k3s etcd-snapshot list

# restore from snapshot (cluster must be stopped)
systemctl stop k3s
k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/pre-upgrade-20260306.db
systemctl start k3s

Configure automatic etcd snapshots in k3s config:

# /etc/rancher/k3s/config.yaml
etcd-snapshot-schedule-cron: "*/30 * * * *"
etcd-snapshot-retention: 96      # keep 96 snapshots = 2 days at 30-minute intervals
etcd-snapshot-dir: /opt/k3s-snapshots

Application Database DR

PostgreSQL WAL archiving gives you point-in-time recovery to within seconds. With pgBackRest or WAL-G:

# pgBackRest backup from within the postgres pod
kubectl exec -it postgres-0 -n production -- \
  pgbackrest --stanza=main backup --type=full

# WAL-G continuous archiving setup (add to postgresql.conf)
# archive_mode = on
# archive_command = 'wal-g wal-push %p'
# restore_command = 'wal-g wal-fetch %f %p'

Cluster Restore Runbook

Document the sequence before you need it. The runbook is the artifact that makes the difference between a 30-minute recovery and a 4-hour recovery.

## K3s Cluster Restore Runbook

### Preconditions
- Access to: S3 backup bucket, DNS provider, new VPS
- Required secrets: K3s token (in Vault), DB password (in Vault), TLS cert private key

### Step 1: Provision new node (Terraform)
```bash
cd infrastructure/terraform
terraform apply -var="environment=production" -target=hcloud_server.k3s_server
# wait for cloud-init to complete: ssh root@<new-ip> journalctl -f -u cloud-final

```

### Step 2: Restore K3s datastore
```bash
# download the latest backup
aws s3 cp s3://your-backup-bucket/k3s/latest.db.gz /tmp/
gunzip /tmp/latest.db.gz
systemctl stop k3s
cp /tmp/latest.db /var/lib/rancher/k3s/server/db/state.db
systemctl start k3s
```

### Step 3: Verify cluster
```bash
kubectl get nodes     # should show Ready
kubectl get pods -A   # should show pods recovering
```

### Step 4: Verify application
Run smoke tests directly against the new server IP before cutting DNS over.

### Step 5: Update DNS
Point your domain A record to the new VPS IP. TTL should be 60s in production (set it before an incident, not during).

Expected RTO: 25 minutes

Testing DR

A DR plan you have never tested is a DR plan that will fail. Run a full DR drill quarterly:

# simulate a full cluster loss on staging
terraform destroy -target=hcloud_server.k3s_server -var="environment=staging"

# start the timer
# run the runbook from step 1
# measure actual RTO vs target RTO

# document in the incident log what was slower than expected
# update the runbook before the next quarter

Upgrade Strategy

K3s releases follow Kubernetes upstream with a 2-4 week lag. Running more than 2 minor versions behind means you are missing security patches. Running on a version the K3s team no longer supports means you are on your own.

The system-upgrade-controller automates rolling upgrades across your nodes:

# Plan CRD -- upgrades server nodes to a specific K3s version
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server-upgrade
  namespace: system-upgrade
spec:
  concurrency: 1        # upgrade one node at a time
  cordon: true          # cordon node before upgrade, uncordon after
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  channel: https://update.k3s.io/v1-release/channels/stable
  # or pin to a version:
  # version: v1.29.3+k3s1
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      effect: NoSchedule
      operator: Exists

Before any cluster upgrade:

  1. Take a datastore snapshot
  2. Check the Kubernetes changelog for API deprecations
  3. Validate your Helm chart versions against the new API versions
  4. Run the upgrade on staging first and let it soak for 48 hours
  5. Apply to production during low-traffic window with a rollback plan documented

Observability as Infrastructure

The observability stack is not optional infrastructure. It is the system that tells you whether your SLOs are being met, whether your security policies are working, and whether you have enough capacity. It deserves the same reliability and operational rigor as the applications it monitors.

Structured Logging Pipeline

Applications should emit structured JSON logs. The collection pipeline (Promtail → Loki → Grafana) should have its own resource quotas and backup.
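The contract amounts to one JSON object per line on stdout. The field names here are examples, not a required schema:

```shell
# what the collector should see from the application
echo '{"ts":"2026-03-06T03:12:45Z","level":"error","msg":"db timeout","trace_id":"abc123","duration_ms":512}'
# in Grafana, LogQL can then parse and filter on fields, e.g.:
#   {namespace="production", app="api"} | json | level="error"
```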

# DaemonSet for Promtail -- collects logs from all nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    spec:
      serviceAccountName: promtail
      tolerations:
        - effect: NoSchedule
          operator: Exists
      containers:
        - name: promtail
          image: grafana/promtail:2.9.4
          args:
            - -config.file=/etc/promtail/promtail.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: varlog
              mountPath: /var/log
              readOnly: true
              # K3s runs containerd, not Docker: pod logs live under
              # /var/log/pods (with symlinks in /var/log/containers), so the
              # /var/log mount above covers them
          resources:
            requests:
              cpu: "50m"
              memory: "64Mi"
            limits:
              cpu: "200m"
              memory: "128Mi"
      volumes:
        - name: config
          configMap:
            name: promtail-config
        - name: varlog
          hostPath:
            path: /var/log

Alertmanager Routing

Alerts that go to the wrong channel get ignored. Route by severity and namespace:

# alertmanager.yaml -- route critical production alerts to PagerDuty, the rest to Slack
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "namespace", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: "slack-default"
  routes:
    - match:
        severity: critical
        namespace: production
      receiver: "pagerduty-production"
      continue: true   # also send to Slack
    - match:
        severity: critical
        namespace: staging
      receiver: "slack-oncall"
    - match:
        severity: warning
      receiver: "slack-alerts"

receivers:
  - name: "pagerduty-production"
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_KEY}"   # substituted at deploy time; Alertmanager does not expand env vars itself
        description: '{{ template "pagerduty.default.description" . }}'
  - name: "slack-oncall"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"
        channel: "#oncall"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\nRunbook: {{ .Annotations.runbook }}{{ end }}"
  - name: "slack-alerts"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"
        channel: "#alerts"

The Infrastructure Mindset

Kubernetes clusters are not pets. The goal of every practice in this post (IaC, secrets management, SLOs, hardening, DR) is to make the cluster boring. A boring cluster is one that provisions identically every time, fails predictably, recovers automatically, and never surprises the on-call engineer at 3 AM.

The measure of a mature infrastructure is not uptime. Uptime is a lagging indicator. The measure is time-to-restore after failure. A team that can rebuild the full cluster in 25 minutes from a known-good backup can tolerate catastrophic failures. A team that has never tested their DR procedure cannot.

The on-call engineer who never had to use the runbook is the engineer who will be lost when the runbook matters. Test your recovery. The drill is the work.

Tags

#kubernetes #devops #security #monitoring #best-practices #guide #senior