K3s DevOps: IaC, Secrets, SLOs, Security, and Disaster Recovery
Deep dive into K3s from the DevOps operator perspective: Terraform provisioning, secrets management, SLO/error budgets, CIS hardening, and DR with real RTO/RPO targets.
Infrastructure is a product. It has users (your engineering team), uptime requirements (your SLAs), a security posture (your compliance obligations), and a lifecycle (your upgrade and DR plan). Most Kubernetes guides stop at deploying a workload. This one starts where they stop.
The previous posts in this series covered K3s fundamentals and agile team environment lifecycles. This post focuses on what the DevOps engineer actually owns after the cluster is running: provisioning it repeatably, managing secrets safely, defining and defending service levels, hardening the security posture, and making sure a failure does not become a disaster.
Every section is written for the person who has to answer to an incident postmortem, not just the person who wants to get something running.
Infrastructure as Code: Provisioning K3s with Terraform
A cluster you provisioned by hand is a cluster you cannot reproduce under pressure. When the VPS burns at 2 AM on a Saturday, the question is not "how do I install K3s" but "how fast can I get an identical cluster back." The answer is Terraform.
The pattern is a layered stack: Terraform provisions the machines and network, a cloud-init script installs K3s, and Ansible handles post-installation configuration (kubeconfig download, certificate rotation, firewall rules). Each layer is independently testable and replaceable.
Terraform Module Structure
# main.tf β provisions a single-node K3s server on Hetzner Cloud
terraform {
required_providers {
hcloud = {
source = "hetznercloud/hcloud"
version = "~> 1.45"
}
}
backend "s3" {
bucket = "your-tfstate-bucket"
key = "k3s/production/terraform.tfstate"
region = "eu-central-1"
# use a real remote backend; never commit state to git
}
}
resource "hcloud_server" "k3s_server" {
name = "k3s-${var.environment}"
server_type = var.server_type # cx22 for staging, cx32 for production
image = "ubuntu-24.04"
location = var.location
ssh_keys = [hcloud_ssh_key.deploy.id]
  user_data = templatefile("${path.module}/cloud-init.yaml.tpl", {
    k3s_version = var.k3s_version
    k3s_token   = random_password.k3s_token.result
    environment = var.environment
    extra_args  = var.k3s_extra_args
    # the server's own IP is not known before creation, so pass a stable
    # DNS name or reserved IP for the certificate SAN
    server_ip   = var.tls_san_address
  })
  lifecycle {
    # Terraform requires a literal here; lifecycle arguments cannot reference variables
    prevent_destroy = true
  }
}
resource "hcloud_firewall" "k3s" {
name = "k3s-${var.environment}"
rule {
direction = "in"
protocol = "tcp"
port = "6443"
source_ips = var.allowed_cidr_blocks # only your CI runners and VPN
}
rule {
direction = "in"
protocol = "tcp"
port = "80"
source_ips = ["0.0.0.0/0", "::/0"]
}
rule {
direction = "in"
protocol = "tcp"
port = "443"
source_ips = ["0.0.0.0/0", "::/0"]
}
}
resource "random_password" "k3s_token" {
length = 64
special = false
}
output "k3s_token" {
value = random_password.k3s_token.result
sensitive = true
}
output "server_ipv4" {
value = hcloud_server.k3s_server.ipv4_address
}
# cloud-init.yaml.tpl: runs once on first boot
#cloud-config
package_update: true
packages:
- curl
- jq
- fail2ban
- ufw
write_files:
- path: /etc/rancher/k3s/config.yaml
content: |
token: "${k3s_token}"
tls-san:
- "${server_ip}"
disable:
- traefik # we install it separately via Helm for version control
kube-apiserver-arg:
- "audit-log-path=/var/log/k3s-audit.log"
- "audit-log-maxage=30"
- "audit-log-maxbackup=3"
- "audit-log-maxsize=100"
- "audit-policy-file=/etc/rancher/k3s/audit-policy.yaml"
kubelet-arg:
- "protect-kernel-defaults=true"
- "event-qps=0"
runcmd:
- ufw default deny incoming
- ufw default allow outgoing
- ufw allow 22/tcp
- ufw allow 80/tcp
- ufw allow 443/tcp
- ufw allow 6443/tcp
- ufw --force enable
- curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="${k3s_version}" sh -
- systemctl enable k3s
Variables and Environments
# variables.tf
variable "environment" {
type = string
description = "staging or production"
validation {
condition = contains(["staging", "production"], var.environment)
error_message = "Environment must be staging or production."
}
}
variable "k3s_version" {
type = string
default = "v1.29.3+k3s1"
# pin the version; never use 'latest' in production
}
variable "k3s_extra_args" {
type = list(string)
default = []
}
variable "server_type" {
type = string
default = "cx22"
}
variable "allowed_cidr_blocks" {
  type      = list(string)
  sensitive = true
}
variable "tls_san_address" {
  type        = string
  description = "DNS name or reserved IP to add to the API server certificate (tls-san)"
}
The prevent_destroy lifecycle rule forces Terraform to error if someone tries to destroy and recreate the server. Rebuilding production requires removing that flag explicitly, an intentional friction that prevents accidents. (Terraform only accepts a literal value for prevent_destroy, so it cannot be toggled per environment with a variable; if staging must stay destroyable, keep it in a separate root module.)
Secrets Management
Secrets in Kubernetes have a fundamental problem: a Secret resource is only base64-encoded, not encrypted. Anyone with cluster read access can decode it, and committing the raw YAML to git is equivalent to committing plain-text passwords.
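The "secret" is one function call away from plaintext; a quick sketch of what any reader of the resource can do:

```python
import base64

# What `kubectl get secret -o yaml` shows is an encoding, not encryption.
# Anyone who can read the Secret recovers the value in one call:
encoded = "c3VwZXJzZWNyZXQ="          # the value as it appears in the manifest
plaintext = base64.b64decode(encoded).decode()
print(plaintext)  # supersecret
```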
There are two production-grade solutions. The right choice depends on whether you control your secret store or delegate it to a cloud provider.
Option 1: Sealed Secrets (self-hosted)
Sealed Secrets encrypts your secret with a public key that only the controller in your cluster can decrypt. You commit the encrypted SealedSecret to git safely.
# install the controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets \
--namespace kube-system \
--set fullnameOverride=sealed-secrets-controller
# install kubeseal CLI
brew install kubeseal # or download the binary
# fetch the cluster public key
kubeseal --fetch-cert \
--controller-name=sealed-secrets-controller \
--controller-namespace=kube-system \
> pub-sealed-secrets.pem
# seal a secret; the output is safe to commit
kubectl create secret generic db-credentials \
--from-literal=password=supersecret \
--dry-run=client \
-o yaml \
| kubeseal \
--cert pub-sealed-secrets.pem \
--format yaml \
> k8s/base/db-credentials-sealed.yaml
The resulting SealedSecret manifest looks like this:
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
name: db-credentials
namespace: production
spec:
encryptedData:
password: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
template:
metadata:
name: db-credentials
namespace: production
type: Opaque
When the controller sees this resource, it decrypts it and creates the corresponding Secret in the cluster. The plaintext never touches git.
Key rotation: when you need to rotate the sealing key itself (e.g., after a security incident), generate a new key pair, re-encrypt all secrets, and replace the controller key. The old key can be kept for decryption of previously-sealed secrets during the transition.
# rotate the sealing key
kubectl -n kube-system delete secret sealed-secrets-key
# the controller generates a new key on restart
kubectl -n kube-system rollout restart deployment sealed-secrets-controller
# fetch the new public key and re-seal all secrets
Option 2: External Secrets Operator (ESO)
ESO pulls secrets from an external store (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault, GCP Secret Manager) and creates Kubernetes Secret objects in the cluster. The external store is the source of truth. No secrets ever live in git.
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
--namespace external-secrets \
--create-namespace
# SecretStore: connects ESO to AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
name: aws-secrets-manager
namespace: production
spec:
provider:
aws:
service: SecretsManager
region: us-east-1
auth:
secretRef:
accessKeyIDSecretRef:
name: aws-credentials
key: access-key-id
secretAccessKeySecretRef:
name: aws-credentials
key: secret-access-key
---
# ExternalSecret: pulls a specific secret from AWS and creates a K8s Secret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: db-credentials
namespace: production
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets-manager
kind: SecretStore
target:
name: db-credentials
creationPolicy: Owner
data:
- secretKey: password
remoteRef:
key: production/db-credentials
property: password
ESO re-syncs the secret every refreshInterval. If the value changes in the external store, the Kubernetes Secret is updated automatically. Pods that mount secrets as volumes see the update within the kubelet's sync period (roughly a minute by default). Pods that consume secrets as environment variables need a rollout to pick up the change.
When to choose which:
- Sealed Secrets: small team, no cloud provider dependency, simple rotation cadence
- ESO: regulated environment, existing Vault/Secrets Manager investment, automated secret rotation required
SLO, SLI, and Error Budgets
An SLO (Service Level Objective) is a commitment: "99.5% of HTTP requests return 2xx within 500ms, measured over a 30-day rolling window." It is not a target you set once and forget; it is the number your team uses to decide whether to ship a risky change or spend the next sprint on reliability.
The SLI (Service Level Indicator) is how you measure it. The error budget is what you have left to spend before you breach the SLO.
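The arithmetic behind that budget is worth doing once by hand; a short sketch for the 99.5%/30-day SLO used throughout this post:

```python
# Error-budget arithmetic for a 99.5% SLO over a 30-day window.
slo = 0.995
window_days = 30

budget_fraction = 1 - slo                        # 0.5% of requests may fail
budget_hours = window_days * 24 * budget_fraction

print(f"error budget: {budget_fraction:.1%} of requests")               # 0.5%
print(f"equivalent full outage: {budget_hours:.1f}h per {window_days}d")  # 3.6h
```

That 3.6 hours is the number the error budget policy below divides up.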
Defining SLIs in Prometheus
# PrometheusRule: defines the recording rules and alerting rules for your SLO
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: api-slo
namespace: monitoring
labels:
release: kube-prometheus-stack
spec:
groups:
- name: slo.api.availability
interval: 30s
rules:
# SLI: ratio of successful requests over total requests
- record: job:http_requests_total:rate5m
expr: rate(http_requests_total[5m])
- record: job:http_request_errors:rate5m
expr: rate(http_requests_total{status=~"5.."}[5m])
- record: job:http_availability:ratio5m
expr: |
1 - (
sum(job:http_request_errors:rate5m)
/
sum(job:http_requests_total:rate5m)
)
      # 30-day availability ratio (used for error budget calculation)
      - record: job:http_availability:ratio30d
        expr: |
          1 - (
            sum(sum_over_time(job:http_request_errors:rate5m[30d]))
            /
            sum(sum_over_time(job:http_requests_total:rate5m[30d]))
          )
      # additional windows referenced by the burn-rate alerts below
      - record: job:http_availability:ratio30m
        expr: |
          1 - (sum(rate(http_requests_total{status=~"5.."}[30m])) / sum(rate(http_requests_total[30m])))
      - record: job:http_availability:ratio1h
        expr: |
          1 - (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])))
      - record: job:http_availability:ratio6h
        expr: |
          1 - (sum(rate(http_requests_total{status=~"5.."}[6h])) / sum(rate(http_requests_total[6h])))
- name: slo.api.latency
interval: 30s
rules:
# SLI: ratio of requests completing under 500ms
- record: job:http_latency_fast:ratio5m
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
- name: slo.api.alerts
rules:
# Multi-window burn rate alert: fires when the error budget burns too fast
- alert: SLOErrorBudgetBurnRateHigh
expr: |
(
job:http_availability:ratio5m < (1 - 14.4 * (1 - 0.995))
and
job:http_availability:ratio1h < (1 - 14.4 * (1 - 0.995))
)
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "API error budget burning at 14.4x: 2% of the monthly budget gone per hour"
runbook: "https://wiki.internal/runbooks/api-slo-burn"
- alert: SLOErrorBudgetBurnRateMedium
expr: |
(
job:http_availability:ratio30m < (1 - 6 * (1 - 0.995))
and
job:http_availability:ratio6h < (1 - 6 * (1 - 0.995))
)
for: 15m
labels:
severity: warning
team: platform
annotations:
summary: "API error budget burning at 6x rate; investigate before the weekend"
The burn rate multipliers (14.4x and 6x) come from the Google SRE Workbook. A 14.4x burn rate consumes 2% of the monthly error budget per hour; sustained, it exhausts the full 30-day budget in about 50 hours. This is the threshold for a critical page: wake someone up.
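The multipliers translate directly into time-to-exhaustion; a quick sanity check of the numbers:

```python
# Time-to-exhaustion for the burn rates used in the alerts above.
window_hours = 30 * 24   # 720h in the 30-day SLO window

def hours_to_exhaustion(burn_rate: float) -> float:
    """At burn rate N, the budget drains N times faster than the window allows."""
    return window_hours / burn_rate

print(f"14.4x burn: budget gone in {hours_to_exhaustion(14.4):.0f}h")   # 50h
print(f"6x burn: budget gone in {hours_to_exhaustion(6):.0f}h")         # 120h
print(f"budget consumed per hour at 14.4x: {14.4 / window_hours:.1%}")  # 2.0%
```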
Error Budget Policy
The error budget is not just a metric. It is a decision-making framework. Document it explicitly:
## API Service Error Budget Policy
SLO: 99.5% availability over 30 days
Error budget: 0.5% of requests = ~3.6 hours of full outage per month
### When budget is > 50% remaining
- Normal feature development and deployments permitted
- Risky infrastructure changes permitted with review
### When budget is 25-50% remaining
- Feature freeze on changes that affect request path
- All deployments require two approvers
- On-call rotation increased to 30-minute response SLA
### When budget is < 25% remaining
- Full feature freeze
- Engineering focus shifts to reliability exclusively
- No infrastructure changes without incident commander approval
### When budget is exhausted
- Production deployments suspended until budget recovers
- Post-incident review required before resuming normal operations
Grafana Dashboard for Error Budget
{
"title": "SLO Error Budget",
"panels": [
{
"title": "30-day Availability",
"type": "stat",
"targets": [
{
"expr": "job:http_availability:ratio30d * 100",
"legendFormat": "Availability %"
}
],
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "yellow", "value": 99},
{"color": "green", "value": 99.5}
]
}
},
{
"title": "Error Budget Remaining",
"type": "gauge",
"targets": [
{
"expr": "(job:http_availability:ratio30d - 0.995) / (1 - 0.995) * 100",
"legendFormat": "Budget %"
}
]
}
]
}
Security Hardening
A default K3s installation is functional but not hardened. The distance between "works" and "secure" is exactly where breaches happen. Security hardening is not a one-time task; it is a set of controls you implement, test, and maintain.
Pod Security Standards
Kubernetes 1.25+ replaced PodSecurityPolicies with Pod Security Standards. Enforce the restricted profile on production namespaces and baseline on everything else.
# Apply labels to enforce Pod Security Standards per namespace
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: latest
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/audit-version: latest
What restricted enforces:
- No privileged containers
- No privilege escalation (allowPrivilegeEscalation: false)
- Containers must run as non-root
- All capabilities dropped; only NET_BIND_SERVICE may be added back
- Seccomp profile must be RuntimeDefault or Localhost

The standard does not actually require a read-only root filesystem, but readOnlyRootFilesystem: true is cheap hardening and worth setting alongside it, as the example below does.
A deployment that passes restricted validation looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
namespace: production
spec:
template:
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: api
image: ghcr.io/your-org/api:sha-abc123
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
volumeMounts:
- name: tmp
mountPath: /tmp # writable tmpfs for apps that need it
- name: cache
mountPath: /app/cache
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir: {}
Network Policies
By default, every pod in a Kubernetes cluster can reach every other pod across all namespaces. This is wrong for production. Define an explicit allow-list using NetworkPolicy.
# Default deny all ingress and egress in production
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
---
# Allow ingress to the API only from Traefik
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-ingress-from-traefik
namespace: production
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
podSelector:
matchLabels:
app.kubernetes.io/name: traefik
ports:
- port: 8080
---
# Allow the API to reach PostgreSQL
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-api-to-postgres
namespace: production
spec:
podSelector:
matchLabels:
app: postgres
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: api
ports:
- port: 5432
---
# Allow DNS resolution for all pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns
namespace: production
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
---
# Allow the API to reach external services (HTTPS only)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-api-egress-https
namespace: production
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Egress
egress:
- ports:
- port: 443
Audit Logging
The audit policy in the cloud-init config above is incomplete without the actual policy file. Define it explicitly; it is the record you will need after a breach.
# /etc/rancher/k3s/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Log all requests to secrets: full body on creation/update
- level: RequestResponse
resources:
- group: ""
resources: ["secrets"]
verbs: ["create", "update", "patch", "delete"]
# Log exec and port-forward operations: high privilege, high risk
- level: RequestResponse
resources:
- group: ""
resources: ["pods/exec", "pods/portforward", "pods/attach"]
# Log RBAC changes
- level: RequestResponse
resources:
- group: "rbac.authorization.k8s.io"
resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
# Log authentication failures
- level: Metadata
omitStages:
- RequestReceived
users: ["system:anonymous"]
# Reduce noise from read-only operations on common resources
- level: None
resources:
- group: ""
resources: ["configmaps", "endpoints", "services"]
verbs: ["get", "list", "watch"]
# Default: log metadata for everything else
- level: Metadata
omitStages:
- RequestReceived
Image Security
Never pull :latest. Pin to a digest or a SHA-tagged image. Use a policy engine to enforce this.
# Kyverno ClusterPolicy β blocks latest tag and requires digest in production
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-image-digest
spec:
validationFailureAction: Enforce
background: false
rules:
- name: check-image-tag
match:
any:
- resources:
kinds: ["Pod"]
namespaces: ["production", "staging"]
    validate:
      message: "Production images must use an immutable reference: a commit-SHA tag or a digest, not :latest or another mutable tag."
      anyPattern:
      - spec:
          containers:
          - image: "*:sha-*"
      - spec:
          containers:
          - image: "*@sha256:*"
Install Kyverno via Helm:
helm repo add kyverno https://kyverno.github.io/kyverno/
helm install kyverno kyverno/kyverno \
--namespace kyverno \
--create-namespace \
--set replicaCount=1 # single replica for K3s; use 3 for HA
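The policy's intent can be expressed outside Kyverno as a simple predicate; a sketch (the `sha-` prefix follows the CI tag convention used in this post's examples, and `is_pinned` is a hypothetical helper, not a Kyverno API):

```python
def is_pinned(image: str) -> bool:
    """True if the image reference is immutable: a digest or a commit-SHA tag."""
    if "@sha256:" in image:
        return True                         # digest references never move
    last = image.rsplit("/", 1)[-1]         # strip the registry/repo path
    tag = last.rsplit(":", 1)[-1] if ":" in last else ""
    return tag.startswith("sha-")           # CI-produced commit tags

print(is_pinned("ghcr.io/your-org/api:sha-abc123"))  # True
print(is_pinned("ghcr.io/your-org/api:latest"))      # False
print(is_pinned("nginx"))                            # False (implicit :latest)
```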
CIS Benchmark Validation
Run kube-bench against the cluster to measure CIS compliance. This is the tool auditors expect to see results from.
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl wait --for=condition=complete job/kube-bench --timeout=120s
kubectl logs job/kube-bench
Key checks that K3s fails by default and how to fix them:
| Check | Default State | Fix |
|---|---|---|
| API server audit logging | Disabled | Add audit-log-path to k3s config |
| Anonymous authentication | Enabled | Add --anonymous-auth=false kubelet arg |
| Read-only port | 10255 open | Add --read-only-port=0 kubelet arg |
| Protect kernel defaults | Not set | Add --protect-kernel-defaults=true kubelet arg |
| Event rate limiting | Not set | Add --event-qps=0 kubelet arg |
Resource Optimization
Kubernetes gives you the tools to describe resource requirements. Most teams skip this step. The result is either a cluster where pods starve each other during peak load, or a cluster where you pay for three times the capacity you actually need.
Requests vs Limits
The critical distinction: requests determine scheduling (where the pod lands); limits determine enforcement (what happens when usage exceeds the threshold). A pod with limits.cpu: 500m and requests.cpu: 100m can burst to 500m on a node with spare capacity, but is only guaranteed 100m.
resources:
  requests:
    cpu: "100m"      # guaranteed allocation, used for scheduling
    memory: "128Mi"  # guaranteed allocation, used for scheduling
  limits:
    cpu: "500m"      # max burst; throttled if exceeded (not killed)
    memory: "256Mi"  # hard limit; OOMKilled if exceeded
Never set limits.memory below requests.memory. Never omit requests: without them the scheduler has no data to place the pod sensibly, and a pod with neither requests nor limits lands in the BestEffort QoS class, first in line for eviction under node pressure.
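The scheduling side is plain arithmetic over requests; a sketch with hypothetical pod sizes:

```python
# Hypothetical pods on a 4-vCPU node: the scheduler sums requests only.
node_allocatable_mcpu = 4000

pods = [
    {"name": "api",    "request_mcpu": 100, "limit_mcpu": 500},
    {"name": "worker", "request_mcpu": 250, "limit_mcpu": 1000},
    {"name": "cache",  "request_mcpu": 200, "limit_mcpu": 200},
]

requested = sum(p["request_mcpu"] for p in pods)
burst_ceiling = sum(p["limit_mcpu"] for p in pods)

# Limits may oversubscribe the node; only requests gate placement.
print(f"scheduled against requests: {requested}m of {node_allocatable_mcpu}m")  # 550m of 4000m
print(f"worst case if all pods burst at once: {burst_ceiling}m")                # 1700m
```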
VPA for Automatic Tuning
The Vertical Pod Autoscaler observes actual resource usage and recommends (or applies) better requests values. Use it in Off mode first to get recommendations without side effects.
# install VPA from the kubernetes/autoscaler repo (the documented install path)
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-vpa
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api
updatePolicy:
updateMode: "Off" # recommend only; do not mutate pods automatically
resourcePolicy:
containerPolicies:
- containerName: api
minAllowed:
cpu: "50m"
memory: "64Mi"
maxAllowed:
cpu: "2000m"
memory: "1Gi"
After a week of observations, check recommendations:
kubectl describe vpa api-vpa -n production
# Look for "Recommendation:" section
# Lower bound = minimum safe, Target = recommended, Upper bound = spike headroom
Apply the recommended values to your Deployment manifest. Re-run the VPA in Off mode after each significant traffic change.
LimitRange and ResourceQuota per Namespace
Define defaults so that any pod without explicit requests still gets reasonable values:
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- type: Container
default:
cpu: "200m"
memory: "256Mi"
defaultRequest:
cpu: "50m"
memory: "64Mi"
max:
cpu: "2000m"
memory: "2Gi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: production
spec:
hard:
requests.cpu: "8"
requests.memory: "16Gi"
limits.cpu: "16"
limits.memory: "32Gi"
pods: "50"
services: "20"
persistentvolumeclaims: "10"
Disaster Recovery
Disaster recovery is the practice of defining what you can survive and building the systems that let you survive it. The two numbers that matter are RTO (Recovery Time Objective: how long you can be down) and RPO (Recovery Point Objective: how much data you can lose).
Define these before building the DR system, not after. "We'll figure it out" is not a DR plan.
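The two targets turn directly into engineering constraints; a sketch using the 30-minute snapshot interval and 25-minute RTO from later in this section (the per-step minutes are illustrative, not measured):

```python
# Worst-case RPO is bounded by the backup interval.
snapshot_interval_min = 30
worst_case_rpo_min = snapshot_interval_min   # data written right after a snapshot

# RTO is the sum of the recovery steps, so budget each one explicitly:
rto_budget_min = {
    "terraform apply + cloud-init":  10,
    "download + restore datastore":   5,
    "cluster convergence + checks":   5,
    "smoke tests + DNS cutover":      5,
}

print(f"worst-case RPO: {worst_case_rpo_min} min")
print(f"target RTO: {sum(rto_budget_min.values())} min")  # 25 min
```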
State Inventory
Before you can recover, you need to know what state exists and where it lives:
| State | Location | Backup Strategy | RPO |
|---|---|---|---|
| Cluster config (CRDs, RBAC, deployments) | etcd / SQLite | etcdctl snapshot or SQLite backup | 1 hour |
| Application database | PostgreSQL StatefulSet or external | pg_dump + WAL streaming | 5 minutes |
| Uploaded files / object store | S3 or Longhorn PVC | Cross-region replication | 15 minutes |
| Secrets | Sealed Secrets in git | Git history | Immediate |
| Container images | ghcr.io / Docker Hub | Pulled fresh from registry | Immediate |
K3s Datastore Backup
For single-node K3s (SQLite), the datastore is a file at /var/lib/rancher/k3s/server/db/state.db. Back it up with a systemd timer:
# /etc/systemd/system/k3s-backup.service
[Unit]
Description=K3s SQLite Backup
After=k3s.service
[Service]
Type=oneshot
User=root
ExecStart=/usr/local/bin/k3s-backup.sh
# /usr/local/bin/k3s-backup.sh
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/opt/k3s-backups"
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="$BACKUP_DIR/k3s-state-$DATE.db"
mkdir -p "$BACKUP_DIR"
# take a consistent online snapshot via SQLite's backup API (needs the sqlite3 CLI);
# this avoids stopping k3s and dropping the API server every 30 minutes
sqlite3 /var/lib/rancher/k3s/server/db/state.db ".backup '$BACKUP_FILE'"
# compress
gzip "$BACKUP_FILE"
# upload to S3 (requires aws CLI configured)
aws s3 cp "$BACKUP_FILE.gz" "s3://your-backup-bucket/k3s/$(hostname)/$DATE.db.gz"
# keep only last 7 days locally
find "$BACKUP_DIR" -name "*.gz" -mtime +7 -delete
# /etc/systemd/system/k3s-backup.timer
[Unit]
Description=K3s SQLite Backup Timer
[Timer]
# every 30 minutes (unit files do not allow inline comments after directives)
OnCalendar=*:0/30
Persistent=true
[Install]
WantedBy=timers.target
For HA K3s with embedded etcd, use the built-in snapshot:
# manual snapshot
k3s etcd-snapshot save --name pre-upgrade-$(date +%Y%m%d)
# list snapshots
k3s etcd-snapshot list
# restore from snapshot (cluster must be stopped)
systemctl stop k3s
k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/pre-upgrade-20260306.db
systemctl start k3s
Configure automatic etcd snapshots in k3s config:
# /etc/rancher/k3s/config.yaml
etcd-snapshot-schedule-cron: "*/30 * * * *"
etcd-snapshot-retention: 96 # keep 96 snapshots = 2 days at 30-minute intervals
etcd-snapshot-dir: /opt/k3s-snapshots
Application Database DR
PostgreSQL WAL archiving gives you point-in-time recovery to within seconds. With pgBackRest or WAL-G:
# pgBackRest backup from within the postgres pod
kubectl exec -it postgres-0 -n production -- \
pgbackrest --stanza=main backup --type=full
# WAL-G continuous archiving setup (add to postgresql.conf)
# archive_mode = on
# archive_command = 'wal-g wal-push %p'
# restore_command = 'wal-g wal-fetch %f %p'
Cluster Restore Runbook
Document the sequence before you need it. The runbook is the artifact that makes the difference between a 30-minute recovery and a 4-hour recovery.
## K3s Cluster Restore Runbook
### Preconditions
- Access to: S3 backup bucket, DNS provider, new VPS
- Required secrets: K3s token (in Vault), DB password (in Vault), TLS cert private key
### Step 1: Provision new node (Terraform)
```bash
cd infrastructure/terraform
terraform apply -var="environment=production" -target=hcloud_server.k3s_server
# wait for cloud-init to complete: ssh root@<new-ip> journalctl -f -u cloud-final
```

### Step 2: Restore K3s datastore

```bash
# download the latest backup
aws s3 cp s3://your-backup-bucket/k3s/latest.db.gz /tmp/
gunzip /tmp/latest.db.gz
systemctl stop k3s
cp /tmp/latest.db /var/lib/rancher/k3s/server/db/state.db
systemctl start k3s
```

### Step 3: Verify cluster

```bash
kubectl get nodes    # should show Ready
kubectl get pods -A  # should show pods recovering
```

### Step 4: Verify application

Run smoke tests against the new IP directly before cutting DNS over.

### Step 5: Update DNS

Point your domain A record to the new VPS IP. TTL should be 60s in production (set it before an incident, not during).

Expected RTO: 25 minutes
Testing DR
A DR plan you have never tested is a DR plan that will fail. Run a full DR drill quarterly:
# simulate a full cluster loss on staging
terraform destroy -target=hcloud_server.k3s_server -var="environment=staging"
# start the timer
# run the runbook from step 1
# measure actual RTO vs target RTO
# document in the incident log what was slower than expected
# update the runbook before the next quarter
Upgrade Strategy
K3s releases follow Kubernetes upstream with a 2-4 week lag. Running more than 2 minor versions behind means you are missing security patches. Running on a version the K3s team no longer supports means you are on your own.
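Checking your lag against upstream is worth automating; a sketch (the version strings are examples, not current releases):

```python
def k8s_minor(k3s_version: str) -> int:
    """Extract the Kubernetes minor version from a K3s string like v1.29.3+k3s1."""
    core = k3s_version.lstrip("v").split("+")[0]   # -> "1.29.3"
    return int(core.split(".")[1])

running  = "v1.29.3+k3s1"   # what the cluster reports
upstream = "v1.31.0+k3s1"   # hypothetical latest on the stable channel

lag = k8s_minor(upstream) - k8s_minor(running)
print(f"{lag} minor versions behind")   # 2 minor versions behind
if lag > 2:
    print("outside comfortable skew; schedule an upgrade")
```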
The system-upgrade-controller automates rolling upgrades across your nodes:
# Plan CRD: upgrades server nodes to a specific K3s version
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server-upgrade
  namespace: system-upgrade
spec:
  concurrency: 1   # upgrade one node at a time
  cordon: true     # cordon node before upgrade, uncordon after
  serviceAccountName: system-upgrade
  # channel and version are spec-level fields; pin a version instead of the
  # channel for fully deterministic upgrades
  channel: https://update.k3s.io/v1-release/channels/stable
  # version: v1.29.3+k3s1
  upgrade:
    image: rancher/k3s-upgrade
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule
    operator: Exists
Before any cluster upgrade:
- Take a datastore snapshot
- Check the Kubernetes changelog for API deprecations
- Validate your Helm chart versions against the new API versions
- Run the upgrade on staging first; let it soak for 48 hours
- Apply to production during low-traffic window with a rollback plan documented
Observability as Infrastructure
The observability stack is not optional infrastructure. It is the system that tells you whether your SLOs are being met, whether your security policies are working, and whether you have enough capacity. It deserves the same reliability and operational rigor as the applications it monitors.
Structured Logging Pipeline
Applications should emit structured JSON logs. The collection pipeline (Promtail to Loki to Grafana) should have its own resource quotas and backups.
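What "structured JSON logs" means concretely: one self-describing object per line, so the pipeline can extract fields without brittle regex parsing. A minimal sketch of an emitter:

```python
import json
import sys
import time

def log(level: str, msg: str, **fields):
    """Emit one JSON object per line; collectors parse fields without regexes."""
    record = {"ts": time.time(), "level": level, "msg": msg, **fields}
    sys.stdout.write(json.dumps(record) + "\n")

log("info", "request handled", path="/api/v1/users", status=200, duration_ms=42)
```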
# DaemonSet for Promtail β collects logs from all nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: promtail
namespace: monitoring
spec:
selector:
matchLabels:
app: promtail
  template:
    metadata:
      labels:
        app: promtail   # must match spec.selector above
    spec:
      serviceAccountName: promtail
tolerations:
- effect: NoSchedule
operator: Exists
containers:
- name: promtail
image: grafana/promtail:2.9.4
args:
- -config.file=/etc/promtail/promtail.yaml
volumeMounts:
- name: config
mountPath: /etc/promtail
- name: varlog
mountPath: /var/log
readOnly: true
        # K3s runs containerd, so pod logs live under /var/log/pods (already
        # covered by the /var/log mount); this path only matters on Docker nodes
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
resources:
requests:
cpu: "50m"
memory: "64Mi"
limits:
cpu: "200m"
memory: "128Mi"
volumes:
- name: config
configMap:
name: promtail-config
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
Alertmanager Routing
Alerts that go to the wrong channel get ignored. Route by severity and namespace:
# alertmanager.yaml β route critical production alerts to PagerDuty, rest to Slack
global:
resolve_timeout: 5m
route:
group_by: ["alertname", "namespace", "severity"]
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: "slack-default"
routes:
- match:
severity: critical
namespace: production
receiver: "pagerduty-production"
continue: true # also send to Slack
- match:
severity: critical
namespace: staging
receiver: "slack-oncall"
- match:
severity: warning
receiver: "slack-alerts"
receivers:
- name: "pagerduty-production"
pagerduty_configs:
- routing_key: "${PAGERDUTY_KEY}"
description: '{{ template "pagerduty.default.description" . }}'
- name: "slack-oncall"
slack_configs:
- api_url: "${SLACK_WEBHOOK_URL}"
channel: "#oncall"
title: "{{ .GroupLabels.alertname }}"
text: "{{ range .Alerts }}{{ .Annotations.summary }}\nRunbook: {{ .Annotations.runbook }}{{ end }}"
- name: "slack-alerts"
slack_configs:
- api_url: "${SLACK_WEBHOOK_URL}"
channel: "#alerts"
The Infrastructure Mindset
Kubernetes clusters are not pets. The goal of every practice in this post (IaC, secrets management, SLOs, hardening, DR) is to make the cluster boring. A boring cluster is one that provisions identically every time, fails predictably, recovers automatically, and never surprises the on-call engineer at 3 AM.
The measure of a mature infrastructure is not uptime. Uptime is a lagging indicator. The measure is time-to-restore after failure. A team that can rebuild the full cluster in 25 minutes from a known-good backup can tolerate catastrophic failures. A team that has never tested their DR procedure cannot.
The on-call engineer who never had to use the runbook is the engineer who will be lost when the runbook matters. Test your recovery. The drill is the work.