Data Analysis for Backend Engineers: Using Metrics to Make Better Technical Decisions
Master data analysis as a backend engineer. Learn to collect meaningful metrics, analyze performance data, avoid common pitfalls, and make technical decisions backed by evidence instead of hunches.
Introduction: Stop Making Decisions Based on Hunches
You’re in a meeting. The question comes up: “Should we migrate from PostgreSQL to MongoDB?”
Team A says: “MongoDB is faster. We should switch.”
Team B says: “PostgreSQL is more reliable. Don’t change.”
Your lead asks you: “What do you think?”
You don’t have data. You have opinions. You have feelings. You have experiences with other systems.
So you guess.
Three months later, you’ve migrated. You’ve spent 200 engineering hours. And you discover: MongoDB wasn’t actually faster for your workload. It’s slightly slower. But now you’re stuck.
This happens because engineers don’t think like analysts.
Engineers think in terms of: Does this work? Is it elegant? Will it scale?
Analysts think in terms of: What’s the evidence? What does the data show? What’s the uncertainty?
Here’s the painful truth: Most backend engineers are terrible at data analysis.
Not because they’re not smart. But because they were never taught to think this way.
Most decisions you make should be backed by data. Not gut feel. Not industry trends. Not what worked at your previous company.
Data.
This guide teaches you how to think like an analyst while staying an engineer. How to collect the right metrics. How to extract signal from noise. How to make technical decisions based on evidence.
And most importantly: how to avoid the common mistakes that make engineers distrust their own data.
Chapter 1: Why Engineers Avoid Data Analysis
Before we dive in, let’s understand the resistance.
The Myth of “I’ll Just Know”
Many engineers believe they can intuit whether something is working.
“I can feel the latency. It seems slower.”
“The error rate seems higher.”
“This feels like it’ll scale better.”
Your intuition is garbage.
Not because you’re not smart. But because human brains are terrible at:
- Quantifying - You can’t accurately estimate “feels slower” in milliseconds
- Averaging - You remember outliers, not the average case
- Comparing - You can’t compare two things without a baseline
- Detecting patterns - Your brain finds patterns that don’t exist (false positives)
- Handling uncertainty - You either believe something completely or not at all
Example: You make a change. The system “feels” faster.
But you didn’t measure:
- Were requests actually faster on average?
- Or did you just notice the fast ones?
- Did performance improve for everyone, or just for certain request types?
- Was there a day-of-week effect (Mondays are always slower)?
- Did you actually change something, or did load happen to decrease?
Without data, you can’t answer these questions.
The Paralysis of “Need Perfect Data”
The other extreme: engineers who avoid analysis because they think they need perfect data.
“We don’t have enough historical data.”
“The metrics might be inaccurate.”
“What if there are confounding variables?”
Paralysis. Analysis paralysis.
You don’t need perfect data. You need good enough data to make a decision better than a guess.
The False Choice: Analysis vs. Shipping
Some engineers think: “Analysis slows us down. We need to move fast.”
Wrong choice.
Analysis doesn’t slow you down. Bad decisions slow you down. Wasted work slows you down. Rollbacks slow you down.
A 30-minute analysis that prevents a bad migration saves you 200 hours.
That’s not slow. That’s fast.
Chapter 2: Statistical Thinking for Engineers
Before you analyze data, you need to think differently.
The Three Statistical Concepts That Matter
Concept 1: Average vs. Distribution
Most engineers think of metrics as single numbers.
“Our latency is 100ms.”
Wrong. There’s no single latency. There’s a distribution.
Request 1: 95ms
Request 2: 102ms
Request 3: 98ms
Request 4: 1500ms (timeout retry)
Request 5: 101ms
Average: 379ms
Median: 101ms
p99: 1500ms
Which one matters?
It depends on what you care about.
- Average - Useful for total resource consumption
- Median - Useful for typical user experience
- p99 - Useful for worst-case scenarios
A system with average 100ms but p99 5000ms is terrible for users, even though the average looks good.
This is why monitoring p99, p95, and p50 matters more than average.
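Here's a quick sketch that reproduces those numbers. The helper and its nearest-rank percentile indexing are illustrative, not from any particular metrics library:

// summarize computes mean, median, and p99 for a latency sample (in ms).
// Uses the same nearest-rank percentile indexing as the other snippets here.
// (Standard library only: "sort".)
func summarize(latencies []float64) (mean, median, p99 float64) {
	sorted := append([]float64(nil), latencies...)
	sort.Float64s(sorted)

	sum := 0.0
	for _, v := range sorted {
		sum += v
	}
	mean = sum / float64(len(sorted))
	median = sorted[len(sorted)/2]
	p99 = sorted[int(float64(len(sorted))*0.99)]
	return mean, median, p99
}

// Usage:
// mean, median, p99 := summarize([]float64{95, 102, 98, 1500, 101})
// // mean ≈ 379ms, median = 101ms, p99 = 1500ms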
Concept 2: Variance and Noise
Not all differences are real. Some are noise.
Example: You measure API latency on Monday and Tuesday.
Monday: 105ms average
Tuesday: 98ms average
Is the system faster on Tuesday?
Maybe. Or maybe Tuesday had less traffic. Or lower CPU load. Or different network conditions.
You need to know: Is the difference real, or just noise?
Signal to Noise Ratio - Can you detect real changes amid normal variation?
Low ratio: You can’t tell if things changed. Too much noise.
High ratio: Real changes are obvious.
To increase the ratio:
- Collect more data (more samples = more certainty)
- Reduce variability (control confounding factors)
- Look at trends, not individual measurements
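To put a rough number on "is this real or noise", compare the gap between two means against the standard error of that gap. A sketch; this is a heuristic filter, not a formal hypothesis test, and it assumes at least two measurements per sample:

// isLikelySignal reports whether the gap between two samples' means is large
// relative to the noise: here, more than two standard errors apart.
// A rough "is this worth investigating" check, not a formal test.
func isLikelySignal(a, b []float64) bool {
	mean := func(v []float64) float64 {
		s := 0.0
		for _, x := range v {
			s += x
		}
		return s / float64(len(v))
	}
	variance := func(v []float64, m float64) float64 {
		s := 0.0
		for _, x := range v {
			s += (x - m) * (x - m)
		}
		return s / float64(len(v)-1)
	}

	meanA, meanB := mean(a), mean(b)
	stdErr := math.Sqrt(variance(a, meanA)/float64(len(a)) + variance(b, meanB)/float64(len(b)))
	if stdErr == 0 {
		return meanA != meanB
	}
	return math.Abs(meanA-meanB) > 2*stdErr
}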
Concept 3: Causation vs. Correlation
This is the biggest mistake engineers make.
You see two things that change together. You assume one caused the other.
Example: After you optimize the database, latency decreases.
Conclusion: The optimization worked.
But what if:
- Traffic naturally decreased that day
- Caching took effect (unrelated to your change)
- A different team’s deployment also helped
- It’s just random variation
This is why you need controlled experiments or before/after analysis with confound control.
The Data-Driven Decision Framework
Here’s the framework you should use:
Step 1: Define the Question
Not “Is this faster?” but “Is this faster for request type X, at the p99, under normal traffic conditions?”
Specific questions lead to meaningful answers.
Step 2: Identify Confounds
What else might affect the result?
- Traffic volume
- Time of day
- Day of week
- Number of concurrent requests
- Cache state
- Other deployments
- Hardware changes
If you can’t control confounds, at least be aware of them.
Step 3: Collect Data
Before and after. Long enough to see patterns. With enough detail to answer your question.
Don’t collect for 5 minutes. Collect for a week. Or a month.
Step 4: Analyze
Look for the signal. Is the change real, or noise?
Step 5: Decide
Based on the data, what should you do?
Chapter 3: Collecting the Right Metrics
You can’t analyze what you don’t measure.
Metric Types
Type 1: Application Metrics
What your code is doing.
- Request latency (p50, p95, p99)
- Error rates
- Throughput (requests/second)
- Business metrics (users, transactions, revenue)
Example in Go:
import "github.com/prometheus/client_golang/prometheus"
var (
requestLatency = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "request_latency_seconds",
Help: "Request latency in seconds",
Buckets: []float64{0.001, 0.01, 0.1, 0.5, 1, 5},
},
[]string{"endpoint", "method"},
)
errorCount = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "errors_total",
Help: "Total error count",
},
[]string{"endpoint", "error_type"},
)
)
func handleRequest(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() {
		duration := time.Since(start).Seconds()
		requestLatency.WithLabelValues(r.URL.Path, r.Method).Observe(duration)
	}()

	// Your handler logic goes here; err stands in for whatever error it returns.
	var err error
	if err != nil {
		errorCount.WithLabelValues(r.URL.Path, "internal_error").Inc()
		http.Error(w, "Error", http.StatusInternalServerError)
	}
}
Type 2: System Metrics
How the system is behaving.
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
- Database connection pool usage
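Host-level CPU, disk, and network numbers usually come from an agent like node_exporter rather than from your code, but process-level system metrics are easy to expose from Go itself. A sketch using the standard runtime package and Prometheus gauges; the metric names are made up, and client_golang's built-in Go collector already exports similar series automatically:

// Expose process-level system metrics alongside application metrics.
var goroutines = prometheus.NewGaugeFunc(
	prometheus.GaugeOpts{
		Name: "app_goroutines",
		Help: "Current number of goroutines",
	},
	func() float64 { return float64(runtime.NumGoroutine()) },
)

var heapBytes = prometheus.NewGaugeFunc(
	prometheus.GaugeOpts{
		Name: "app_heap_alloc_bytes",
		Help: "Bytes of allocated heap objects",
	},
	func() float64 {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		return float64(m.HeapAlloc)
	},
)

func init() {
	prometheus.MustRegister(goroutines, heapBytes)
}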
Type 3: Business Metrics
What matters to the business.
- Daily Active Users (DAU)
- User retention
- Feature adoption
- Conversion rates
- Revenue per user
- Churn rate
The Metrics Selection Framework
You can’t measure everything. So which metrics matter?
Rule 1: Measure Outcomes, Not Activity
Bad metrics:
- “Code deployments per week” (activity)
- “Lines of code committed” (activity)
- “Number of tests” (activity)
Good metrics:
- “Mean time to recovery after failure” (outcome)
- “System availability” (outcome)
- “User-reported bugs per 10,000 requests” (outcome)
Why? Because outcomes are what actually matter.
Rule 2: Metrics Should Be Actionable
If the metric goes down, can you do something about it?
Bad metric:
- “API latency is 105ms” - Too vague. What can you do?
Good metric:
- “p99 latency for user search is 500ms, caused by N+1 queries in the database layer” - Specific. Actionable.
Rule 3: Start with 3-5 Key Metrics
Don’t try to measure everything. Pick the 3-5 things that matter most.
For most backend systems:
- Request latency (p99)
- Error rate
- Throughput
- Resource utilization (CPU, memory)
- Business metric (whatever matters: users, revenue, etc.)
Master these. Add more later.
Chapter 4: Analyzing Performance Data
Now you have data. How do you extract meaning?
Pattern 1: Baseline and Anomalies
Every metric has a normal range.
Normal latency: 80-120ms
Normal error rate: 0.01%
Anomaly: latency 500ms (outside normal)
Anomaly: error rate 2% (outside normal)
Your job is to:
- Define “normal” (baseline)
- Detect anomalies (deviation from baseline)
- Investigate causes
- Respond
Example in Go:
type Anomaly struct {
	Metric    string
	Value     float64
	Baseline  float64
	Threshold float64
	Severity  string
}

func detectAnomalies(metrics map[string]float64, baselines map[string]float64) []Anomaly {
	var anomalies []Anomaly
	for metric, value := range metrics {
		baseline := baselines[metric]
		threshold := baseline * 0.2 // 20% deviation threshold
		if math.Abs(value-baseline) > threshold {
			severity := "warning"
			if math.Abs(value-baseline) > baseline*0.5 { // 50% deviation
				severity = "critical"
			}
			anomalies = append(anomalies, Anomaly{
				Metric:    metric,
				Value:     value,
				Baseline:  baseline,
				Threshold: threshold,
				Severity:  severity,
			})
		}
	}
	return anomalies
}
Pattern 2: Trends Over Time
Look at how metrics change over days/weeks/months.
Good trends:
- Error rate trending down (fewer bugs)
- Latency trending down (getting faster)
- Resource utilization trending down (optimization working)
Bad trends:
- Latency trending up (getting slower)
- Error rate trending up (more problems)
- Database query time trending up (queries getting slower)
Example:
type MetricTrend struct {
	Metric       string
	Current      float64
	Previous     float64
	Direction    string // "up", "down", "flat"
	Percentage   float64
	WeekOverWeek bool
}

func calculateTrend(current float64, previous float64) MetricTrend {
	if previous == 0 {
		return MetricTrend{Direction: "flat", Percentage: 0}
	}
	percentage := ((current - previous) / previous) * 100
	direction := "flat"
	if percentage > 5 {
		direction = "up"
	} else if percentage < -5 {
		direction = "down"
	}
	return MetricTrend{
		Current:    current,
		Previous:   previous,
		Direction:  direction,
		Percentage: percentage,
	}
}
Pattern 3: Correlation (But Not Causation!)
When two metrics move together, they’re correlated.
Example:
- CPU usage goes up → Latency goes up
- Cache hit rate goes up → Latency goes down
- Error rate goes up → User complaints go up
Correlation can suggest cause. But don’t assume it.
func calculateCorrelation(x []float64, y []float64) float64 {
	if len(x) != len(y) || len(x) == 0 {
		return 0
	}

	// Calculate means
	meanX := 0.0
	meanY := 0.0
	for i := range x {
		meanX += x[i]
		meanY += y[i]
	}
	meanX /= float64(len(x))
	meanY /= float64(len(y))

	// Calculate correlation
	numerator := 0.0
	denomX := 0.0
	denomY := 0.0
	for i := range x {
		dx := x[i] - meanX
		dy := y[i] - meanY
		numerator += dx * dy
		denomX += dx * dx
		denomY += dy * dy
	}
	if denomX == 0 || denomY == 0 {
		return 0
	}
	return numerator / (math.Sqrt(denomX) * math.Sqrt(denomY))
}

// Usage:
// correlation := calculateCorrelation(latencies, errorRates)
// if correlation > 0.7 {
//     log.Println("Strong correlation between latency and errors")
// }
Chapter 5: Understanding Performance Bottlenecks
Data is useless if you can’t interpret it.
Methodology: The Five Whys
When you see a problem (high latency, high error rate), dig deeper.
Why? → Why? → Why? → Why? → Why? → Root cause
Example:
Problem: Latency p99 is 2000ms (bad)
Why? Database queries are slow (p99: 1500ms)
Why? Table scan happening on large table
Why? Missing index on the query’s WHERE clause
Why? Index never created because requirement was unclear
Why? Communication between teams was poor
Root cause: Communication issue → Process issue → Technical symptom
You can’t fix the latency by optimizing code. You need to:
- Add the index (technical fix)
- Improve communication (process fix)
Pattern Recognition: Common Bottlenecks
Bottleneck Type 1: Database
Symptoms:
- High database CPU
- Slow database queries
- Increasing query latency over time
Analysis:
// Analyze query latencies
type QueryLog struct {
	Query    string
	Duration float64 // milliseconds
}

type QueryAnalysis struct {
	Query   string
	Count   int
	AvgTime float64
	MaxTime float64
	P99Time float64
	Calls   string // "N+1?" "Indexed?" "Full scan?"
}

func analyzeQueryPerformance(slowQueries []QueryLog) {
	// Group durations by query text
	grouped := make(map[string][]float64)
	for _, q := range slowQueries {
		grouped[q.Query] = append(grouped[q.Query], q.Duration)
	}

	// Analyze each query
	for query, durations := range grouped {
		sort.Float64s(durations)
		avg := calculateAverage(durations) // simple mean helper
		p99 := durations[int(float64(len(durations))*0.99)]
		if p99 > 1000 { // anything over 1 second is bad
			log.Printf("Slow query: %s (p99: %dms, avg: %dms, count: %d)",
				query, int(p99), int(avg), len(durations))
		}
	}
}
Solutions:
- Add indexes
- Reduce columns selected
- Fix N+1 queries
- Implement caching
- Denormalize if needed
Bottleneck Type 2: Memory/GC
Symptoms:
- High memory usage
- GC pauses visible in latency
- Out of memory errors
Analysis:
- Check heap allocation
- Profile allocations by type
- Look for leaks
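In Go, a first pass at all three checks can come straight from runtime.MemStats. A sketch; the pprof endpoint is the conventional one and assumes you serve net/http/pprof:

// logMemoryPressure snapshots heap and GC behavior so you can correlate
// GC activity with latency spikes.
func logMemoryPressure() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	log.Printf("heap_alloc=%dMB heap_objects=%d num_gc=%d gc_pause_total=%s",
		m.HeapAlloc/1024/1024,
		m.HeapObjects,
		m.NumGC,
		time.Duration(m.PauseTotalNs))

	// A heap that only ever grows across snapshots is the classic leak signal.
	// For "profile allocations by type", use pprof:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	// (requires importing net/http/pprof and serving its handlers).
}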
Bottleneck Type 3: External Dependencies
Symptoms:
- Your latency is normal, but user-facing latency is high
- Waiting for API calls
- Network latency
Analysis:
// Track external call latencies
type ExternalCallMetric struct {
	Service   string
	Latency   float64
	ErrorRate float64
	Timeout   bool
}

// If external service is slow, your system is slow
// You can't fix their latency, but you can:
// 1. Add caching
// 2. Add timeouts (fail fast)
// 3. Use circuit breaker
// 4. Make calls parallel
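Of those four, timeouts are the cheapest to add and the most often forgotten. A sketch using only the standard library; the one-second budget is an example to tune per dependency:

// callExternal fails fast instead of letting a slow dependency drag your
// latency up. The 1-second budget is illustrative; tune it per dependency.
func callExternal(ctx context.Context, url string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err // includes context.DeadlineExceeded on timeout
	}
	defer resp.Body.Close()

	return io.ReadAll(resp.Body)
}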
Chapter 6: A/B Testing and Experimentation
The gold standard of data-driven decisions: controlled experiments.
When to A/B Test
Use A/B tests when:
- You’re unsure which approach is better
- You want to measure user impact
- You want to isolate cause from confounds
Example scenarios:
- New algorithm vs. old algorithm
- UI layout A vs. layout B
- Caching strategy 1 vs. strategy 2
- Database X vs. database Y
Running a Simple Experiment
Step 1: Define Hypothesis
Not: “This is better”
But: “Switching to Redis caching will reduce p99 latency by at least 20% for user search”
Step 2: Design Experiment
- Control group: Old system (Redis caching off)
- Treatment group: New system (Redis caching on)
- Sample: 50% of traffic each
- Duration: 1 week (enough data)
- Metric: p99 latency for user search
Step 3: Run and Collect Data
type Experiment struct {
	Name      string
	Control   []float64 // Latencies for control group
	Treatment []float64 // Latencies for treatment group
	Duration  time.Duration
}

func recordExperiment(w http.ResponseWriter, r *http.Request, exp *Experiment) {
	start := time.Now()

	// Route to control or treatment (50/50 split)
	inTreatment := rand.Float64() < 0.5
	if inTreatment {
		// Treatment: new behavior
		handleWithCache(w, r)
	} else {
		// Control: old behavior
		handleWithoutCache(w, r)
	}

	// Record the latency in the appropriate group.
	// (In real code, guard these appends with a mutex or push to a channel.)
	latency := time.Since(start).Seconds()
	if inTreatment {
		exp.Treatment = append(exp.Treatment, latency)
	} else {
		exp.Control = append(exp.Control, latency)
	}
}
Step 4: Analyze Results
func analyzeExperiment(exp *Experiment) {
	controlAvg := calculateAverage(exp.Control)
	treatmentAvg := calculateAverage(exp.Treatment)
	improvement := ((controlAvg - treatmentAvg) / controlAvg) * 100

	controlP99 := calculatePercentile(exp.Control, 0.99)
	treatmentP99 := calculatePercentile(exp.Treatment, 0.99)
	p99Improvement := ((controlP99 - treatmentP99) / controlP99) * 100

	// Is this statistically significant?
	pValue := calculatePValue(exp.Control, exp.Treatment)

	fmt.Printf("Average improvement: %.2f%%\n", improvement)
	fmt.Printf("P99 improvement: %.2f%%\n", p99Improvement)
	fmt.Printf("P-value: %.4f (< 0.05 means significant)\n", pValue)

	if pValue < 0.05 && p99Improvement > 20 {
		fmt.Println("✓ Hypothesis confirmed. Deploy treatment.")
	} else {
		fmt.Println("✗ Not enough evidence. Keep investigating.")
	}
}
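calculatePercentile and calculatePValue are left abstract above. Here is one minimal way to sketch them, assuming each group is large enough (roughly 30+ samples) for a normal-approximation z-test; for small samples or high-stakes calls, use a real stats package such as gonum:

// calculatePercentile returns the nearest-rank percentile (p between 0 and 1)
// of a sample. Assumes the slice is non-empty.
func calculatePercentile(values []float64, p float64) float64 {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	idx := int(float64(len(sorted)) * p)
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

// calculatePValue approximates a two-sided p-value for the difference in means
// using a z-test (reasonable for large samples). It is not a substitute for a
// proper Welch's t-test on small samples.
func calculatePValue(control, treatment []float64) float64 {
	meanC, varC := meanAndVariance(control)
	meanT, varT := meanAndVariance(treatment)
	se := math.Sqrt(varC/float64(len(control)) + varT/float64(len(treatment)))
	if se == 0 {
		return 1
	}
	z := math.Abs(meanC-meanT) / se
	// Two-sided p-value from the standard normal CDF.
	return 2 * (1 - 0.5*(1+math.Erf(z/math.Sqrt2)))
}

func meanAndVariance(values []float64) (mean, variance float64) {
	for _, v := range values {
		mean += v
	}
	mean /= float64(len(values))
	for _, v := range values {
		variance += (v - mean) * (v - mean)
	}
	variance /= float64(len(values) - 1) // sample variance
	return mean, variance
}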
Chapter 7: Making Technical Decisions from Data
Now you have data. How do you decide?
The Decision Framework
Step 1: Collect Evidence
- Metrics before/after
- Profiling data
- Logs from incidents
- User reports
- Business impact
Step 2: Quantify the Trade-off
Every decision has a cost and a benefit.
Example: Migrate to Go from Python
Costs:
- 200 engineering hours
- Relearning (team ramp up time)
- Risk of bugs during migration
- Operational complexity change
Benefits:
- 50% latency reduction (saves infrastructure costs)
- Better scalability (can handle 10x traffic)
- Faster deployments (fewer runtime errors)
Data:
Current Python system costs:
- Infrastructure: $50k/month
- Operations time: 40 hours/week
With Go:
- Infrastructure: $25k/month
- Operations time: 20 hours/week
Payback period:
- Migration cost: 200 hours × $150/hour = $30k
- Monthly savings: ($50k - $25k) + (20 hours/week × $150/hour × 4) = $25k + $12k = $37k
- Payback: ~1 month
Decision: High ROI. Do it.
Step 3: Identify Uncertainties
What don’t you know?
- Will the Go migration actually save that much?
- What if there are unexpected bugs?
- What if the team takes longer to learn Go?
Quantify uncertainty:
type Decision struct {
	Option        string
	Benefit       float64
	Cost          float64
	Uncertainty   float64 // 0.1 = 10% uncertainty
	ExpectedValue float64
}

// Expected value accounts for uncertainty
// EV = (Benefit - Cost) × (1 - Uncertainty)
decisions := []Decision{
	{
		Option:        "Migrate to Go",
		Benefit:       37000,
		Cost:          30000,
		Uncertainty:   0.3, // 30% chance we're wrong
		ExpectedValue: (37000 - 30000) * (1 - 0.3), // = 4900
	},
	{
		Option:        "Optimize Python",
		Benefit:       10000,
		Cost:          5000,
		Uncertainty:   0.1, // We're pretty sure
		ExpectedValue: (10000 - 5000) * (1 - 0.1), // = 4500
	},
}

// Go migration has higher expected value despite higher uncertainty
Step 4: Decide
Based on data, which option has the best expected value?
Chapter 8: Common Analysis Mistakes
Where engineers go wrong.
Mistake 1: Cherry-Picking Data
You want Go to be faster. So you benchmark the hot path. You show Go is 2x faster on that specific code.
What you’re hiding:
- 99% of your code isn’t on the hot path
- Overall system might only be 5% faster
- Startup time might be slower
- Compilation adds complexity
Solution: Measure the full system. Measure end-to-end.
Mistake 2: Ignoring Variance
You measure latency once. 95ms. Great.
But latency has natural variance. Sometimes 50ms, sometimes 200ms.
You need:
- Multiple measurements (100+)
- Percentiles (p50, p95, p99)
- Standard deviation (how much variation)
type LatencyStats struct {
	Count  int
	Mean   float64
	StdDev float64
	P50    float64
	P95    float64
	P99    float64
	Min    float64
	Max    float64
}

func analyzeLatencies(latencies []float64) LatencyStats {
	sort.Float64s(latencies)
	mean := calculateMean(latencies)
	stdDev := calculateStdDev(latencies, mean)
	return LatencyStats{
		Count:  len(latencies),
		Mean:   mean,
		StdDev: stdDev,
		P50:    latencies[int(float64(len(latencies))*0.50)],
		P95:    latencies[int(float64(len(latencies))*0.95)],
		P99:    latencies[int(float64(len(latencies))*0.99)],
		Min:    latencies[0],
		Max:    latencies[len(latencies)-1],
	}
}
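For completeness, minimal sketches of the two helpers used above (sample standard deviation, assuming at least two measurements):

// calculateMean returns the arithmetic mean of a non-empty sample.
func calculateMean(values []float64) float64 {
	sum := 0.0
	for _, v := range values {
		sum += v
	}
	return sum / float64(len(values))
}

// calculateStdDev returns the sample standard deviation around a known mean.
// Assumes len(values) >= 2.
func calculateStdDev(values []float64, mean float64) float64 {
	sumSquares := 0.0
	for _, v := range values {
		sumSquares += (v - mean) * (v - mean)
	}
	return math.Sqrt(sumSquares / float64(len(values)-1))
}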
Mistake 3: Confusing Correlation with Causation
You add Redis caching. Latency drops.
Conclusion: Caching works.
But what if:
- Traffic decreased that day
- A CPU bug fix was deployed separately
- Database query optimizer improved
- It’s just random variation
Solution: Control experiments. Isolate variables.
Mistake 4: Not Measuring for Long Enough
You measure for 1 hour and see improvement.
But:
- Time-of-day effects (peak hours vs. off-peak)
- Day-of-week effects (Mondays are different)
- Cache warm-up effects (first run is different)
- Seasonal effects
Solution: Measure for at least 1 week. Better: 1 month.
Mistake 5: Not Accounting for Load
Your system is fast with 100 requests/second.
What about 10,000 requests/second?
Performance isn’t a constant. It depends on load.
Solution: Load test different scenarios. Measure at production load.
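A first load test doesn't need a dedicated tool; a small Go program can drive concurrent requests and hand the latencies to analyzeLatencies above. A rough sketch (URL, worker count, and request count are placeholders):

// loadTest fires `total` GET requests at `url` from `workers` goroutines and
// returns the observed latencies in milliseconds. Real load tests should also
// ramp up gradually and mirror the production traffic mix.
func loadTest(url string, workers, total int) []float64 {
	jobs := make(chan struct{}, total)
	for i := 0; i < total; i++ {
		jobs <- struct{}{}
	}
	close(jobs)

	var mu sync.Mutex
	latencies := make([]float64, 0, total)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				start := time.Now()
				resp, err := http.Get(url)
				if err == nil {
					io.Copy(io.Discard, resp.Body)
					resp.Body.Close()
				}
				mu.Lock()
				latencies = append(latencies, time.Since(start).Seconds()*1000)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return latencies
}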
Chapter 9: Building Data Analysis Into Your Workflow
Data only matters if you use it.
Pattern 1: Automatic Alerting
Don’t wait to analyze data. Alert when anomalies happen.
type AlertRule struct {
	Metric    string
	Threshold float64
	Condition string // "above", "below"
	Duration  time.Duration
	AlertTo   string // Slack channel
}

func monitorMetrics(metrics map[string]float64, rules []AlertRule) {
	for _, rule := range rules {
		value := metrics[rule.Metric]

		var triggered bool
		if rule.Condition == "above" {
			triggered = value > rule.Threshold
		} else {
			triggered = value < rule.Threshold
		}

		if triggered {
			sendAlert(fmt.Sprintf("🚨 %s is %v (threshold: %v)",
				rule.Metric, value, rule.Threshold),
				rule.AlertTo)
		}
	}
}
Pattern 2: Weekly Analytics Reports
Summarize the week. Show trends. Highlight anomalies.
type WeeklyReport struct {
	Week            string
	MetricsTrends   map[string]MetricTrend
	Anomalies       []Anomaly
	Incidents       []Incident
	Decisions       []Decision
	Recommendations []string
}

func generateWeeklyReport() WeeklyReport {
	var report WeeklyReport
	// Collect the week's data
	// Calculate trends
	// Identify anomalies
	// Generate insights
	// Create recommendations
	// Send to the team via email/Slack
	return report
}
Pattern 3: Experimentation Culture
Make A/B testing part of your standard process.
When proposing a change:
- Write hypothesis
- Design experiment
- Run for 1 week
- Analyze results
- Decide based on data
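One practical detail when experiments become routine: assign users to a variant deterministically (a hash of the user ID) instead of flipping a coin per request, so the same user always gets the same behavior and your groups stay clean. A sketch, with the 50/50 split as an assumption:

// assignVariant deterministically buckets a user into "control" or "treatment"
// so repeat requests from the same user land in the same group.
func assignVariant(userID, experiment string) string {
	h := fnv.New32a()
	h.Write([]byte(experiment + ":" + userID))
	if h.Sum32()%100 < 50 { // 50/50 split
		return "control"
	}
	return "treatment"
}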
Chapter 10: Real-World Example: Optimizing Database Performance
Let’s walk through a complete example.
The Problem
Your API latency p99 is 500ms. Too high. Users are unhappy.
Step 1: Collect Data
// Measure latency by endpoint
type EndpointMetrics struct {
Endpoint string
P99 float64
P95 float64
P50 float64
ErrorRate float64
QueryTime float64
}
func gatherMetrics() []EndpointMetrics {
return []EndpointMetrics{
{Endpoint: "/api/users", P99: 800, P95: 200, P50: 50, ErrorRate: 0.01, QueryTime: 750},
{Endpoint: "/api/posts", P99: 300, P95: 150, P50: 40, ErrorRate: 0.005, QueryTime: 50},
{Endpoint: "/api/search", P99: 2000, P95: 800, P50: 100, ErrorRate: 0.02, QueryTime: 1900},
}
}
Step 2: Identify the Problem
// Bottleneck: Database queries are slow
// Worst: /api/search (1900ms in database!)
// Root cause analysis:
// 1. Check database CPU → High (80%)
// 2. Check slow query log → Found it
// 3. Query: SELECT * FROM posts WHERE user_id = ? ORDER BY created_at DESC LIMIT 10
// This is a full table scan. No index on (user_id, created_at).
Step 3: Propose Solution
Add index: CREATE INDEX idx_posts_user_created ON posts(user_id, created_at DESC)
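Before and after adding it, confirm what the planner actually does. A sketch using database/sql against PostgreSQL; the query comes from the example above, and EXPLAIN output format varies by database and driver:

// explainQuery prints the database's query plan so you can verify the new
// index is actually used (look for an index scan instead of a sequential scan).
func explainQuery(db *sql.DB) error {
	rows, err := db.Query(
		"EXPLAIN ANALYZE SELECT * FROM posts WHERE user_id = $1 ORDER BY created_at DESC LIMIT 10",
		42, // example user_id
	)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			return err
		}
		fmt.Println(line)
	}
	return rows.Err()
}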
Step 4: Experiment
Control group (50%): Old system (no index)
Treatment group (50%): New system (with index)
Run for 1 week.
Step 5: Analyze Results
// Before: p99 latency for /api/search = 2000ms
// After: p99 latency for /api/search = 150ms
// That's 93% improvement!
// Database CPU drops from 80% to 40%
// Statistical significance: p-value = 0.001 (highly significant)
// Decision: ✓ Deploy the index
ROI Calculation
Cost:
- 2 hours to analyze: $300
- 1 hour to add index: $150
- Testing: 1 hour: $150
- Total: $600
Benefits:
- Infrastructure savings: 40% fewer database resources = $5k/month
- Better UX: reduced latency = retention improvement
- On-call stress reduced: fewer incidents
Payback period: a few days ($600 of effort against roughly $5k/month in savings)
ROI: Enormous. Do it.
Chapter 11: Avoiding Analysis Paralysis
A common problem: overthinking.
You want to make the “perfect” decision. So you analyze forever.
Meanwhile, competitors move fast.
How to Decide When You Have Enough Data
Rule 1: 70% confidence is enough
You don’t need 100% certainty. You need enough evidence to act.
70% confidence = “This is probably right” = good enough
95% confidence = “This is almost certainly right” = often time you didn’t need to spend
Rule 2: Time-box your analysis
Don’t analyze for weeks. Analyze for days.
- Day 1: Collect data
- Day 2: Initial analysis
- Day 3: Deeper analysis and decision
If you haven’t decided by day 3, you’re overthinking.
Rule 3: Iterative decisions
You don’t have to make the perfect decision once.
You can make a good decision now, monitor the outcome, and adjust.
Iteration beats perfection.
Chapter 12: Building a Data-Driven Culture
This isn’t just about you. It’s about your team.
Creating a Culture of Evidence
Step 1: Make decisions transparent
When you make a decision, show your data:
“We chose PostgreSQL over MongoDB because:”
- Transaction support (needed for payments)
- Latency benchmarks show 20% better performance for our query patterns
- Operational complexity lower (team familiar with it)
Step 2: Celebrate evidence-based wins
When an experiment validates a hypothesis, celebrate it.
“We predicted caching would reduce latency by 20%. It reduced it by 25%. Nice work!”
Step 3: Learn from wrong predictions
When you’re wrong, don’t hide it.
“We thought this optimization would help. It didn’t. Here’s why we were wrong. Here’s what we learned.”
Step 4: Make analysis easy
Provide tools and dashboards so anyone can look at data:
- Prometheus dashboards
- Grafana dashboards
- SQL queries for ad-hoc analysis
Appendix A: Common Metrics Reference
Application Metrics:
- Request latency (p50, p95, p99)
- Error rate (5xx errors / total requests)
- Throughput (requests/second)
- Cache hit rate (cache hits / total requests)
Database Metrics:
- Query latency (p50, p95, p99)
- Slow queries (> 1 second)
- Connections used / max
- Query per second
Infrastructure Metrics:
- CPU usage (%)
- Memory usage (%)
- Disk I/O (MB/s)
- Network I/O (MB/s)
Business Metrics:
- Daily Active Users (DAU)
- Monthly Active Users (MAU)
- User retention rate
- Conversion rate
- Revenue per user
Appendix B: Statistical Concepts Quick Reference
Mean (Average): Sum all values / count
Median (P50): Middle value when sorted
Percentile (P95): 95% of values are below this
Standard Deviation: How spread out values are
Correlation: -1 to 1. How two metrics move together
P-value: Probability of seeing a difference at least this large if there were no real effect; below 0.05 is the conventional threshold for statistical significance
Confidence Interval: Range where true value likely falls
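For that last one, a compact sketch using the normal approximation (1.96 is the z-value for a 95% interval; it reuses the calculateMean and calculateStdDev helpers from Chapter 8):

// confidenceInterval95 returns an approximate 95% confidence interval for the
// mean of a sample, using the normal approximation (fine for large samples).
func confidenceInterval95(values []float64) (low, high float64) {
	mean := calculateMean(values)
	stdErr := calculateStdDev(values, mean) / math.Sqrt(float64(len(values)))
	margin := 1.96 * stdErr
	return mean - margin, mean + margin
}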
Appendix C: Data Analysis Checklist
Before making a technical decision:
- Define specific, measurable hypothesis
- Identify confounding variables
- Collect sufficient data (1+ week)
- Analyze distribution, not just averages
- Look at percentiles (p50, p95, p99)
- Check for anomalies/outliers
- Calculate statistical significance
- Consider alternative explanations
- Quantify business impact (ROI)
- Document assumptions and uncertainties
- Share findings with team
- Implement monitoring post-decision
- Review results after deployment
Appendix D: Tools for Analysis
Monitoring & Visualization:
- Prometheus (metrics collection)
- Grafana (dashboards)
- Datadog (hosted monitoring)
- ELK Stack (logs)
Analysis & Statistics:
- SQL (queries)
- Python + Pandas (analysis)
- R (statistics)
- Go + gonum (statistics package)
Experimentation:
- Custom A/B testing framework
- Statsig (feature flags + experimentation)
- LaunchDarkly (feature flags)
Appendix E: Further Reading
Recommended books:
- “Thinking, Fast and Slow” - Daniel Kahneman (decision-making psychology)
- “Lean Analytics” - Alistair Croll and Benjamin Yoskovitz (startup metrics)
- “The Drunkard’s Walk” - Leonard Mlodinow (probability and statistics)
Conclusion: Data is Your Competitive Advantage
Most backend engineers don’t think analytically. They ship features based on hunches.
You can be different.
By learning to collect the right metrics, analyze data rigorously, and make decisions backed by evidence, you become someone who makes better technical decisions.
Your system is more reliable. Your decisions have higher ROI. Your team trusts your judgment.
That’s competitive advantage.
Start small. Pick one metric. Start measuring. Start analyzing. Start deciding based on evidence.
The data is waiting. Go find it.