# DevOps Metrics and KPIs Scenarios
Summary: Teams gamed their DORA metrics, so the numbers improved while customer satisfaction and business outcomes did not.
What Happened:
After implementing DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Time to Restore Service), management noticed a dramatic improvement in the metrics across all teams. However, there was no corresponding improvement in customer satisfaction or business outcomes.
Diagnosis Steps:
Analyzed the raw data behind the metrics calculations.
Interviewed team members about their development and deployment processes.
Compared metrics with actual business outcomes.
Reviewed changes in development practices since metrics implementation.
Examined the definition and implementation of each metric.
Root Cause:
Teams had found ways to game the metrics without improving actual performance:
1. Deployment Frequency was increased by deploying tiny, inconsequential changes.
2. Lead Time was artificially reduced by breaking work into smaller tickets after development was complete.
3. Change Failure Rate was manipulated by not reporting certain types of failures.
4. Time to Restore Service was improved by marking incidents as resolved before they were fully fixed.
Fix/Workaround:
• Short-term: Implemented more rigorous metric definitions:
# Metric definitions YAML
metrics:
  deployment_frequency:
    definition: "Number of successful deployments to production per day"
    measurement:
      - Count only deployments that deliver actual user value
      - Minimum change size of 100 lines of code or equivalent
      - Must pass all quality gates
    gaming_prevention:
      - Random audits of deployments to verify value delivery
      - Correlation with feature flags or A/B test activations
  lead_time:
    definition: "Time from code commit to successful deployment in production"
    measurement:
      - "Start time: First commit related to a user story"
      - "End time: Deployment to production with feature active"
      - Must include all related pull requests
    gaming_prevention:
      - Track story points or complexity to detect work fragmentation
      - Verify that story breakdown occurs during planning, not execution
  change_failure_rate:
    definition: "Percentage of deployments causing a failure in production"
    measurement:
      - Include all incidents requiring remediation
      - Include degraded service even if not a complete outage
      - Count by deployment, not by individual failure
    gaming_prevention:
      - Automated detection of service degradation
      - Customer-reported issues tracked and correlated with deployments
  time_to_restore:
    definition: "Time from failure detection to service restoration"
    measurement:
      - "Start time: First alert or customer report"
      - "End time: Full resolution, not just mitigation"
      - Must include verification of fix
    gaming_prevention:
      - Require evidence of resolution
      - Track recurrence of similar incidents
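The gaming-prevention check on story breakdown timing can be approximated by comparing each ticket's creation time with the first commit that references it. The sketch below is illustrative only; the StoryTicket shape and its fields are assumptions, not an existing API.
// fragmentation_check.ts (illustrative sketch, not part of the metric definitions above)
interface StoryTicket {
  id: string;
  createdAt: Date;      // when the ticket was created
  firstCommitAt: Date;  // first commit that references the ticket
  storyPoints: number;
}

// Tickets created after coding had already started, or with suspiciously little scope,
// are candidates for a manual work-fragmentation audit.
function findFragmentationCandidates(tickets: StoryTicket[], minStoryPoints = 1): StoryTicket[] {
  return tickets.filter(
    (t) => t.createdAt > t.firstCommitAt || t.storyPoints < minStoryPoints
  );
}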
• Long-term: Developed a balanced scorecard approach:
// DevOps Balanced Scorecard Implementation
interface MetricDefinition {
name: string;
description: string;
calculation: string;
dataSource: string;
owner: string;
targetValue: number;
minValue: number;
maxValue: number;
weight: number;
gameability: 'low' | 'medium' | 'high';
countermeasures: string[];
}
interface BalancedScorecard {
technicalMetrics: MetricDefinition[];
processMetrics: MetricDefinition[];
businessMetrics: MetricDefinition[];
cultureMetrics: MetricDefinition[];
}
const devOpsScorecard: BalancedScorecard = {
technicalMetrics: [
{
name: 'Deployment Frequency',
description: 'How often code is deployed to production',
calculation: 'Count of deployments per day',
dataSource: 'CI/CD Pipeline',
owner: 'DevOps Team',
targetValue: 3,
minValue: 0,
maxValue: 10,
weight: 0.15,
gameability: 'high',
countermeasures: [
'Minimum change size requirements',
'Value delivery validation'
]
},
{
name: 'Test Coverage',
description: 'Percentage of code covered by automated tests',
calculation: 'Lines covered / Total lines',
dataSource: 'Test Coverage Tool',
owner: 'QA Team',
targetValue: 80,
minValue: 0,
maxValue: 100,
weight: 0.1,
gameability: 'medium',
countermeasures: [
'Quality gate for meaningful tests',
'Mutation testing validation'
]
},
// Additional technical metrics...
],
processMetrics: [
{
name: 'Lead Time for Changes',
description: 'Time from commit to production',
calculation: 'Median time in hours',
dataSource: 'Version Control + CI/CD',
owner: 'Engineering Manager',
targetValue: 24,
minValue: 0,
maxValue: 168,
weight: 0.15,
gameability: 'high',
countermeasures: [
'Track from first commit of user story',
'Verify story breakdown timing'
]
},
// Additional process metrics...
],
businessMetrics: [
{
name: 'Feature Usage',
description: 'Percentage of new features actively used',
calculation: 'Features with >10% adoption / Total features',
dataSource: 'Product Analytics',
owner: 'Product Manager',
targetValue: 75,
minValue: 0,
maxValue: 100,
weight: 0.2,
gameability: 'low',
countermeasures: [
'Direct measurement from user analytics',
'Correlation with business outcomes'
]
},
// Additional business metrics...
],
cultureMetrics: [
{
name: 'Psychological Safety',
description: 'Team members feel safe to take risks',
calculation: 'Survey score (1-5 scale)',
dataSource: 'Quarterly Survey',
owner: 'HR',
targetValue: 4.5,
minValue: 1,
maxValue: 5,
weight: 0.1,
gameability: 'medium',
countermeasures: [
'Anonymous surveys',
'External validation'
]
},
// Additional culture metrics...
]
};
function calculateScore(scorecard: BalancedScorecard, actualValues: Record<string, number>): number {
let totalScore = 0;
let totalWeight = 0;
// Calculate technical metrics score
for (const metric of scorecard.technicalMetrics) {
const actualValue = actualValues[metric.name] || 0;
const normalizedValue = Math.min(Math.max((actualValue - metric.minValue) / (metric.maxValue - metric.minValue), 0), 1);
totalScore += normalizedValue * metric.weight;
totalWeight += metric.weight;
}
// Calculate other metric categories...
// Similar implementation for process, business, and culture metrics
return (totalScore / totalWeight) * 100;
}
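For illustration, a hypothetical evaluation of the scorecard might look like the following; the metric values are invented, and only devOpsScorecard and calculateScore come from the code above.
// Hypothetical usage of the balanced scorecard (values are invented)
const actuals: Record<string, number> = {
  'Deployment Frequency': 4,
  'Test Coverage': 72,
  'Lead Time for Changes': 36,
  'Feature Usage': 68,
  'Psychological Safety': 4.1
};

const overallScore = calculateScore(devOpsScorecard, actuals);
console.log(`Overall DevOps score: ${overallScore.toFixed(1)} / 100`);
Note that calculateScore, as written, assumes a higher value is always better; a metric such as Lead Time for Changes would need an inverted normalization before it can contribute meaningfully.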
• Implemented a data quality monitoring system:
# metrics_quality_monitor.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import logging
from datetime import datetime, timedelta


class MetricsQualityMonitor:
    def __init__(self, metrics_data_source):
        self.data_source = metrics_data_source
        self.logger = self._setup_logging()

    def _setup_logging(self):
        logger = logging.getLogger("metrics_quality")
        logger.setLevel(logging.INFO)
        handler = logging.FileHandler("metrics_quality.log")
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        return logger

    def detect_anomalies(self, metric_name, lookback_days=30, z_threshold=3.0):
        """Detect statistical anomalies in metrics data"""
        # Get historical data
        end_date = datetime.now()
        start_date = end_date - timedelta(days=lookback_days)
        df = self.data_source.get_metric_data(metric_name, start_date, end_date)
        # Calculate z-scores
        mean = df['value'].mean()
        std = df['value'].std()
        df['z_score'] = (df['value'] - mean) / std if std > 0 else 0
        # Identify anomalies
        anomalies = df[abs(df['z_score']) > z_threshold]
        if not anomalies.empty:
            self.logger.warning(f"Detected {len(anomalies)} anomalies in {metric_name}")
            for _, row in anomalies.iterrows():
                self.logger.warning(f"Anomaly on {row['date']}: value={row['value']}, z-score={row['z_score']:.2f}")
        return anomalies

    def detect_sudden_improvements(self, metric_name, window_size=7, improvement_threshold=0.5):
        """Detect suspiciously rapid improvements in metrics"""
        # Get recent data
        end_date = datetime.now()
        start_date = end_date - timedelta(days=window_size * 2)
        df = self.data_source.get_metric_data(metric_name, start_date, end_date)
        # Calculate rolling averages
        df['rolling_avg'] = df['value'].rolling(window=window_size).mean()
        # Skip rows with NaN rolling averages
        df = df.dropna()
        # Calculate percent change
        df['pct_change'] = df['rolling_avg'].pct_change()
        # Identify suspicious improvements
        if metric_name in ['lead_time', 'time_to_restore', 'change_failure_rate']:
            # For these metrics, improvement is a decrease
            suspicious = df[df['pct_change'] < -improvement_threshold]
        else:
            # For other metrics, improvement is an increase
            suspicious = df[df['pct_change'] > improvement_threshold]
        if not suspicious.empty:
            self.logger.warning(f"Detected {len(suspicious)} suspicious improvements in {metric_name}")
            for _, row in suspicious.iterrows():
                self.logger.warning(f"Suspicious improvement on {row['date']}: change={row['pct_change']:.2f}")
        return suspicious

    def correlation_analysis(self, technical_metric, business_metric, lookback_days=90):
        """Analyze correlation between technical and business metrics"""
        # Get historical data
        end_date = datetime.now()
        start_date = end_date - timedelta(days=lookback_days)
        tech_df = self.data_source.get_metric_data(technical_metric, start_date, end_date)
        biz_df = self.data_source.get_metric_data(business_metric, start_date, end_date)
        # Merge on date
        merged_df = pd.merge(tech_df, biz_df, on='date', suffixes=('_tech', '_biz'))
        # Calculate correlation
        correlation, p_value = stats.pearsonr(merged_df['value_tech'], merged_df['value_biz'])
        self.logger.info(f"Correlation between {technical_metric} and {business_metric}: {correlation:.2f} (p={p_value:.4f})")
        # Check for weak correlation
        if abs(correlation) < 0.3 or p_value > 0.05:
            self.logger.warning(f"Weak or insignificant correlation between {technical_metric} and {business_metric}")
        return correlation, p_value, merged_df

    def generate_report(self):
        """Generate a comprehensive metrics quality report"""
        # Implementation details for report generation
        pass


# Example usage
if __name__ == "__main__":
    from metrics_data_source import MetricsDataSource  # Hypothetical data source

    data_source = MetricsDataSource()
    monitor = MetricsQualityMonitor(data_source)

    # Check for anomalies in key metrics
    for metric in ['deployment_frequency', 'lead_time', 'change_failure_rate', 'time_to_restore']:
        monitor.detect_anomalies(metric)
        monitor.detect_sudden_improvements(metric)

    # Check correlation with business metrics
    monitor.correlation_analysis('deployment_frequency', 'customer_satisfaction')
    monitor.correlation_analysis('lead_time', 'feature_adoption_rate')

    # Generate report
    monitor.generate_report()
Lessons Learned:
Metrics can drive behavior, but not always in the intended direction.
How to Avoid:
Implement a balanced set of metrics that are harder to game.
Correlate technical metrics with business outcomes.
Regularly audit the data behind the metrics.
Focus on trends rather than absolute values (see the sketch after this list).
Create a culture of continuous improvement rather than metric targets.
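The trend-focused guideline can be made concrete by comparing the current period's average with the previous period's and reporting the relative change instead of a single point value. A minimal sketch, using an invented data series:
// trend_check.ts (illustrative sketch)
function trendChange(values: number[], windowSize: number): number {
  if (values.length < 2 * windowSize) {
    throw new Error('Not enough data points for a trend comparison');
  }
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const recent = mean(values.slice(-windowSize));
  const previous = mean(values.slice(-2 * windowSize, -windowSize));
  return (recent - previous) / previous; // relative change, e.g. 0.12 means +12%
}

// Example: report the trend of weekly deployment counts, not the latest value.
const weeklyDeployments = [12, 14, 13, 15, 18, 17, 19, 21];
console.log(`Deployment trend: ${(trendChange(weeklyDeployments, 4) * 100).toFixed(0)}%`);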
Summary: Teams optimized their services for local performance metrics, and overall system performance and user experience deteriorated.
What Happened:
After implementing a new performance dashboard, teams began optimizing their services to improve their specific metrics. However, the overall system performance and user experience deteriorated despite individual metrics showing improvement.
Diagnosis Steps:
Analyzed the correlation between different performance metrics.
Reviewed recent optimization changes made by teams.
Collected user experience data and compared with technical metrics.
Examined the incentive structure around performance metrics.
Conducted end-to-end performance testing.
Root Cause:
Teams were optimizing for isolated metrics without understanding the system-wide impact. For example:
1. The checkout service team optimized for CPU utilization by batching requests, which improved their resource metrics but increased end-to-end latency.
2. The product catalog team optimized for response time by caching aggressively, which improved their latency metrics but increased memory usage and caused cache invalidation issues.
3. The recommendation engine team optimized for algorithm accuracy, which improved their relevance metrics but significantly increased computational load and latency.
Fix/Workaround:
• Short-term: Implemented a holistic performance testing framework:
// performance_test.go
package performance
import (
"context"
"fmt"
"testing"
"time"
"github.com/stretchr/testify/assert"
)
// ServiceMetrics represents the metrics for a single service
type ServiceMetrics struct {
ResponseTime time.Duration
Throughput float64
ErrorRate float64
CPUUtilization float64
MemoryUsage float64
DependencyCalls int
}
// SystemMetrics represents the metrics for the entire system
type SystemMetrics struct {
EndToEndLatency time.Duration
TotalResourceUsage float64
UserExperienceScore float64
BusinessTransactions float64
ServiceMetrics map[string]ServiceMetrics
CrossServiceLatencies map[string]map[string]time.Duration
}
// PerformanceTest runs a comprehensive performance test
func PerformanceTest(t *testing.T, testCase string, load int, duration time.Duration) {
// Setup test environment
ctx, cancel := context.WithTimeout(context.Background(), duration)
defer cancel()
// Run the test
metrics, err := runLoadTest(ctx, testCase, load)
assert.NoError(t, err, "Load test should complete without errors")
// Verify individual service metrics
for service, serviceMetrics := range metrics.ServiceMetrics {
assert.Less(t, serviceMetrics.ResponseTime, getThreshold(service, "response_time"),
fmt.Sprintf("%s response time exceeds threshold", service))
assert.Greater(t, serviceMetrics.Throughput, getThreshold(service, "throughput"),
fmt.Sprintf("%s throughput below threshold", service))
assert.Less(t, serviceMetrics.ErrorRate, getThreshold(service, "error_rate"),
fmt.Sprintf("%s error rate exceeds threshold", service))
assert.Less(t, serviceMetrics.CPUUtilization, getThreshold(service, "cpu_utilization"),
fmt.Sprintf("%s CPU utilization exceeds threshold", service))
assert.Less(t, serviceMetrics.MemoryUsage, getThreshold(service, "memory_usage"),
fmt.Sprintf("%s memory usage exceeds threshold", service))
}
// Verify system-wide metrics
assert.Less(t, metrics.EndToEndLatency, getSystemThreshold("end_to_end_latency"),
"End-to-end latency exceeds threshold")
assert.Less(t, metrics.TotalResourceUsage, getSystemThreshold("total_resource_usage"),
"Total resource usage exceeds threshold")
assert.Greater(t, metrics.UserExperienceScore, getSystemThreshold("user_experience_score"),
"User experience score below threshold")
assert.Greater(t, metrics.BusinessTransactions, getSystemThreshold("business_transactions"),
"Business transaction rate below threshold")
// Verify cross-service latencies
for source, destinations := range metrics.CrossServiceLatencies {
for destination, latency := range destinations {
assert.Less(t, latency, getCrossServiceThreshold(source, destination),
fmt.Sprintf("Latency from %s to %s exceeds threshold", source, destination))
}
}
}
func runLoadTest(ctx context.Context, testCase string, load int) (SystemMetrics, error) {
// Implementation of load test runner
// ...
return SystemMetrics{}, nil
}
func getThreshold(service, metric string) float64 {
// Get threshold from configuration
// ...
return 0
}
func getSystemThreshold(metric string) float64 {
// Get system-wide threshold from configuration
// ...
return 0
}
func getCrossServiceThreshold(source, destination string) time.Duration {
// Get cross-service latency threshold from configuration
// ...
return 0
}
• Long-term: Developed a balanced metrics framework:
// metrics_framework.rs
use std::collections::HashMap;
use std::time::{Duration, Instant};
// Define metric types
pub enum MetricType {
Counter,
Gauge,
Histogram,
Summary,
}
// Define metric dimensions
pub struct MetricDimension {
pub name: String,
pub value: String,
}
// Define a metric
pub struct Metric {
pub name: String,
pub description: String,
pub metric_type: MetricType,
pub dimensions: Vec<MetricDimension>,
pub value: f64,
pub timestamp: Instant,
}
// Define a metric group
pub struct MetricGroup {
pub name: String,
pub metrics: Vec<Metric>,
pub weight: f64,
}
// Define the balanced scorecard
pub struct BalancedScorecard {
pub service_name: String,
pub metric_groups: HashMap<String, MetricGroup>,
pub dependencies: Vec<String>,
pub overall_score: f64,
}
impl BalancedScorecard {
pub fn new(service_name: &str) -> Self {
BalancedScorecard {
service_name: service_name.to_string(),
metric_groups: HashMap::new(),
dependencies: Vec::new(),
overall_score: 0.0,
}
}
pub fn add_metric_group(&mut self, name: &str, weight: f64) {
self.metric_groups.insert(name.to_string(), MetricGroup {
name: name.to_string(),
metrics: Vec::new(),
weight,
});
}
pub fn add_metric(&mut self, group_name: &str, metric: Metric) {
if let Some(group) = self.metric_groups.get_mut(group_name) {
group.metrics.push(metric);
}
}
pub fn add_dependency(&mut self, dependency: &str) {
self.dependencies.push(dependency.to_string());
}
pub fn calculate_score(&mut self) -> f64 {
let mut total_score = 0.0;
let mut total_weight = 0.0;
for (_, group) in &self.metric_groups {
let group_score = self.calculate_group_score(group);
total_score += group_score * group.weight;
total_weight += group.weight;
}
self.overall_score = if total_weight > 0.0 {
total_score / total_weight
} else {
0.0
};
self.overall_score
}
fn calculate_group_score(&self, group: &MetricGroup) -> f64 {
// Implementation of group score calculation
// This would include normalization and weighting of individual metrics
0.0
}
}
// Define the system-wide metrics aggregator
pub struct SystemMetricsAggregator {
pub scorecards: HashMap<String, BalancedScorecard>,
pub service_dependencies: HashMap<String, Vec<String>>,
pub critical_paths: Vec<Vec<String>>,
}
impl SystemMetricsAggregator {
pub fn new() -> Self {
SystemMetricsAggregator {
scorecards: HashMap::new(),
service_dependencies: HashMap::new(),
critical_paths: Vec::new(),
}
}
pub fn add_scorecard(&mut self, scorecard: BalancedScorecard) {
let service_name = scorecard.service_name.clone();
self.scorecards.insert(service_name.clone(), scorecard);
// Update dependencies
let dependencies = self.scorecards.get(&service_name)
.map(|sc| sc.dependencies.clone())
.unwrap_or_default();
self.service_dependencies.insert(service_name, dependencies);
}
pub fn define_critical_path(&mut self, path: Vec<String>) {
self.critical_paths.push(path);
}
pub fn calculate_system_health(&self) -> f64 {
// Calculate overall system health based on service scores and critical paths
let mut total_score = 0.0;
// Weight critical paths more heavily
for path in &self.critical_paths {
let path_score = self.calculate_path_score(path);
total_score += path_score;
}
// Normalize by number of critical paths
if !self.critical_paths.is_empty() {
total_score /= self.critical_paths.len() as f64;
}
total_score
}
fn calculate_path_score(&self, path: &[String]) -> f64 {
// Calculate the score for a critical path
// This would consider the weakest link in the chain
let mut min_score = 1.0;
for service in path {
if let Some(scorecard) = self.scorecards.get(service) {
min_score = min_score.min(scorecard.overall_score);
}
}
min_score
}
pub fn identify_bottlenecks(&self) -> Vec<String> {
// Identify services that are bottlenecks in the system
let mut bottlenecks = Vec::new();
// Implementation of bottleneck detection algorithm
// This would consider service scores, dependencies, and critical paths
bottlenecks
}
}
• Created a unified performance dashboard:
// dashboard.js
import React, { useState, useEffect } from 'react';
import { Line, Bar, Radar } from 'react-chartjs-2';
import {
Box,
Grid,
Typography,
Paper,
Tabs,
Tab,
Select,
MenuItem,
FormControl,
InputLabel,
Slider,
Switch,
FormControlLabel
} from '@material-ui/core';
// Define the dashboard component
const PerformanceDashboard = () => {
const [timeRange, setTimeRange] = useState('1d');
const [services, setServices] = useState([]);
const [selectedServices, setSelectedServices] = useState([]);
const [metrics, setMetrics] = useState({});
const [correlations, setCorrelations] = useState([]);
const [anomalies, setAnomalies] = useState([]);
const [viewMode, setViewMode] = useState('service');
const [showBusinessImpact, setShowBusinessImpact] = useState(true);
// Fetch data on component mount and when timeRange changes
useEffect(() => {
fetchServices();
fetchMetrics(timeRange);
fetchCorrelations(timeRange);
fetchAnomalies(timeRange);
}, [timeRange]);
// Fetch services
const fetchServices = async () => {
try {
const response = await fetch('/api/services');
const data = await response.json();
setServices(data);
setSelectedServices(data.slice(0, 3).map(s => s.id)); // Select first 3 by default
} catch (error) {
console.error('Error fetching services:', error);
}
};
// Fetch metrics
const fetchMetrics = async (range) => {
try {
const response = await fetch(`/api/metrics?timeRange=${range}`);
const data = await response.json();
setMetrics(data);
} catch (error) {
console.error('Error fetching metrics:', error);
}
};
// Fetch correlations
const fetchCorrelations = async (range) => {
try {
const response = await fetch(`/api/correlations?timeRange=${range}`);
const data = await response.json();
setCorrelations(data);
} catch (error) {
console.error('Error fetching correlations:', error);
}
};
// Fetch anomalies
const fetchAnomalies = async (range) => {
try {
const response = await fetch(`/api/anomalies?timeRange=${range}`);
const data = await response.json();
setAnomalies(data);
} catch (error) {
console.error('Error fetching anomalies:', error);
}
};
// Handle service selection
const handleServiceChange = (event) => {
setSelectedServices(event.target.value);
};
// Handle time range change
const handleTimeRangeChange = (event, newValue) => {
setTimeRange(newValue);
};
// Handle view mode change
const handleViewModeChange = (event, newValue) => {
setViewMode(newValue);
};
// Handle business impact toggle
const handleBusinessImpactChange = (event) => {
setShowBusinessImpact(event.target.checked);
};
// Render service metrics
const renderServiceMetrics = () => {
return (
<Grid container spacing={3}>
{selectedServices.map(serviceId => {
const service = services.find(s => s.id === serviceId);
const serviceMetrics = metrics.services?.[serviceId] || {};
return (
<Grid item xs={12} md={6} lg={4} key={serviceId}>
<Paper elevation={3} style={{ padding: 16 }}>
<Typography variant="h6">{service?.name || 'Unknown Service'}</Typography>
<Box mt={2}>
<Line
data={{
labels: serviceMetrics.timestamps || [],
datasets: [
{
label: 'Response Time (ms)',
data: serviceMetrics.responseTime || [],
borderColor: 'rgba(75, 192, 192, 1)',
tension: 0.1
},
{
label: 'Error Rate (%)',
data: serviceMetrics.errorRate || [],
borderColor: 'rgba(255, 99, 132, 1)',
tension: 0.1
}
]
}}
options={{
scales: {
y: {
beginAtZero: true
}
}
}}
/>
</Box>
<Box mt={3}>
<Bar
data={{
labels: ['CPU', 'Memory', 'Network', 'Disk'],
datasets: [
{
label: 'Resource Usage (%)',
data: [
serviceMetrics.cpuUsage || 0,
serviceMetrics.memoryUsage || 0,
serviceMetrics.networkUsage || 0,
serviceMetrics.diskUsage || 0
],
backgroundColor: [
'rgba(75, 192, 192, 0.6)',
'rgba(54, 162, 235, 0.6)',
'rgba(153, 102, 255, 0.6)',
'rgba(255, 159, 64, 0.6)'
]
}
]
}}
options={{
scales: {
y: {
beginAtZero: true,
max: 100
}
}
}}
/>
</Box>
{showBusinessImpact && (
<Box mt={3}>
<Typography variant="subtitle1">Business Impact</Typography>
<Radar
data={{
labels: ['User Experience', 'Revenue', 'Conversion', 'Retention', 'Cost'],
datasets: [
{
label: 'Impact Score',
data: [
serviceMetrics.userExperienceImpact || 0,
serviceMetrics.revenueImpact || 0,
serviceMetrics.conversionImpact || 0,
serviceMetrics.retentionImpact || 0,
serviceMetrics.costImpact || 0
],
backgroundColor: 'rgba(255, 99, 132, 0.2)',
borderColor: 'rgba(255, 99, 132, 1)',
pointBackgroundColor: 'rgba(255, 99, 132, 1)'
}
]
}}
options={{
scales: {
r: {
angleLines: {
display: true
},
suggestedMin: 0,
suggestedMax: 100
}
}
}}
/>
</Box>
)}
</Paper>
</Grid>
);
})}
</Grid>
);
};
// Render system metrics
const renderSystemMetrics = () => {
return (
<Grid container spacing={3}>
<Grid item xs={12}>
<Paper elevation={3} style={{ padding: 16 }}>
<Typography variant="h6">End-to-End Performance</Typography>
<Box mt={2}>
<Line
data={{
labels: metrics.system?.timestamps || [],
datasets: [
{
label: 'End-to-End Latency (ms)',
data: metrics.system?.endToEndLatency || [],
borderColor: 'rgba(75, 192, 192, 1)',
tension: 0.1
},
{
label: 'User Perceived Latency (ms)',
data: metrics.system?.userPerceivedLatency || [],
borderColor: 'rgba(153, 102, 255, 1)',
tension: 0.1
}
]
}}
options={{
scales: {
y: {
beginAtZero: true
}
}
}}
/>
</Box>
</Paper>
</Grid>
<Grid item xs={12} md={6}>
<Paper elevation={3} style={{ padding: 16 }}>
<Typography variant="h6">System Resource Usage</Typography>
<Box mt={2}>
<Line
data={{
labels: metrics.system?.timestamps || [],
datasets: [
{
label: 'Total CPU Usage (%)',
data: metrics.system?.totalCpuUsage || [],
borderColor: 'rgba(75, 192, 192, 1)',
tension: 0.1
},
{
label: 'Total Memory Usage (%)',
data: metrics.system?.totalMemoryUsage || [],
borderColor: 'rgba(54, 162, 235, 1)',
tension: 0.1
}
]
}}
options={{
scales: {
y: {
beginAtZero: true,
max: 100
}
}
}}
/>
</Box>
</Paper>
</Grid>
<Grid item xs={12} md={6}>
<Paper elevation={3} style={{ padding: 16 }}>
<Typography variant="h6">Business Metrics</Typography>
<Box mt={2}>
<Line
data={{
labels: metrics.business?.timestamps || [],
datasets: [
{
label: 'Conversion Rate (%)',
data: metrics.business?.conversionRate || [],
borderColor: 'rgba(255, 99, 132, 1)',
tension: 0.1
},
{
label: 'Revenue ($)',
data: metrics.business?.revenue || [],
borderColor: 'rgba(255, 159, 64, 1)',
tension: 0.1,
yAxisID: 'y1'
}
]
}}
options={{
scales: {
y: {
beginAtZero: true,
position: 'left',
title: {
display: true,
text: 'Conversion Rate (%)'
}
},
y1: {
beginAtZero: true,
position: 'right',
grid: {
drawOnChartArea: false
},
title: {
display: true,
text: 'Revenue ($)'
}
}
}
}}
/>
</Box>
</Paper>
</Grid>
</Grid>
);
};
// Render correlation view
const renderCorrelations = () => {
return (
<Grid container spacing={3}>
<Grid item xs={12}>
<Paper elevation={3} style={{ padding: 16 }}>
<Typography variant="h6">Metric Correlations</Typography>
<Box mt={2} style={{ height: 500 }}>
{/* Correlation matrix visualization would go here */}
{/* This would typically be a heatmap or network graph */}
</Box>
</Paper>
</Grid>
</Grid>
);
};
return (
<Box p={3}>
<Typography variant="h4" gutterBottom>Performance Dashboard</Typography>
<Box mb={3}>
<Grid container spacing={3} alignItems="center">
<Grid item xs={12} md={4}>
<FormControl fullWidth>
<InputLabel>Services</InputLabel>
<Select
multiple
value={selectedServices}
onChange={handleServiceChange}
renderValue={(selected) => selected.map(id =>
services.find(s => s.id === id)?.name || id
).join(', ')}
>
{services.map(service => (
<MenuItem key={service.id} value={service.id}>
{service.name}
</MenuItem>
))}
</Select>
</FormControl>
</Grid>
<Grid item xs={12} md={4}>
<FormControl fullWidth>
<InputLabel>Time Range</InputLabel>
<Select
value={timeRange}
onChange={(e) => setTimeRange(e.target.value)}
>
<MenuItem value="1h">Last Hour</MenuItem>
<MenuItem value="6h">Last 6 Hours</MenuItem>
<MenuItem value="1d">Last Day</MenuItem>
<MenuItem value="7d">Last Week</MenuItem>
<MenuItem value="30d">Last Month</MenuItem>
</Select>
</FormControl>
</Grid>
<Grid item xs={12} md={4}>
<FormControlLabel
control={
<Switch
checked={showBusinessImpact}
onChange={handleBusinessImpactChange}
color="primary"
/>
}
label="Show Business Impact"
/>
</Grid>
</Grid>
</Box>
<Box mb={3}>
<Tabs
value={viewMode}
onChange={handleViewModeChange}
indicatorColor="primary"
textColor="primary"
centered
>
<Tab label="Service View" value="service" />
<Tab label="System View" value="system" />
<Tab label="Correlations" value="correlations" />
</Tabs>
</Box>
{viewMode === 'service' && renderServiceMetrics()}
{viewMode === 'system' && renderSystemMetrics()}
{viewMode === 'correlations' && renderCorrelations()}
{anomalies.length > 0 && (
<Box mt={4}>
<Typography variant="h6" gutterBottom>Detected Anomalies</Typography>
<Paper elevation={3} style={{ padding: 16 }}>
{anomalies.map((anomaly, index) => (
<Box key={index} mb={2}>
<Typography variant="subtitle1" color="error">
{anomaly.description}
</Typography>
<Typography variant="body2">
Detected at: {new Date(anomaly.timestamp).toLocaleString()}
</Typography>
<Typography variant="body2">
Affected services: {anomaly.affectedServices.join(', ')}
</Typography>
</Box>
))}
</Paper>
</Box>
)}
</Box>
);
};
export default PerformanceDashboard;
Lessons Learned:
Performance metrics must be balanced and aligned with overall system goals.
How to Avoid:
Implement a balanced scorecard approach to performance metrics.
Consider the impact of local optimizations on global performance.
Align technical metrics with business outcomes.
Test performance changes in an end-to-end context (see the sketch after this list).
Create a culture of system thinking rather than component optimization.
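One way to apply the end-to-end guideline is to gate performance changes on system-level measurements rather than only the owning service's metrics. A minimal sketch of such a gate; the SystemSnapshot shape and the tolerance value are assumptions:
// e2e_gate.ts (illustrative sketch)
interface SystemSnapshot {
  endToEndLatencyMs: number;    // measured across the full user journey
  userExperienceScore: number;  // e.g. a synthetic or RUM score, 0-100
}

// Reject a change that degrades system-level latency or user experience beyond a
// tolerance, even if the owning service's own metrics improved.
function passesEndToEndGate(before: SystemSnapshot, after: SystemSnapshot, tolerance = 0.05): boolean {
  const latencyOk = after.endToEndLatencyMs <= before.endToEndLatencyMs * (1 + tolerance);
  const experienceOk = after.userExperienceScore >= before.userExperienceScore * (1 - tolerance);
  return latencyOk && experienceOk;
}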
Summary: Deployment frequency was inflated by counting low-value deployments, hiding declining production quality behind an impressive-looking metric.
What Happened:
A DevOps team implemented DORA metrics tracking and reported an impressive deployment frequency (multiple times per day) to leadership. However, production quality was deteriorating and customer complaints were increasing. When leadership investigated, they discovered that while deployment frequency looked good on paper, the actual value delivered was minimal.
Diagnosis Steps:
Reviewed the deployment frequency calculation methodology.
Analyzed the correlation between deployments and feature delivery.
Examined the definition of "deployment" used in metrics collection.
Compared deployment metrics with customer satisfaction scores.
Investigated the deployment pipeline and release process.
Root Cause:
The team was counting every configuration change and minor patch as a "deployment" in their metrics, artificially inflating their deployment frequency. Many of these deployments contained no meaningful features or fixes. Additionally, the team wasn't measuring other critical DORA metrics like change failure rate, lead time for changes, and mean time to recovery, which would have revealed the quality issues.
Fix/Workaround:
• Short-term: Redefined "deployment" to focus on value delivery:
# Prometheus metric definition with improved deployment criteria
- name: dora_deployment_frequency
  type: counter
  help: Number of successful deployments to production
  labels:
    - environment
    - team
    - contains_features
    - contains_fixes
    - deployment_type
• Long-term: Implemented a comprehensive DORA metrics framework:
// metrics/dora.go
package metrics
import (
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
// Deployment Frequency
DeploymentCounter = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "dora_deployments_total",
Help: "Total number of deployments to production",
},
[]string{"team", "service", "contains_features", "contains_fixes", "deployment_type"},
)
// Lead Time for Changes
LeadTimeHistogram = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "dora_lead_time_seconds",
Help: "Time from commit to deployment in seconds",
Buckets: prometheus.ExponentialBuckets(60, 2, 15), // From 1 minute to ~22 days
},
[]string{"team", "service"},
)
// Change Failure Rate
ChangeFailureCounter = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "dora_deployment_failures_total",
Help: "Total number of failed deployments",
},
[]string{"team", "service", "failure_reason"},
)
// Mean Time to Recovery
RecoveryTimeHistogram = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "dora_recovery_time_seconds",
Help: "Time to recover from a failed deployment in seconds",
Buckets: prometheus.ExponentialBuckets(60, 2, 15), // From 1 minute to ~22 days
},
[]string{"team", "service", "failure_reason"},
)
)
// RecordDeployment records a deployment event
func RecordDeployment(team, service, deploymentType string, containsFeatures, containsFixes bool) {
features := "false"
if containsFeatures {
features = "true"
}
fixes := "false"
if containsFixes {
fixes = "true"
}
DeploymentCounter.WithLabelValues(team, service, features, fixes, deploymentType).Inc()
}
// RecordLeadTime records the lead time for a change
func RecordLeadTime(team, service string, commitTime, deployTime time.Time) {
leadTime := deployTime.Sub(commitTime).Seconds()
LeadTimeHistogram.WithLabelValues(team, service).Observe(leadTime)
}
// RecordDeploymentFailure records a deployment failure
func RecordDeploymentFailure(team, service, reason string) {
ChangeFailureCounter.WithLabelValues(team, service, reason).Inc()
}
// RecordRecoveryTime records the time to recover from a failure
func RecordRecoveryTime(team, service, reason string, failureTime, recoveryTime time.Time) {
recoveryDuration := recoveryTime.Sub(failureTime).Seconds()
RecoveryTimeHistogram.WithLabelValues(team, service, reason).Observe(recoveryDuration)
}
• Created a CI/CD pipeline integration to automatically collect metrics:
# Jenkins pipeline with DORA metrics integration
pipeline {
agent any
environment {
TEAM_NAME = "platform-team"
SERVICE_NAME = "payment-service"
COMMIT_TIME = ""
CONTAINS_FEATURES = "false"
CONTAINS_FIXES = "false"
DEPLOYMENT_TYPE = "regular"
}
stages {
stage('Prepare') {
steps {
script {
// Get commit timestamp
COMMIT_TIME = sh(script: 'git show -s --format=%ct HEAD', returnStdout: true).trim()
// Determine if deployment contains features or fixes
def commitMessages = sh(script: 'git log --pretty=format:"%s" $(git describe --tags --abbrev=0)..HEAD', returnStdout: true).trim()
if (commitMessages.contains("feat:") || commitMessages.contains("feature:")) {
CONTAINS_FEATURES = "true"
}
if (commitMessages.contains("fix:") || commitMessages.contains("bugfix:")) {
CONTAINS_FIXES = "true"
}
// Determine deployment type
if (env.BRANCH_NAME == 'main' || env.BRANCH_NAME == 'master') {
DEPLOYMENT_TYPE = "regular"
} else if (env.BRANCH_NAME.startsWith('hotfix/')) {
DEPLOYMENT_TYPE = "hotfix"
} else if (env.BRANCH_NAME.startsWith('release/')) {
DEPLOYMENT_TYPE = "release"
}
}
}
}
// Build, test, and other stages...
stage('Deploy') {
steps {
script {
try {
// Deployment steps...
sh 'kubectl apply -f kubernetes/deployment.yaml'
// Record successful deployment
def deployTime = sh(script: 'date +%s', returnStdout: true).trim()
// Record DORA metrics
sh """
curl -X POST http://metrics-server:8080/metrics/deployment \\
-H 'Content-Type: application/json' \\
-d '{
"team": "${TEAM_NAME}",
"service": "${SERVICE_NAME}",
"contains_features": ${CONTAINS_FEATURES},
"contains_fixes": ${CONTAINS_FIXES},
"deployment_type": "${DEPLOYMENT_TYPE}",
"commit_time": ${COMMIT_TIME},
"deploy_time": ${deployTime}
}'
"""
} catch (Exception e) {
// Record deployment failure
sh """
curl -X POST http://metrics-server:8080/metrics/deployment/failure \\
-H 'Content-Type: application/json' \\
-d '{
"team": "${TEAM_NAME}",
"service": "${SERVICE_NAME}",
"failure_reason": "deployment_error",
"failure_time": '$(date +%s)'
}'
"""
throw e
}
}
}
}
}
post {
failure {
script {
// Additional failure handling...
}
}
success {
script {
// Additional success handling...
}
}
}
}
• Implemented a comprehensive metrics dashboard:
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
},
{
"datasource": "Prometheus",
"enable": true,
"expr": "changes(dora_deployments_total{team=\"$team\", service=\"$service\"}[1m]) > 0",
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Deployments",
"showIn": 0,
"tags": [],
"type": "tags"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 0,
"id": 42,
"links": [],
"panels": [
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 20,
"panels": [],
"title": "DORA Metrics Overview",
"type": "row"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "red",
"value": null
},
{
"color": "yellow",
"value": 1
},
{
"color": "green",
"value": 7
}
]
},
"unit": "deployments"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 0,
"y": 1
},
"id": 2,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"sum"
],
"fields": "",
"values": false
},
"text": {},
"textMode": "auto"
},
"pluginVersion": "7.5.5",
"targets": [
{
"expr": "sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\"}[7d]))",
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"title": "Deployment Frequency (Weekly)",
"type": "stat"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 86400
},
{
"color": "red",
"value": 604800
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 6,
"y": 1
},
"id": 4,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"mean"
],
"fields": "",
"values": false
},
"text": {},
"textMode": "auto"
},
"pluginVersion": "7.5.5",
"targets": [
{
"expr": "histogram_quantile(0.5, sum by (le) (rate(dora_lead_time_seconds_bucket{team=\"$team\", service=\"$service\"}[30d])))",
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"title": "Lead Time for Changes (Median)",
"type": "stat"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"max": 100,
"min": 0,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 15
},
{
"color": "red",
"value": 30
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 12,
"y": 1
},
"id": 6,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"text": {},
"textMode": "auto"
},
"pluginVersion": "7.5.5",
"targets": [
{
"expr": "sum(increase(dora_deployment_failures_total{team=\"$team\", service=\"$service\"}[30d])) / sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\"}[30d])) * 100",
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"title": "Change Failure Rate (30d)",
"type": "stat"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 3600
},
{
"color": "red",
"value": 86400
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 18,
"y": 1
},
"id": 8,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"mean"
],
"fields": "",
"values": false
},
"text": {},
"textMode": "auto"
},
"pluginVersion": "7.5.5",
"targets": [
{
"expr": "histogram_quantile(0.5, sum by (le) (rate(dora_recovery_time_seconds_bucket{team=\"$team\", service=\"$service\"}[30d])))",
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"title": "Mean Time to Recovery",
"type": "stat"
},
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 9
},
"id": 22,
"panels": [],
"title": "Deployment Details",
"type": "row"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "bars",
"fillOpacity": 100,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "normal"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 10
},
"id": 10,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single"
}
},
"pluginVersion": "7.5.5",
"targets": [
{
"expr": "sum by (deployment_type) (increase(dora_deployments_total{team=\"$team\", service=\"$service\"}[1d]))",
"interval": "1d",
"legendFormat": "{{deployment_type}}",
"refId": "A"
}
],
"title": "Daily Deployments by Type",
"type": "timeseries"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 10
},
"id": 12,
"options": {
"legend": {
"calcs": [
"mean",
"max",
"min"
],
"displayMode": "table",
"placement": "bottom"
},
"tooltip": {
"mode": "single"
}
},
"pluginVersion": "7.5.5",
"targets": [
{
"expr": "histogram_quantile(0.5, sum by (le) (rate(dora_lead_time_seconds_bucket{team=\"$team\", service=\"$service\"}[7d])))",
"interval": "1d",
"legendFormat": "Median",
"refId": "A"
},
{
"expr": "histogram_quantile(0.9, sum by (le) (rate(dora_lead_time_seconds_bucket{team=\"$team\", service=\"$service\"}[7d])))",
"interval": "1d",
"legendFormat": "90th Percentile",
"refId": "B"
}
],
"title": "Lead Time Trend",
"type": "timeseries"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 18
},
"id": 14,
"options": {
"displayMode": "gradient",
"orientation": "horizontal",
"reduceOptions": {
"calcs": [
"sum"
],
"fields": "",
"values": false
},
"showUnfilled": true,
"text": {}
},
"pluginVersion": "7.5.5",
"targets": [
{
"expr": "sum by (failure_reason) (increase(dora_deployment_failures_total{team=\"$team\", service=\"$service\"}[30d]))",
"interval": "",
"legendFormat": "{{failure_reason}}",
"refId": "A"
}
],
"title": "Deployment Failures by Reason (30d)",
"type": "bargauge"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "percentunit"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 18
},
"id": 16,
"options": {
"legend": {
"calcs": [
"mean"
],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single"
}
},
"pluginVersion": "7.5.5",
"targets": [
{
"expr": "sum(increase(dora_deployment_failures_total{team=\"$team\", service=\"$service\"}[7d])) / sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\"}[7d]))",
"interval": "1d",
"legendFormat": "Change Failure Rate (7d)",
"refId": "A"
}
],
"title": "Change Failure Rate Trend",
"type": "timeseries"
},
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 26
},
"id": 24,
"panels": [],
"title": "Value Delivery",
"type": "row"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "bars",
"fillOpacity": 100,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "normal"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 27
},
"id": 18,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single"
}
},
"pluginVersion": "7.5.5",
"targets": [
{
"expr": "sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\", contains_features=\"true\"}[1d]))",
"interval": "1d",
"legendFormat": "With Features",
"refId": "A"
},
{
"expr": "sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\", contains_fixes=\"true\"}[1d]))",
"interval": "1d",
"legendFormat": "With Fixes",
"refId": "B"
},
{
"expr": "sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\", contains_features=\"false\", contains_fixes=\"false\"}[1d]))",
"interval": "1d",
"legendFormat": "Configuration Only",
"refId": "C"
}
],
"title": "Deployments by Content Type",
"type": "timeseries"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "red",
"value": null
},
{
"color": "yellow",
"value": 0.3
},
{
"color": "green",
"value": 0.5
}
]
},
"unit": "percentunit"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 27
},
"id": 26,
"options": {
"displayMode": "gradient",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showUnfilled": true,
"text": {}
},
"pluginVersion": "7.5.5",
"targets": [
{
"expr": "sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\", contains_features=\"true\"}[30d])) / sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\"}[30d]))",
"interval": "",
"legendFormat": "Feature Ratio",
"refId": "A"
},
{
"expr": "sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\", contains_fixes=\"true\"}[30d])) / sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\"}[30d]))",
"interval": "",
"legendFormat": "Fix Ratio",
"refId": "B"
},
{
"expr": "sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\", contains_features=\"false\", contains_fixes=\"false\"}[30d])) / sum(increase(dora_deployments_total{team=\"$team\", service=\"$service\"}[30d]))",
"interval": "",
"legendFormat": "Config-only Ratio",
"refId": "C"
}
],
"title": "Value Delivery Ratio (30d)",
"type": "bargauge"
}
],
"refresh": "5m",
"schemaVersion": 27,
"style": "dark",
"tags": [
"dora",
"devops"
],
"templating": {
"list": [
{
"allValue": null,
"current": {
"selected": false,
"text": "platform-team",
"value": "platform-team"
},
"datasource": "Prometheus",
"definition": "label_values(dora_deployments_total, team)",
"description": null,
"error": null,
"hide": 0,
"includeAll": false,
"label": "Team",
"multi": false,
"name": "team",
"options": [],
"query": {
"query": "label_values(dora_deployments_total, team)",
"refId": "StandardVariableQuery"
},
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 1,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
"selected": false,
"text": "payment-service",
"value": "payment-service"
},
"datasource": "Prometheus",
"definition": "label_values(dora_deployments_total{team=\"$team\"}, service)",
"description": null,
"error": null,
"hide": 0,
"includeAll": false,
"label": "Service",
"multi": false,
"name": "service",
"options": [],
"query": {
"query": "label_values(dora_deployments_total{team=\"$team\"}, service)",
"refId": "StandardVariableQuery"
},
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 1,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-30d",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "DORA Metrics Dashboard",
"uid": "dora-metrics",
"version": 1
}
Lessons Learned:
DevOps metrics must focus on value delivery, not just deployment frequency.
How to Avoid:
Implement all four DORA metrics, not just deployment frequency.
Define clear criteria for what constitutes a meaningful deployment.
Track the value content of deployments (features, fixes, etc.); see the sketch after this list.
Correlate technical metrics with business outcomes.
Regularly review metrics definitions to ensure they align with business goals.
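Tracking the value content of deployments can be reduced to keeping feature, fix, and configuration-only deployments separate and reporting their ratio, which is what the "Value Delivery Ratio" panel above computes from the contains_features and contains_fixes labels. A minimal sketch of the same calculation outside Grafana, with an assumed DeploymentRecord shape:
// value_ratio.ts (illustrative sketch)
interface DeploymentRecord {
  containsFeatures: boolean;
  containsFixes: boolean;
}

// Fraction of deployments that shipped at least one feature or fix,
// mirroring the "Value Delivery Ratio" panel above.
function valueDeliveryRatio(deployments: DeploymentRecord[]): number {
  if (deployments.length === 0) {
    return 0;
  }
  const valuable = deployments.filter((d) => d.containsFeatures || d.containsFixes).length;
  return valuable / deployments.length;
}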
Summary: Flawed metric definitions and incomplete data collection made the DORA metrics diverge from the team's actual delivery performance.
What Happened:
After implementing DORA metrics (Deployment Frequency, Lead Time for Changes, Mean Time to Restore, Change Failure Rate), a DevOps team noticed a disconnect between the metrics and actual team performance. Despite the metrics showing improvement, teams were still experiencing delivery bottlenecks and quality issues.
Diagnosis Steps:
Reviewed the implementation of each DORA metric.
Analyzed the data sources and collection methods.
Compared metric calculations with industry standards.
Interviewed teams about their development and deployment processes.
Audited the CI/CD pipeline instrumentation.
Root Cause:
The metrics implementation had several flaws:
1. Deployment Frequency counted all deployments, including failed ones and rollbacks.
2. Lead Time for Changes was measured from code merge to production, missing the time from initial commit to code review completion.
3. Mean Time to Restore only tracked incidents logged in the incident management system, missing many smaller issues fixed without formal tickets.
4. Change Failure Rate didn't distinguish between critical and minor failures.
Fix/Workaround:
• Short-term: Corrected the metrics implementation:
# Prometheus metrics configuration
- name: dora_deployment_frequency
  help: "Number of successful deployments to production per day"
  type: counter
  labels:
    - service
    - environment
    - team
    - status
- name: dora_lead_time_seconds
  help: "Time from first commit to production deployment"
  type: histogram
  buckets: [3600, 7200, 14400, 28800, 86400, 172800, 604800, 1209600, 2419200]
  labels:
    - service
    - team
    - pr_number
- name: dora_time_to_restore_seconds
  help: "Time to restore service after an incident"
  type: histogram
  buckets: [300, 900, 1800, 3600, 7200, 14400, 28800, 86400, 172800]
  labels:
    - service
    - severity
    - incident_id
- name: dora_change_failure_rate
  help: "Percentage of deployments causing a failure"
  type: gauge
  labels:
    - service
    - team
    - severity
• Long-term: Implemented a comprehensive metrics collection system:
// dora_metrics.go - DORA metrics collector
package main
import (
"context"
"log"
"net/http"
"os"
"strconv"
"time"
"github.com/google/go-github/v45/github"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
"golang.org/x/oauth2"
"gopkg.in/yaml.v3"
)
// Configuration
type Config struct {
GitHub struct {
Owner string `yaml:"owner"`
Repos []string `yaml:"repos"`
Token string `yaml:"token"`
PRLabelDone string `yaml:"prLabelDone"`
} `yaml:"github"`
Jenkins struct {
URL string `yaml:"url"`
Username string `yaml:"username"`
Token string `yaml:"token"`
Jobs []struct {
Name string `yaml:"name"`
Environment string `yaml:"environment"`
} `yaml:"jobs"`
} `yaml:"jenkins"`
Jira struct {
URL string `yaml:"url"`
Username string `yaml:"username"`
Token string `yaml:"token"`
Project string `yaml:"project"`
} `yaml:"jira"`
ServiceMapping map[string]string `yaml:"serviceMapping"`
TeamMapping map[string]string `yaml:"teamMapping"`
}
// Prometheus metrics
var (
deploymentFrequency = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "dora_deployment_frequency",
Help: "Number of successful deployments to production per day",
},
[]string{"service", "environment", "team", "status"},
)
leadTimeHistogram = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "dora_lead_time_seconds",
Help: "Time from first commit to production deployment",
Buckets: []float64{3600, 7200, 14400, 28800, 86400, 172800, 604800, 1209600, 2419200},
},
[]string{"service", "team", "pr_number"},
)
timeToRestoreHistogram = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "dora_time_to_restore_seconds",
Help: "Time to restore service after an incident",
Buckets: []float64{300, 900, 1800, 3600, 7200, 14400, 28800, 86400, 172800},
},
[]string{"service", "severity", "incident_id"},
)
changeFailureRate = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "dora_change_failure_rate",
Help: "Percentage of deployments causing a failure",
},
[]string{"service", "team", "severity"},
)
)
func main() {
// Load configuration
configFile, err := os.ReadFile("config.yaml")
if err != nil {
log.Fatalf("Failed to read config file: %v", err)
}
var config Config
if err := yaml.Unmarshal(configFile, &config); err != nil {
log.Fatalf("Failed to parse config: %v", err)
}
// Set up GitHub client
ctx := context.Background()
ts := oauth2.StaticTokenSource(
&oauth2.Token{AccessToken: config.GitHub.Token},
)
tc := oauth2.NewClient(ctx, ts)
githubClient := github.NewClient(tc)
// Start HTTP server for Prometheus metrics
http.Handle("/metrics", promhttp.Handler())
go func() {
log.Fatal(http.ListenAndServe(":8080", nil))
}()
// Start collectors
go collectDeploymentMetrics(config)
go collectLeadTimeMetrics(ctx, githubClient, config)
go collectIncidentMetrics(config)
go calculateChangeFailureRate(config)
// Keep the main thread running
select {}
}
func collectDeploymentMetrics(config Config) {
for {
// For each Jenkins job that deploys to production
for _, job := range config.Jenkins.Jobs {
// Get recent builds
builds, err := getJenkinsBuilds(config, job.Name)
if err != nil {
log.Printf("Failed to get builds for job %s: %v", job.Name, err)
continue
}
for _, build := range builds {
// Only count successful builds
if build.Result == "SUCCESS" {
// Map job to service and team
service := job.Name
if mapped, ok := config.ServiceMapping[job.Name]; ok {
service = mapped
}
team := "unknown"
if mapped, ok := config.TeamMapping[service]; ok {
team = mapped
}
// Increment deployment counter
deploymentFrequency.WithLabelValues(service, job.Environment, team, "success").Inc()
}
}
}
// Sleep for 15 minutes before next collection
time.Sleep(15 * time.Minute)
}
}
func collectLeadTimeMetrics(ctx context.Context, client *github.Client, config Config) {
for {
// For each repository
for _, repo := range config.GitHub.Repos {
// Get merged PRs
prs, err := getMergedPRs(ctx, client, config.GitHub.Owner, repo)
if err != nil {
log.Printf("Failed to get PRs for repo %s: %v", repo, err)
continue
}
for _, pr := range prs {
// Get first commit time
firstCommit, err := getFirstCommitTime(ctx, client, config.GitHub.Owner, repo, pr.Number)
if err != nil {
log.Printf("Failed to get first commit for PR #%d: %v", pr.Number, err)
continue
}
// Get deployment time
deploymentTime, err := getDeploymentTime(config, repo, pr.Number)
if err != nil {
log.Printf("Failed to get deployment time for PR #%d: %v", pr.Number, err)
continue
}
// Calculate lead time
leadTime := deploymentTime.Sub(firstCommit).Seconds()
// Map repo to service and team
service := repo
if mapped, ok := config.ServiceMapping[repo]; ok {
service = mapped
}
team := "unknown"
if mapped, ok := config.TeamMapping[service]; ok {
team = mapped
}
// Record lead time
leadTimeHistogram.WithLabelValues(
service,
team,
strconv.Itoa(pr.Number),
).Observe(leadTime)
}
}
// Sleep for 1 hour before next collection
time.Sleep(1 * time.Hour)
}
}
func collectIncidentMetrics(config Config) {
for {
// Get incidents from Jira
incidents, err := getJiraIncidents(config)
if err != nil {
log.Printf("Failed to get incidents: %v", err)
time.Sleep(15 * time.Minute)
continue
}
for _, incident := range incidents {
// Calculate time to restore
timeToRestore := incident.ResolutionTime.Sub(incident.CreatedTime).Seconds()
// Record time to restore
timeToRestoreHistogram.WithLabelValues(
incident.Service,
incident.Severity,
incident.ID,
).Observe(timeToRestore)
}
// Sleep for 15 minutes before next collection
time.Sleep(15 * time.Minute)
}
}
func calculateChangeFailureRate(config Config) {
for {
// For each service
for service, team := range config.TeamMapping {
// Get total deployments
totalDeployments, err := getTotalDeployments(config, service)
if err != nil {
log.Printf("Failed to get total deployments for service %s: %v", service, err)
continue
}
// Get failed deployments
failedDeployments, err := getFailedDeployments(config, service)
if err != nil {
log.Printf("Failed to get failed deployments for service %s: %v", service, err)
continue
}
// Calculate failure rate
var failureRate float64
if totalDeployments > 0 {
failureRate = float64(failedDeployments) / float64(totalDeployments) * 100
}
// Record failure rate
changeFailureRate.WithLabelValues(service, team, "all").Set(failureRate)
// Calculate critical failure rate
criticalFailures, err := getCriticalFailures(config, service)
if err != nil {
log.Printf("Failed to get critical failures for service %s: %v", service, err)
continue
}
var criticalFailureRate float64
if totalDeployments > 0 {
criticalFailureRate = float64(criticalFailures) / float64(totalDeployments) * 100
}
// Record critical failure rate
changeFailureRate.WithLabelValues(service, team, "critical").Set(criticalFailureRate)
}
// Sleep for 1 hour before next calculation
time.Sleep(1 * time.Hour)
}
}
// Helper functions (simplified for brevity)
func getJenkinsBuilds(config Config, jobName string) ([]struct{ Result string }, error) {
// Implementation would use Jenkins API to get builds
return []struct{ Result string }{
{Result: "SUCCESS"},
{Result: "FAILURE"},
{Result: "SUCCESS"},
}, nil
}
func getMergedPRs(ctx context.Context, client *github.Client, owner, repo string) ([]*github.PullRequest, error) {
// Implementation would use GitHub API to get merged PRs
return []*github.PullRequest{
{Number: github.Int(123)},
{Number: github.Int(124)},
}, nil
}
func getFirstCommitTime(ctx context.Context, client *github.Client, owner, repo string, prNumber int) (time.Time, error) {
// Implementation would use GitHub API to get first commit time
return time.Now().Add(-7 * 24 * time.Hour), nil
}
func getDeploymentTime(config Config, repo string, prNumber int) (time.Time, error) {
// Implementation would use deployment logs or CI/CD system to get deployment time
return time.Now(), nil
}
type Incident struct {
ID string
Service string
Severity string
CreatedTime time.Time
ResolutionTime time.Time
}
func getJiraIncidents(config Config) ([]Incident, error) {
// Implementation would use Jira API to get incidents
return []Incident{
{
ID: "INC-123",
Service: "api-gateway",
Severity: "high",
CreatedTime: time.Now().Add(-24 * time.Hour),
ResolutionTime: time.Now().Add(-23 * time.Hour),
},
}, nil
}
func getTotalDeployments(config Config, service string) (int, error) {
// Implementation would query deployment logs or CI/CD system
return 100, nil
}
func getFailedDeployments(config Config, service string) (int, error) {
// Implementation would query deployment logs or CI/CD system
return 5, nil
}
func getCriticalFailures(config Config, service string) (int, error) {
// Implementation would query incident management system
return 2, nil
}
• Created a Rust-based DORA metrics dashboard:
// dora_dashboard.rs
use actix_web::{web, App, HttpResponse, HttpServer, Responder};
use chrono::{DateTime, Duration, Utc};
use reqwest::Client;
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::Mutex;
use tokio::time;
#[derive(Debug, Clone, Serialize, Deserialize)]
struct DoraMetrics {
service: String,
team: String,
deployment_frequency: f64,
lead_time_days: f64,
mttr_hours: f64,
change_failure_rate: f64,
timestamp: DateTime<Utc>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
struct ServiceMetrics {
service: String,
team: String,
metrics: Vec<DoraMetrics>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
struct TeamPerformance {
team: String,
performance_level: String,
metrics: DoraMetrics,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
struct Dashboard {
timestamp: DateTime<Utc>,
services: Vec<ServiceMetrics>,
teams: Vec<TeamPerformance>,
organization_metrics: DoraMetrics,
}
#[actix_web::main]
async fn main() -> std::io::Result<()> {
// Create shared state
let dashboard = Arc::new(Mutex::new(Dashboard {
timestamp: Utc::now(),
services: Vec::new(),
teams: Vec::new(),
organization_metrics: DoraMetrics {
service: "organization".to_string(),
team: "all".to_string(),
deployment_frequency: 0.0,
lead_time_days: 0.0,
mttr_hours: 0.0,
change_failure_rate: 0.0,
timestamp: Utc::now(),
},
}));
// Start metrics collector
let collector_dashboard = dashboard.clone();
tokio::spawn(async move {
collect_metrics(collector_dashboard).await;
});
// Start HTTP server
HttpServer::new(move || {
App::new()
.app_data(web::Data::new(dashboard.clone()))
.route("/api/dashboard", web::get().to(get_dashboard))
.route("/api/services", web::get().to(get_services))
.route("/api/teams", web::get().to(get_teams))
.route(
"/api/service/{service}",
web::get().to(get_service_metrics),
)
.route("/api/team/{team}", web::get().to(get_team_metrics))
})
.bind("0.0.0.0:8080")?
.run()
.await
}
async fn collect_metrics(dashboard: Arc<Mutex<Dashboard>>) {
let client = Client::new();
let prometheus_url = std::env::var("PROMETHEUS_URL").unwrap_or_else(|_| "http://prometheus:9090".to_string());
loop {
// Collect metrics from Prometheus
match collect_prometheus_metrics(&client, &prometheus_url).await {
Ok(metrics) => {
// Update dashboard
let mut dashboard = dashboard.lock().await;
dashboard.timestamp = Utc::now();
dashboard.services = metrics.services;
dashboard.teams = metrics.teams;
dashboard.organization_metrics = metrics.organization_metrics;
}
Err(e) => {
eprintln!("Failed to collect metrics: {}", e);
}
}
// Sleep for 15 minutes
time::sleep(time::Duration::from_secs(15 * 60)).await;
}
}
async fn collect_prometheus_metrics(
client: &Client,
prometheus_url: &str,
) -> Result<Dashboard, Box<dyn std::error::Error>> {
// Query deployment frequency
let deployment_frequency = query_prometheus(
client,
prometheus_url,
"sum(increase(dora_deployment_frequency{status='success',environment='production'}[30d])) by (service, team) / 30",
)
.await?;
// Query lead time
let lead_time = query_prometheus(
client,
prometheus_url,
"sum(rate(dora_lead_time_seconds_sum[30d])) by (service, team) / sum(rate(dora_lead_time_seconds_count[30d])) by (service, team) / 86400",
)
.await?;
// Query MTTR
let mttr = query_prometheus(
client,
prometheus_url,
"sum(rate(dora_time_to_restore_seconds_sum[30d])) by (service, severity) / sum(rate(dora_time_to_restore_seconds_count[30d])) by (service, severity) / 3600",
)
.await?;
// Query change failure rate
let change_failure_rate = query_prometheus(
client,
prometheus_url,
"avg(dora_change_failure_rate) by (service, team)",
)
.await?;
// Process metrics
let mut services_map: HashMap<String, ServiceMetrics> = HashMap::new();
let mut teams_map: HashMap<String, Vec<DoraMetrics>> = HashMap::new();
let mut org_metrics = DoraMetrics {
service: "organization".to_string(),
team: "all".to_string(),
deployment_frequency: 0.0,
lead_time_days: 0.0,
mttr_hours: 0.0,
change_failure_rate: 0.0,
timestamp: Utc::now(),
};
// Process deployment frequency
for (labels, value) in deployment_frequency {
let service = labels.get("service").cloned().unwrap_or_default();
let team = labels.get("team").cloned().unwrap_or_default();
let metrics = DoraMetrics {
service: service.clone(),
team: team.clone(),
deployment_frequency: value,
lead_time_days: 0.0,
mttr_hours: 0.0,
change_failure_rate: 0.0,
timestamp: Utc::now(),
};
// Update service metrics
if !services_map.contains_key(&service) {
services_map.insert(
service.clone(),
ServiceMetrics {
service: service.clone(),
team: team.clone(),
metrics: Vec::new(),
},
);
}
if let Some(service_metrics) = services_map.get_mut(&service) {
service_metrics.metrics.push(metrics.clone());
}
// Update team metrics
if !teams_map.contains_key(&team) {
teams_map.insert(team.clone(), Vec::new());
}
if let Some(team_metrics) = teams_map.get_mut(&team) {
team_metrics.push(metrics);
}
// Update org metrics
org_metrics.deployment_frequency += value;
}
// Process lead time
for (labels, value) in lead_time {
let service = labels.get("service").cloned().unwrap_or_default();
let team = labels.get("team").cloned().unwrap_or_default();
// Update service metrics
if let Some(service_metrics) = services_map.get_mut(&service) {
for metrics in &mut service_metrics.metrics {
if metrics.team == team {
metrics.lead_time_days = value;
}
}
}
// Update team metrics
if let Some(team_metrics) = teams_map.get_mut(&team) {
for metrics in team_metrics {
if metrics.service == service {
metrics.lead_time_days = value;
}
}
}
// Update org metrics
org_metrics.lead_time_days += value;
}
// Process MTTR
for (labels, value) in mttr {
let service = labels.get("service").cloned().unwrap_or_default();
let severity = labels.get("severity").cloned().unwrap_or_default();
// Only consider high severity incidents for MTTR
if severity != "high" {
continue;
}
// Find team for this service
let team = services_map
.get(&service)
.map(|s| s.team.clone())
.unwrap_or_default();
// Update service metrics
if let Some(service_metrics) = services_map.get_mut(&service) {
for metrics in &mut service_metrics.metrics {
metrics.mttr_hours = value;
}
}
// Update team metrics
if let Some(team_metrics) = teams_map.get_mut(&team) {
for metrics in team_metrics {
if metrics.service == service {
metrics.mttr_hours = value;
}
}
}
// Update org metrics
org_metrics.mttr_hours += value;
}
// Process change failure rate
for (labels, value) in change_failure_rate {
let service = labels.get("service").cloned().unwrap_or_default();
let team = labels.get("team").cloned().unwrap_or_default();
// Update service metrics
if let Some(service_metrics) = services_map.get_mut(&service) {
for metrics in &mut service_metrics.metrics {
if metrics.team == team {
metrics.change_failure_rate = value;
}
}
}
// Update team metrics
if let Some(team_metrics) = teams_map.get_mut(&team) {
for metrics in team_metrics {
if metrics.service == service {
metrics.change_failure_rate = value;
}
}
}
// Update org metrics
org_metrics.change_failure_rate += value;
}
// Calculate averages for org metrics
let service_count = services_map.len() as f64;
if service_count > 0.0 {
org_metrics.deployment_frequency /= service_count;
org_metrics.lead_time_days /= service_count;
org_metrics.mttr_hours /= service_count;
org_metrics.change_failure_rate /= service_count;
}
// Create team performance assessments
let mut teams = Vec::new();
for (team_name, team_metrics) in &teams_map {
// Calculate average metrics for team
let mut avg_metrics = DoraMetrics {
service: "team_average".to_string(),
team: team_name.clone(),
deployment_frequency: 0.0,
lead_time_days: 0.0,
mttr_hours: 0.0,
change_failure_rate: 0.0,
timestamp: Utc::now(),
};
let metric_count = team_metrics.len() as f64;
if metric_count > 0.0 {
for metrics in team_metrics {
avg_metrics.deployment_frequency += metrics.deployment_frequency;
avg_metrics.lead_time_days += metrics.lead_time_days;
avg_metrics.mttr_hours += metrics.mttr_hours;
avg_metrics.change_failure_rate += metrics.change_failure_rate;
}
avg_metrics.deployment_frequency /= metric_count;
avg_metrics.lead_time_days /= metric_count;
avg_metrics.mttr_hours /= metric_count;
avg_metrics.change_failure_rate /= metric_count;
}
// Determine performance level based on DORA metrics
let performance_level = determine_performance_level(&avg_metrics);
teams.push(TeamPerformance {
team: team_name.clone(),
performance_level,
metrics: avg_metrics,
});
}
// Create dashboard
let dashboard = Dashboard {
timestamp: Utc::now(),
services: services_map.into_values().collect(),
teams,
organization_metrics: org_metrics,
};
Ok(dashboard)
}
async fn query_prometheus(
client: &Client,
prometheus_url: &str,
query: &str,
) -> Result<Vec<(HashMap<String, String>, f64)>, Box<dyn std::error::Error>> {
let url = format!("{}/api/v1/query", prometheus_url);
let response = client
.get(&url)
.query(&[("query", query)])
.send()
.await?
.json::<serde_json::Value>()
.await?;
// Collect (labels, value) pairs; a label map cannot be used as a HashMap key, so use a Vec of tuples
let mut result = Vec::new();
if let Some(data) = response.get("data") {
if let Some(result_type) = data.get("resultType") {
if result_type == "vector" {
if let Some(results) = data.get("result").and_then(|r| r.as_array()) {
for item in results {
if let (Some(metric), Some(value)) = (item.get("metric"), item.get("value")) {
if let Some(metric_obj) = metric.as_object() {
let mut labels = HashMap::new();
for (k, v) in metric_obj {
if let Some(v_str) = v.as_str() {
labels.insert(k.clone(), v_str.to_string());
}
}
if let Some(value_arr) = value.as_array() {
if value_arr.len() >= 2 {
if let Some(value_str) = value_arr[1].as_str() {
if let Ok(value_f64) = value_str.parse::<f64>() {
result.push((labels, value_f64));
}
}
}
}
}
}
}
}
}
}
}
Ok(result)
}
fn determine_performance_level(metrics: &DoraMetrics) -> String {
// Based on DORA research
let mut score = 0;
// Deployment Frequency
if metrics.deployment_frequency >= 1.0 {
score += 3; // Multiple deploys per day: Elite
} else if metrics.deployment_frequency >= 0.14 {
score += 2; // Between once per day and once per week: High
} else if metrics.deployment_frequency >= 0.03 {
score += 1; // Between once per week and once per month: Medium
}
// Lead Time
if metrics.lead_time_days <= 1.0 {
score += 3; // Less than one day: Elite
} else if metrics.lead_time_days <= 7.0 {
score += 2; // Between one day and one week: High
} else if metrics.lead_time_days <= 30.0 {
score += 1; // Between one week and one month: Medium
}
// MTTR
if metrics.mttr_hours <= 1.0 {
score += 3; // Less than one hour: Elite
} else if metrics.mttr_hours <= 24.0 {
score += 2; // Less than one day: High
} else if metrics.mttr_hours <= 168.0 {
score += 1; // Less than one week: Medium
}
// Change Failure Rate
if metrics.change_failure_rate <= 15.0 {
score += 3; // 0-15%: Elite
} else if metrics.change_failure_rate <= 30.0 {
score += 2; // 16-30%: High
} else if metrics.change_failure_rate <= 45.0 {
score += 1; // 31-45%: Medium
}
// Determine level based on score
match score {
10..=12 => "Elite".to_string(),
7..=9 => "High".to_string(),
4..=6 => "Medium".to_string(),
_ => "Low".to_string(),
}
}
async fn get_dashboard(dashboard: web::Data<Arc<Mutex<Dashboard>>>) -> impl Responder {
let dashboard = dashboard.lock().await.clone();
HttpResponse::Ok().json(dashboard)
}
async fn get_services(dashboard: web::Data<Arc<Mutex<Dashboard>>>) -> impl Responder {
let dashboard = dashboard.lock().await.clone();
HttpResponse::Ok().json(dashboard.services)
}
async fn get_teams(dashboard: web::Data<Arc<Mutex<Dashboard>>>) -> impl Responder {
let dashboard = dashboard.lock().await.clone();
HttpResponse::Ok().json(dashboard.teams)
}
async fn get_service_metrics(
path: web::Path<String>,
dashboard: web::Data<Arc<Mutex<Dashboard>>>,
) -> impl Responder {
let service = path.into_inner();
let dashboard = dashboard.lock().await.clone();
for service_metrics in dashboard.services {
if service_metrics.service == service {
return HttpResponse::Ok().json(service_metrics);
}
}
HttpResponse::NotFound().body("Service not found")
}
async fn get_team_metrics(
path: web::Path<String>,
dashboard: web::Data<Arc<Mutex<Dashboard>>>,
) -> impl Responder {
let team = path.into_inner();
let dashboard = dashboard.lock().await.clone();
for team_performance in dashboard.teams {
if team_performance.team == team {
return HttpResponse::Ok().json(team_performance);
}
}
HttpResponse::NotFound().body("Team not found")
}
Lessons Learned:
DORA metrics must be carefully implemented to accurately reflect delivery performance.
How to Avoid:
Define clear and consistent metrics definitions aligned with industry standards.
Ensure metrics capture the entire software delivery lifecycle.
Validate metrics implementation against actual team experiences.
Include context and severity in metrics to avoid misleading conclusions.
Regularly review and refine metrics implementation as processes evolve.
No summary provided
What Happened:
A large enterprise implemented DORA metrics to measure DevOps performance. After six months, the metrics showed excellent performance, suggesting an "Elite" level according to DORA research. However, teams still experienced significant delivery challenges and customer complaints. Leadership was confused by the disconnect between the positive metrics and the negative reality.
Diagnosis Steps:
Analyzed how each DORA metric was being calculated and collected.
Reviewed the data sources and collection methods for each metric.
Compared metric definitions with industry standards.
Interviewed teams about their actual deployment processes.
Conducted a manual audit of recent incidents and deployments.
Root Cause:
The investigation revealed multiple issues with the metrics implementation: 1. Deployment Frequency was counting all pipeline runs, not just successful production deployments 2. Lead Time for Changes was only measuring the time from commit to build, not to production deployment 3. Time to Restore Service was only counting officially declared incidents, missing many smaller outages 4. Change Failure Rate was only counting rollbacks, not all types of deployment-related failures 5. The metrics dashboard lacked context and proper statistical analysis
Fix/Workaround:
• Implemented correct metrics calculations in Python and Go (see the sketch after this list)
• Created a comprehensive DORA metrics dashboard in Grafana
• Added percentile-based reporting instead of just averages
• Implemented proper data collection from all relevant sources
• Established clear definitions and documentation for each metric
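• As an illustration of the corrected calculations (a minimal Go sketch with assumed inputs, not the team's actual code), the snippet below measures lead time from first commit to production deployment and reports percentiles rather than a mean; the Deployment struct and its fields are hypothetical:
// leadtime_sketch.go - illustrative only; struct fields and inputs are assumptions
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// Deployment pairs a change's first commit time with its production deployment time.
type Deployment struct {
	FirstCommit    time.Time
	DeployedToProd time.Time
}

// leadTimePercentile returns the pct-th percentile (0-100) of lead times, in hours.
func leadTimePercentile(deploys []Deployment, pct float64) float64 {
	if len(deploys) == 0 {
		return 0
	}
	hours := make([]float64, 0, len(deploys))
	for _, d := range deploys {
		hours = append(hours, d.DeployedToProd.Sub(d.FirstCommit).Hours())
	}
	sort.Float64s(hours)
	idx := int(math.Ceil(pct/100*float64(len(hours)))) - 1
	if idx < 0 {
		idx = 0
	}
	return hours[idx]
}

func main() {
	now := time.Now()
	deploys := []Deployment{
		{FirstCommit: now.Add(-4 * time.Hour), DeployedToProd: now},
		{FirstCommit: now.Add(-48 * time.Hour), DeployedToProd: now},
		{FirstCommit: now.Add(-240 * time.Hour), DeployedToProd: now},
	}
	// Report the median and the long tail, not just an average.
	fmt.Printf("p50 lead time: %.1fh, p90 lead time: %.1fh\n",
		leadTimePercentile(deploys, 50), leadTimePercentile(deploys, 90))
}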
Lessons Learned:
Metrics implementation requires careful attention to definitions and data sources to provide accurate insights.
How to Avoid:
Follow industry standard definitions for DevOps metrics.
Validate metrics implementation with real-world observations.
Include statistical context like percentiles, not just averages.
Ensure comprehensive data collection from all relevant sources.
Regularly audit and validate metrics against actual performance.
No summary provided
What Happened:
A product team implemented SLOs for their microservices using Prometheus and Grafana. They set targets for response time, error rate, and availability. Despite the monitoring showing all services meeting their SLOs, users reported frequent slowness and timeouts. The disconnect between monitoring and user experience created confusion and tension between the development and operations teams.
Diagnosis Steps:
Analyzed the implementation of SLO metrics in Prometheus.
Reviewed the query expressions used in dashboards and alerts.
Compared monitoring data with actual user experience reports.
Conducted load testing to reproduce the reported issues.
Analyzed raw metrics data to identify patterns.
Root Cause:
The investigation revealed that the team was using mean (average) values for response time metrics instead of percentiles. This approach masked the "long tail" of slow responses that significantly impacted user experience. While the average response time remained within acceptable limits, a substantial percentage of requests were experiencing much longer response times. For example, if 95% of requests complete in 100 ms and 5% take 3 seconds, the mean is only about 245 ms even though one in twenty users waits several seconds.
Fix/Workaround:
• Implemented proper percentile-based SLOs using Prometheus histograms:
# prometheus.yml - Updated scrape config with histogram buckets
scrape_configs:
- job_name: 'api-service'
metrics_path: '/metrics'
scrape_interval: 15s
static_configs:
- targets: ['api-service:8080']
metric_relabel_configs:
- source_labels: [__name__]
regex: 'http_request_duration_seconds_bucket'
action: keep
• Created PromQL queries for percentile-based SLOs:
# 95th percentile response time for the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api-service"}[5m])) by (le))
# Success rate (inverted error rate) as a percentage
sum(rate(http_requests_total{job="api-service",status_code=~"2.."}[5m])) / sum(rate(http_requests_total{job="api-service"}[5m])) * 100
# Availability as percentage of successful probes
sum(probe_success{job="blackbox",target=~"https://api.example.com/.*"}) / count(probe_success{job="blackbox",target=~"https://api.example.com/.*"}) * 100
• Implemented multi-window, multi-burn-rate alerts in Prometheus:
# prometheus-rules.yml - SLO alert rules
groups:
- name: slo-alerts
rules:
- record: job:http_request_duration_seconds:99percentile
expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="api-service"}[5m])) by (le))
- record: job:http_request_success_ratio
expr: sum(rate(http_requests_total{job="api-service",status_code=~"2.."}[5m])) / sum(rate(http_requests_total{job="api-service"}[5m]))
- alert: HighLatency
expr: job:http_request_duration_seconds:99percentile > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "99th percentile latency is above 500ms for 5 minutes"
- alert: HighLatencySevere
expr: job:http_request_duration_seconds:99percentile > 1
for: 1m
labels:
severity: critical
annotations:
summary: "Severe latency detected"
description: "99th percentile latency is above 1s for 1 minute"
- alert: ErrorBudgetBurn
expr: |
(
job:http_request_success_ratio < 0.99 and
job:http_request_success_ratio offset 1h >= 0.99
)
for: 5m
labels:
severity: warning
annotations:
summary: "Error budget burning too fast"
description: "Success ratio dropped below 99% in the last hour"
• Created a Go-based SLO monitoring library for consistent implementation:
// slo/metrics.go
package slo
import (
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
)
// SLOMetrics holds the Prometheus metrics for SLO monitoring
type SLOMetrics struct {
RequestDuration *prometheus.HistogramVec
RequestTotal *prometheus.CounterVec
ErrorTotal *prometheus.CounterVec
}
// NewSLOMetrics creates a new SLOMetrics instance with proper histogram buckets
func NewSLOMetrics(namespace, subsystem string) *SLOMetrics {
// Define buckets appropriate for SLO monitoring
// These buckets cover from 5ms to 10s with concentration around SLO targets
buckets := []float64{0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10}
return &SLOMetrics{
RequestDuration: promauto.NewHistogramVec(
prometheus.HistogramOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "request_duration_seconds",
Help: "Request duration in seconds",
Buckets: buckets,
},
[]string{"handler", "method", "status"},
),
RequestTotal: promauto.NewCounterVec(
prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "requests_total",
Help: "Total number of requests",
},
[]string{"handler", "method", "status"},
),
ErrorTotal: promauto.NewCounterVec(
prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "errors_total",
Help: "Total number of errors",
},
[]string{"handler", "method", "error_type"},
),
}
}
// ObserveRequest records metrics for a single request
func (m *SLOMetrics) ObserveRequest(handler, method, status string, duration time.Duration, err error) {
// Record request duration
m.RequestDuration.WithLabelValues(handler, method, status).Observe(duration.Seconds())
// Increment request counter
m.RequestTotal.WithLabelValues(handler, method, status).Inc()
// If there was an error, record it
if err != nil {
errorType := "unknown"
switch err.(type) {
case *TimeoutError:
errorType = "timeout"
case *ValidationError:
errorType = "validation"
case *AuthorizationError:
errorType = "authorization"
case *DatabaseError:
errorType = "database"
}
m.ErrorTotal.WithLabelValues(handler, method, errorType).Inc()
}
}
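• The error types matched in ObserveRequest above are assumed to be defined elsewhere in the slo package; a minimal sketch of what those definitions might look like:
// slo/errors.go - illustrative definitions for the error types assumed above
package slo

// TimeoutError indicates a request exceeded its deadline.
type TimeoutError struct{ Msg string }

func (e *TimeoutError) Error() string { return e.Msg }

// ValidationError indicates invalid input.
type ValidationError struct{ Msg string }

func (e *ValidationError) Error() string { return e.Msg }

// AuthorizationError indicates a rejected credential or missing permission.
type AuthorizationError struct{ Msg string }

func (e *AuthorizationError) Error() string { return e.Msg }

// DatabaseError indicates a failure in the storage layer.
type DatabaseError struct{ Msg string }

func (e *DatabaseError) Error() string { return e.Msg }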
• Developed a Grafana dashboard with percentile-based SLOs:
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 0,
"id": 1,
"links": [],
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"hiddenSeries": false,
"id": 2,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.3.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m])) by (le))",
"interval": "",
"legendFormat": "p50",
"refId": "A"
},
{
"expr": "histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m])) by (le))",
"interval": "",
"legendFormat": "p90",
"refId": "B"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m])) by (le))",
"interval": "",
"legendFormat": "p95",
"refId": "C"
},
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m])) by (le))",
"interval": "",
"legendFormat": "p99",
"refId": "D"
}
],
"thresholds": [
{
"colorMode": "critical",
"fill": true,
"line": true,
"op": "gt",
"value": 0.5,
"yaxis": "left"
}
],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Response Time Percentiles",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "s",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"hiddenSeries": false,
"id": 4,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.3.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(http_requests_total{job=\"api-service\",status_code=~\"2..\"}[5m])) / sum(rate(http_requests_total{job=\"api-service\"}[5m])) * 100",
"interval": "",
"legendFormat": "Success Rate",
"refId": "A"
}
],
"thresholds": [
{
"colorMode": "critical",
"fill": true,
"line": true,
"op": "lt",
"value": 99,
"yaxis": "left"
}
],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Success Rate",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "percent",
"label": null,
"logBase": 1,
"max": "100",
"min": "95",
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"refresh": "5s",
"schemaVersion": 26,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "Service SLOs",
"uid": "slo-dashboard",
"version": 1
}
Lessons Learned:
Mean values can hide significant performance issues that impact user experience.
How to Avoid:
Always use percentiles (p95, p99) rather than averages for latency metrics.
Implement proper histogram buckets in Prometheus for accurate percentile calculation.
Consider multi-window, multi-burn-rate alerting for SLOs.
Validate metrics against actual user experience.
Include both technical and user-centric metrics in SLOs.
No summary provided
What Happened:
A product team set up service level objectives (SLOs) based on mean response time metrics. While the mean response time consistently met the target of 200ms, customer complaints about slow performance continued to increase. Investigation revealed that while the mean response time was within acceptable limits, the 95th and 99th percentile response times were significantly higher, indicating that a substantial portion of users were experiencing poor performance.
Diagnosis Steps:
Analyzed detailed response time distributions beyond mean values.
Compared mean, median, 90th, 95th, and 99th percentile metrics.
Segmented performance data by user type, region, and request type.
Reviewed correlation between customer complaints and traffic patterns.
Examined outlier response times and their impact on the mean.
Root Cause:
The mean response time was being skewed by a large number of fast, cached responses, hiding the impact of slow outliers that were affecting real user experience.
Fix/Workaround:
• Implemented percentile-based SLOs instead of mean-based metrics
• Added multi-dimensional monitoring for different request types
• Created separate dashboards for different user segments
• Implemented automated alerting on percentile thresholds
• Developed a user experience score combining multiple metrics
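• A possible shape for the combined user experience score is sketched below in Go; the weights and thresholds are illustrative assumptions, not the team's actual formula:
// uxscore_sketch.go - illustrative composite score; weights and thresholds are assumptions
package main

import "fmt"

// UXInputs captures the per-segment signals fed into the score.
type UXInputs struct {
	P95LatencyMs    float64 // 95th percentile response time in milliseconds
	ErrorRatePct    float64 // percentage of failed requests
	AvailabilityPct float64 // percentage of successful probes
}

// uxScore maps each signal to a 0-1 value and blends them into a 0-100 score.
func uxScore(in UXInputs) float64 {
	latencyScore := clamp(1 - (in.P95LatencyMs-200)/800) // assumed 200ms target, 1000ms floor
	errorScore := clamp(1 - in.ErrorRatePct/5)           // assumed: 5% error rate scores zero
	availScore := clamp(in.AvailabilityPct - 99)         // assumed: 99% scores zero, 100% scores one
	return 100 * (0.4*latencyScore + 0.3*errorScore + 0.3*availScore)
}

func clamp(v float64) float64 {
	if v < 0 {
		return 0
	}
	if v > 1 {
		return 1
	}
	return v
}

func main() {
	score := uxScore(UXInputs{P95LatencyMs: 450, ErrorRatePct: 1.2, AvailabilityPct: 99.8})
	fmt.Printf("user experience score: %.1f/100\n", score)
}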
Lessons Learned:
Mean values can hide significant performance issues; percentiles provide better visibility into actual user experience.
How to Avoid:
Use percentile-based metrics (p50, p90, p95, p99) for performance SLOs.
Implement multi-dimensional monitoring for different user segments.
Correlate metrics with actual user experience and feedback.
Consider the distribution of values, not just aggregate statistics.
Regularly review and update metrics to ensure they reflect user experience.
No summary provided
What Happened:
A technology company wanted to improve its deployment frequency as a key DevOps metric. They implemented a metrics dashboard to track deployments across teams, but the reported numbers were inconsistent and unreliable. Some teams showed impossibly high deployment counts, while others showed none despite known deployments. Leadership couldn't use the data for decision-making, and improvement initiatives were stalled due to the lack of reliable baseline metrics.
Diagnosis Steps:
Analyzed how deployments were defined and tracked across teams.
Reviewed CI/CD pipeline configurations and deployment processes.
Examined the metrics collection and reporting mechanisms.
Interviewed teams about their deployment practices.
Compared manual deployment records with automated tracking.
Root Cause:
The investigation revealed multiple issues with deployment tracking: 1. Inconsistent definition of "deployment" across teams (some counted feature flags, others only production releases) 2. Multiple deployment pipelines with different tracking mechanisms 3. Some teams using manual deployments not captured by automated tracking 4. Metrics collection relying on inconsistent event triggers 5. No standardized deployment tagging or labeling system
Fix/Workaround:
• Implemented a standardized definition of deployment across teams
• Created consistent deployment event tracking in all pipelines (see the sketch after this list)
• Developed a unified metrics collection framework
• Implemented deployment tagging and correlation
• Established data validation processes for metrics
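• One way to standardize tracking is for every pipeline, automated or manual, to emit the same deployment event shape tagged with service, team, environment, and change identifiers; the Go struct below is an illustrative sketch with assumed field names:
// deployment_event_sketch.go - illustrative standardized deployment event
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// DeploymentEvent is the single shape every pipeline emits, regardless of tooling.
type DeploymentEvent struct {
	Service     string    `json:"service"`
	Team        string    `json:"team"`
	Environment string    `json:"environment"` // only "production" counts toward deployment frequency
	ChangeID    string    `json:"change_id"`   // commit SHA, release tag, or ticket reference
	Pipeline    string    `json:"pipeline"`    // jenkins, github-actions, manual, ...
	Successful  bool      `json:"successful"`
	Timestamp   time.Time `json:"timestamp"`
}

func main() {
	event := DeploymentEvent{
		Service:     "api-gateway",
		Team:        "payments",
		Environment: "production",
		ChangeID:    "a1b2c3d",
		Pipeline:    "jenkins",
		Successful:  true,
		Timestamp:   time.Now().UTC(),
	}
	// In practice this would be sent to the metrics collector; here we just print it.
	payload, _ := json.Marshal(event)
	fmt.Println(string(payload))
}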
Lessons Learned:
Effective metrics require standardized definitions and consistent collection mechanisms.
How to Avoid:
Establish clear, organization-wide definitions for key metrics.
Implement consistent tracking mechanisms across all deployment pipelines.
Create a unified metrics collection framework with data validation.
Regularly audit and validate metrics data against known activities.
Involve teams in metrics definition to ensure buy-in and accuracy.
No summary provided
What Happened:
A large enterprise implemented DORA metrics to measure their DevOps performance and set targets for improvement. After six months, while the metrics showed significant improvement, actual delivery speed and quality had not improved and in some cases had deteriorated. Teams were gaming the metrics by breaking changes into tiny deployments, marking incidents as "resolved" prematurely, and avoiding risky but necessary changes. Leadership was confused by the disconnect between the positive metrics and the negative feedback from customers and engineers.
Diagnosis Steps:
Analyzed how metrics were being collected and calculated.
Interviewed teams about how they were responding to metric targets.
Compared metric definitions with industry standards.
Reviewed actual incidents and deployment data.
Examined the correlation between metrics and business outcomes.
Root Cause:
The investigation revealed multiple issues with the metrics implementation: 1. Deployment Frequency counted any deployment regardless of size, encouraging teams to deploy trivial changes 2. Lead Time measurement started at code commit rather than when work was initiated 3. MTTR was calculated based on when incidents were marked as "resolved" not when they were actually fixed 4. Change Failure Rate only counted failures that triggered formal incidents, missing many quality issues 5. Teams were being evaluated primarily on these metrics without context
Fix/Workaround:
• Implemented a revised metrics framework with the following improvements:
• Redefined metrics to align with their intended purpose
• Added context and supplementary metrics to provide a more complete picture
• Implemented automated collection to reduce manual manipulation
• Created a balanced scorecard approach rather than focusing on individual metrics
• Educated leadership on proper interpretation of the metrics
// TypeScript implementation of improved DORA metrics collection
// File: improved-dora-metrics.ts
import { PullRequest, Deployment, Incident, WorkItem } from './types';
export class DORAMetricsCalculator {
// Deployment Frequency - now weighted by deployment size and impact
calculateDeploymentFrequency(
deployments: Deployment[],
startDate: Date,
endDate: Date
): { frequency: number; weightedFrequency: number } {
const daysDiff = this.daysBetween(startDate, endDate);
const deploymentsInPeriod = deployments.filter(
d => d.timestamp >= startDate && d.timestamp <= endDate
);
// Basic frequency
const frequency = deploymentsInPeriod.length / daysDiff;
// Weighted frequency based on deployment size and impact
const totalWeight = deploymentsInPeriod.reduce(
(sum, d) => sum + this.calculateDeploymentWeight(d), 0
);
const weightedFrequency = totalWeight / daysDiff;
return { frequency, weightedFrequency };
}
// Helper to calculate deployment weight based on size and impact
private calculateDeploymentWeight(deployment: Deployment): number {
// Base weight
let weight = 1;
// Adjust based on lines of code changed
if (deployment.linesChanged > 1000) weight *= 2;
if (deployment.linesChanged > 5000) weight *= 1.5;
// Adjust based on number of services affected
weight *= (1 + (deployment.servicesAffected.length * 0.2));
// Adjust based on risk level
switch (deployment.riskLevel) {
case 'high': weight *= 3; break;
case 'medium': weight *= 2; break;
case 'low': weight *= 1; break;
}
return weight;
}
// Lead Time - now measured from work item creation, not just code commit
calculateLeadTime(
workItems: WorkItem[],
startDate: Date,
endDate: Date
): { meanLeadTime: number; medianLeadTime: number; p90LeadTime: number } {
const completedItems = workItems.filter(
wi => wi.completionDate >= startDate &&
wi.completionDate <= endDate &&
wi.status === 'completed'
);
if (completedItems.length === 0) {
return { meanLeadTime: 0, medianLeadTime: 0, p90LeadTime: 0 };
}
// Calculate lead times in hours
const leadTimes = completedItems.map(wi => {
// From work item creation to deployment
const creationToDeployment = this.hoursBetween(
wi.creationDate,
wi.deploymentDate || wi.completionDate
);
return creationToDeployment;
});
// Sort lead times for percentile calculations
leadTimes.sort((a, b) => a - b);
return {
meanLeadTime: this.calculateMean(leadTimes),
medianLeadTime: this.calculateMedian(leadTimes),
p90LeadTime: this.calculatePercentile(leadTimes, 90)
};
}
// MTTR - now verified by automated tests, not just status changes
calculateMTTR(
incidents: Incident[],
startDate: Date,
endDate: Date
): { mttr: number; verifiedMttr: number } {
const resolvedIncidents = incidents.filter(
i => i.resolvedDate >= startDate &&
i.resolvedDate <= endDate &&
i.status === 'resolved'
);
if (resolvedIncidents.length === 0) {
return { mttr: 0, verifiedMttr: 0 };
}
// Calculate traditional MTTR (hours)
const repairTimes = resolvedIncidents.map(i =>
this.hoursBetween(i.detectedDate, i.resolvedDate)
);
// Calculate verified MTTR (only count incidents with verification)
const verifiedIncidents = resolvedIncidents.filter(i => i.verificationPassed);
const verifiedRepairTimes = verifiedIncidents.map(i =>
this.hoursBetween(i.detectedDate, i.verifiedDate || i.resolvedDate)
);
return {
mttr: this.calculateMean(repairTimes),
verifiedMttr: verifiedIncidents.length > 0
? this.calculateMean(verifiedRepairTimes)
: 0
};
}
// Change Failure Rate - now includes quality issues, not just incidents
calculateChangeFailureRate(
deployments: Deployment[],
startDate: Date,
endDate: Date
): { traditionalCFR: number; enhancedCFR: number } {
const deploymentsInPeriod = deployments.filter(
d => d.timestamp >= startDate && d.timestamp <= endDate
);
if (deploymentsInPeriod.length === 0) {
return { traditionalCFR: 0, enhancedCFR: 0 };
}
// Traditional CFR - only counts deployments that caused incidents
const failedDeployments = deploymentsInPeriod.filter(d => d.causedIncident);
const traditionalCFR = failedDeployments.length / deploymentsInPeriod.length;
// Enhanced CFR - includes quality issues and customer-reported problems
const problematicDeployments = deploymentsInPeriod.filter(d =>
d.causedIncident ||
d.qualityIssuesCount > 0 ||
d.customerReportedProblems > 0
);
const enhancedCFR = problematicDeployments.length / deploymentsInPeriod.length;
return { traditionalCFR, enhancedCFR };
}
// Helper methods
private daysBetween(start: Date, end: Date): number {
return (end.getTime() - start.getTime()) / (1000 * 60 * 60 * 24);
}
private hoursBetween(start: Date, end: Date): number {
return (end.getTime() - start.getTime()) / (1000 * 60 * 60);
}
private calculateMean(values: number[]): number {
return values.reduce((sum, val) => sum + val, 0) / values.length;
}
private calculateMedian(sortedValues: number[]): number {
const mid = Math.floor(sortedValues.length / 2);
return sortedValues.length % 2 === 0
? (sortedValues[mid - 1] + sortedValues[mid]) / 2
: sortedValues[mid];
}
private calculatePercentile(sortedValues: number[], percentile: number): number {
const index = Math.ceil((percentile / 100) * sortedValues.length) - 1;
return sortedValues[Math.max(0, Math.min(index, sortedValues.length - 1))];
}
}
# Balanced Scorecard Configuration
# File: metrics-scorecard.yaml
scorecard:
name: "DevOps Performance Scorecard"
description: "A balanced view of delivery performance across multiple dimensions"
categories:
- name: "Delivery Speed"
weight: 0.25
metrics:
- id: "deployment_frequency"
name: "Deployment Frequency"
weight: 0.3
target: "Daily"
warning_threshold: "Weekly"
danger_threshold: "Monthly"
- id: "lead_time"
name: "Lead Time for Changes"
weight: 0.4
target: "< 1 day"
warning_threshold: "< 1 week"
danger_threshold: "> 1 month"
- id: "cycle_time"
name: "Cycle Time"
weight: 0.3
target: "< 3 days"
warning_threshold: "< 2 weeks"
danger_threshold: "> 1 month"
- name: "Reliability"
weight: 0.25
metrics:
- id: "mttr"
name: "Mean Time to Recovery"
weight: 0.4
target: "< 1 hour"
warning_threshold: "< 1 day"
danger_threshold: "> 1 week"
- id: "change_failure_rate"
name: "Change Failure Rate"
weight: 0.3
target: "< 5%"
warning_threshold: "< 15%"
danger_threshold: "> 30%"
- id: "availability"
name: "Service Availability"
weight: 0.3
target: "> 99.9%"
warning_threshold: "> 99.5%"
danger_threshold: "< 99%"
- name: "Quality"
weight: 0.25
metrics:
- id: "defect_density"
name: "Defect Density"
weight: 0.3
target: "< 0.1 per 100 LOC"
warning_threshold: "< 0.5 per 100 LOC"
danger_threshold: "> 1 per 100 LOC"
- id: "test_coverage"
name: "Test Coverage"
weight: 0.3
target: "> 80%"
warning_threshold: "> 60%"
danger_threshold: "< 40%"
- id: "technical_debt"
name: "Technical Debt Ratio"
weight: 0.4
target: "< 5%"
warning_threshold: "< 15%"
danger_threshold: "> 25%"
- name: "Culture & Learning"
weight: 0.25
metrics:
- id: "blameless_postmortems"
name: "Blameless Postmortems Completed"
weight: 0.3
target: "100%"
warning_threshold: "> 80%"
danger_threshold: "< 60%"
- id: "learning_from_failures"
name: "Improvements Implemented from Incidents"
weight: 0.4
target: "> 3 per incident"
warning_threshold: "> 1 per incident"
danger_threshold: "< 1 per incident"
- id: "team_satisfaction"
name: "Team Satisfaction Score"
weight: 0.3
target: "> 8/10"
warning_threshold: "> 6/10"
danger_threshold: "< 5/10"
visualization:
dashboard_refresh_rate: "daily"
trend_period: "13 weeks"
show_targets: true
show_thresholds: true
enable_drill_down: true
data_collection:
automated_sources:
- source: "github"
metrics: ["deployment_frequency", "lead_time", "cycle_time"]
- source: "jenkins"
metrics: ["deployment_frequency", "change_failure_rate"]
- source: "jira"
metrics: ["lead_time", "cycle_time", "defect_density"]
- source: "sonarqube"
metrics: ["test_coverage", "technical_debt"]
- source: "pagerduty"
metrics: ["mttr"]
- source: "prometheus"
metrics: ["availability"]
manual_sources:
- source: "team_surveys"
metrics: ["team_satisfaction", "learning_from_failures"]
- source: "incident_reviews"
metrics: ["blameless_postmortems", "learning_from_failures"]
Lessons Learned:
DevOps metrics must be carefully designed to drive the right behaviors and provide an accurate picture of performance.
How to Avoid:
Implement metrics that measure outcomes, not just activities.
Use multiple metrics to provide a balanced view of performance.
Automate metrics collection to reduce manipulation.
Educate teams on the purpose and proper use of metrics.
Regularly review and refine metrics to ensure they drive the right behaviors.
No summary provided
What Happened:
A large financial services company implemented DORA metrics to measure their DevOps performance and drive improvements. After six months, they were puzzled by contradictory results - their metrics showed excellent performance, but teams were still experiencing frequent production issues and customer complaints. An audit revealed that the metrics implementation had fundamental flaws that made the data misleading. Deployment frequency counted test deployments, lead time only measured code review to deployment (not idea to production), MTTR calculations excluded certain types of incidents, and change failure rate didn't account for all types of failures.
Diagnosis Steps:
Analyzed how each DORA metric was being calculated.
Compared metric definitions with industry standards.
Reviewed data collection methods and sources.
Interviewed teams about their actual experiences.
Conducted a gap analysis between reported metrics and observed outcomes.
Root Cause:
The investigation revealed multiple issues with the metrics implementation: 1. Metrics were defined without clear understanding of their purpose 2. Data collection was incomplete and inconsistent across teams 3. Some metrics were implemented to "game the system" rather than drive improvement 4. There was no validation process to ensure metrics reflected reality 5. Teams were incentivized to improve metrics rather than actual performance
Fix/Workaround:
• Implemented a revised metrics framework with clear definitions
• Established consistent data collection methods across teams
• Created validation processes to ensure metrics reflected reality (see the sketch after this list)
• Aligned incentives with actual performance outcomes
• Implemented regular reviews of metrics effectiveness
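• One form the validation process could take is an automated cross-check of dashboard counts against an independent source such as release records, flagging large discrepancies for review; the Go sketch below uses assumed names and a hypothetical tolerance:
// metrics_validation_sketch.go - illustrative cross-check of reported vs. observed counts
package main

import (
	"fmt"
	"math"
)

// ValidationResult records how far a reported metric drifts from an independent source.
type ValidationResult struct {
	Service     string
	Reported    int
	Independent int
	DriftPct    float64
	NeedsReview bool
}

// validateDeploymentCounts compares dashboard counts against release records.
func validateDeploymentCounts(reported, independent map[string]int, tolerancePct float64) []ValidationResult {
	results := []ValidationResult{}
	for service, rep := range reported {
		ind := independent[service]
		drift := 0.0
		if ind > 0 {
			drift = math.Abs(float64(rep-ind)) / float64(ind) * 100
		} else if rep > 0 {
			drift = 100
		}
		results = append(results, ValidationResult{
			Service:     service,
			Reported:    rep,
			Independent: ind,
			DriftPct:    drift,
			NeedsReview: drift > tolerancePct,
		})
	}
	return results
}

func main() {
	reported := map[string]int{"api-gateway": 42, "billing": 3}
	releases := map[string]int{"api-gateway": 40, "billing": 12}
	for _, r := range validateDeploymentCounts(reported, releases, 10) {
		fmt.Printf("%s: reported=%d independent=%d drift=%.1f%% review=%v\n",
			r.Service, r.Reported, r.Independent, r.DriftPct, r.NeedsReview)
	}
}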
Lessons Learned:
DevOps metrics are only valuable if they accurately reflect reality and drive the right behaviors.
How to Avoid:
Define metrics with clear purpose and alignment to business outcomes.
Implement consistent data collection methods across teams.
Validate metrics against observed reality.
Regularly review and refine metrics definitions.
Avoid incentivizing metrics improvement without corresponding performance improvement.