# Infrastructure Monitoring Scenarios
No summary provided
What Happened:
The monitoring system suddenly stopped collecting metrics, and the Prometheus UI became unresponsive. Alerts were not firing despite known issues in the environment, creating a blind spot for operations.
Diagnosis Steps:
Checked Prometheus pod status and logs.
Examined storage volume usage and permissions.
Analyzed recent configuration changes.
Reviewed Kubernetes events related to the monitoring namespace.
Inspected Prometheus data directory for corruption.
Root Cause:
The Prometheus server experienced storage corruption due to an unclean shutdown during a node failure. The WAL (Write-Ahead Log) was corrupted, preventing Prometheus from starting properly. Additionally, the storage volume was undersized, leading to frequent disk pressure that exacerbated the issue.
Fix/Workaround:
• Short-term: Restored Prometheus by clearing the corrupted data:
# Access the Prometheus pod
kubectl exec -it prometheus-server-0 -n monitoring -- /bin/sh
# Backup corrupted data (optional)
tar -czf /tmp/prometheus-data-backup.tar.gz /prometheus
# Clear corrupted WAL files (recent samples not yet compacted into TSDB blocks will be lost)
rm -rf /prometheus/wal/*
# Restart Prometheus
exit
kubectl rollout restart statefulset/prometheus-server -n monitoring
• Long-term: Implemented proper storage management and resilience:
# Prometheus StatefulSet with proper storage configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus-server
namespace: monitoring
labels:
app: prometheus
component: server
spec:
serviceName: prometheus-server
replicas: 2
selector:
matchLabels:
app: prometheus
component: server
template:
metadata:
labels:
app: prometheus
component: server
spec:
serviceAccountName: prometheus
securityContext:
fsGroup: 65534
runAsUser: 65534
runAsNonRoot: true
containers:
- name: prometheus
image: prom/prometheus:v2.35.0
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=15d"
- "--storage.tsdb.wal-compression"
- "--storage.tsdb.allow-overlapping-blocks"
- "--web.console.libraries=/etc/prometheus/console_libraries"
- "--web.console.templates=/etc/prometheus/consoles"
- "--web.enable-lifecycle"
ports:
- containerPort: 9090
readinessProbe:
httpGet:
path: /-/ready
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 30
livenessProbe:
httpGet:
path: /-/healthy
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 30
resources:
requests:
cpu: 500m
memory: 2Gi
limits:
cpu: 1
memory: 4Gi
volumeMounts:
- name: prometheus-config
mountPath: /etc/prometheus
- name: prometheus-storage
mountPath: /prometheus
volumes:
- name: prometheus-config
configMap:
name: prometheus-config
volumeClaimTemplates:
- metadata:
name: prometheus-storage
spec:
storageClassName: ssd
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
• Added automated backup and recovery:
#!/bin/bash
# prometheus_backup.sh
set -euo pipefail
BACKUP_DIR="/backups/prometheus"
PROMETHEUS_DATA="/prometheus"
RETENTION_DAYS=7
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/prometheus-backup-${TIMESTAMP}.tar.gz"
# Ensure backup directory exists
mkdir -p ${BACKUP_DIR}
# Create snapshot via the TSDB admin API (requires --web.enable-admin-api)
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
SNAPSHOT_DIR=$(ls -td ${PROMETHEUS_DATA}/snapshots/* | head -n 1)
# Backup the snapshot
tar -czf ${BACKUP_FILE} -C ${SNAPSHOT_DIR} .
# Clean up old snapshots
find ${PROMETHEUS_DATA}/snapshots -type d -mtime +1 -exec rm -rf {} \; 2>/dev/null || true
# Clean up old backups
find ${BACKUP_DIR} -name "prometheus-backup-*.tar.gz" -type f -mtime +${RETENTION_DAYS} -delete
echo "Backup completed: ${BACKUP_FILE}"
• Implemented monitoring for the monitoring system:
# Prometheus alert rules for monitoring Prometheus itself
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: prometheus-self-monitoring
namespace: monitoring
spec:
groups:
- name: prometheus.rules
rules:
- alert: PrometheusStorageAlmostFull
expr: (kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"prometheus-storage-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"prometheus-storage-.*"}) * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Prometheus storage is almost full"
description: "Prometheus storage is {{ $value }}% full. Consider increasing storage or reducing retention period."
- alert: PrometheusWALCorruption
expr: rate(prometheus_tsdb_wal_corruptions_total[5m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus WAL corruption detected"
description: "Prometheus {{ $labels.instance }} has detected WAL corruption."
- alert: PrometheusTooManyRestarts
expr: changes(process_start_time_seconds{job="prometheus"}[1h]) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus restarting too frequently"
description: "Prometheus {{ $labels.instance }} has restarted {{ $value }} times in the last hour."
Lessons Learned:
Monitoring systems require their own monitoring and proper storage management.
How to Avoid:
Implement proper storage sizing and monitoring.
Configure WAL compression to reduce storage pressure.
Set up regular backups of Prometheus data.
Use Thanos or Cortex for long-term storage and high availability.
Monitor the monitoring system itself, for example with an external watchdog (see the sketch below).
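A lightweight way to apply the last point is an external watchdog that does not depend on the Prometheus instance it is checking. The Python sketch below is illustrative only: the Prometheus URL, webhook endpoint, and thresholds are placeholders, and it simply probes the health endpoint and a couple of self-metrics (WAL corruptions, failed compactions) from outside the cluster.
#!/usr/bin/env python3
# prometheus_watchdog.py - minimal external health check (illustrative sketch)
# Assumptions: PROM_URL and WEBHOOK_URL are placeholders for the real environment.
import sys
import requests

PROM_URL = "http://prometheus.example.com:9090"        # placeholder
WEBHOOK_URL = "https://hooks.example.com/monitoring"   # placeholder

CHECKS = {
    # PromQL expression -> value that should not be exceeded
    "increase(prometheus_tsdb_wal_corruptions_total[15m])": 0,
    "increase(prometheus_tsdb_compactions_failed_total[15m])": 0,
}

def notify(message: str) -> None:
    """Send an out-of-band notification so failures are visible even if Prometheus is down."""
    try:
        requests.post(WEBHOOK_URL, json={"text": message}, timeout=5)
    except requests.RequestException:
        print(f"ALERT (webhook unreachable): {message}", file=sys.stderr)

def main() -> int:
    failures = []
    # 1. Is the server up at all?
    try:
        requests.get(f"{PROM_URL}/-/healthy", timeout=5).raise_for_status()
    except requests.RequestException as exc:
        failures.append(f"health endpoint unreachable: {exc}")
    else:
        # 2. Query self-metrics only if the server responded.
        for expr, threshold in CHECKS.items():
            resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
            data = resp.json()
            if data.get("status") != "success":
                failures.append(f"query failed: {expr}")
                continue
            for sample in data["data"]["result"]:
                if float(sample["value"][1]) > threshold:
                    failures.append(f"{expr} = {sample['value'][1]} (threshold {threshold})")
    if failures:
        notify("Prometheus watchdog: " + "; ".join(failures))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())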
No summary provided
What Happened:
Operations teams reported that the monitoring dashboards had become increasingly difficult to use during incident response. Dashboards were loading slowly, contained too much information, and critical metrics were buried among less important ones.
Diagnosis Steps:
Analyzed dashboard loading times and performance.
Reviewed the number of panels and queries per dashboard.
Examined query complexity and cardinality.
Interviewed different teams about their dashboard usage patterns.
Audited dashboard permissions and ownership.
Root Cause:
Over time, the monitoring dashboards had grown organically without proper governance. Teams had continuously added new metrics without removing old ones, resulting in dashboards with hundreds of panels. Many queries were inefficient, and there was significant duplication across dashboards. Additionally, there was no clear ownership or organization structure.
Fix/Workaround:
• Short-term: Optimized the most critical dashboards:
# Optimized Prometheus queries
# Before: inefficient query with high cardinality
sum by(instance) (rate(http_requests_total{job="api-server"}[5m]))
# After: more efficient query with reduced cardinality
sum by(instance) (rate(http_requests_total{job="api-server", handler=~"/api/v1/.*"}[5m]))
• Long-term: Implemented a dashboard-as-code approach with Grafonnet:
// dashboard.jsonnet
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local row = grafana.row;
local prometheus = grafana.prometheus;
local template = grafana.template;
local graphPanel = grafana.graphPanel;
dashboard.new(
'API Service Dashboard',
tags=['api', 'service'],
editable=true,
time_from='now-6h',
refresh='1m',
uid='api-service-dashboard',
)
.addTemplate(
template.datasource(
'PROMETHEUS_DS',
'prometheus',
'Prometheus',
hide='label',
)
)
.addTemplate(
template.new(
'instance',
'$PROMETHEUS_DS',
'label_values(up{job="api-server"}, instance)',
label='Instance',
refresh='time',
includeAll=true,
)
)
.addRow(
row.new(
title='Request Overview',
height='250px',
)
.addPanel(
graphPanel.new(
'Request Rate',
description='HTTP requests per second',
datasource='$PROMETHEUS_DS',
format='ops',
min=0,
)
.addTarget(
prometheus.target(
'sum by(instance) (rate(http_requests_total{job="api-server", instance=~"$instance"}[5m]))',
legendFormat='{{instance}}',
)
)
)
.addPanel(
graphPanel.new(
'Error Rate',
description='HTTP error rate',
datasource='$PROMETHEUS_DS',
format='percentunit',
min=0,
max=1,
)
.addTarget(
prometheus.target(
'sum by(instance) (rate(http_requests_total{job="api-server", instance=~"$instance", status_code=~"5.."}[5m])) / sum by(instance) (rate(http_requests_total{job="api-server", instance=~"$instance"}[5m]))',
legendFormat='{{instance}}',
)
)
)
)
• Created a dashboard governance model:
# dashboard_governance.yaml
dashboard_categories:
- name: Service Health
description: High-level service health dashboards for each major service
audience: All teams
refresh_rate: 1m
max_time_range: 7d
ownership: Platform team
- name: Business Metrics
description: Business-focused metrics and KPIs
audience: Product and business teams
refresh_rate: 5m
max_time_range: 30d
ownership: Data team
- name: Operational Metrics
description: Detailed operational metrics for troubleshooting
audience: SRE and service teams
refresh_rate: 30s
max_time_range: 24h
ownership: Service teams
dashboard_standards:
- Maximum of 20 panels per dashboard
- Each panel must have a clear title and description
- Consistent color scheme across related dashboards
- All dashboards must include service owner contact information
- Queries should be optimized for performance
- Critical thresholds should be indicated on graphs
- All custom dashboards must be created through the CI/CD pipeline
review_process:
frequency: Quarterly
participants:
- SRE representative
- Service owner
- Data team representative
activities:
- Review dashboard usage metrics
- Remove or archive unused dashboards
- Optimize inefficient queries
- Ensure compliance with standards
- Update documentation
• Implemented automated dashboard testing:
#!/usr/bin/env python3
# dashboard_validator.py
import json
import sys
import requests
import time
def validate_dashboard(dashboard_json):
"""Validate a Grafana dashboard JSON for compliance with standards."""
dashboard = json.loads(dashboard_json)
issues = []
# Check panel count
panel_count = count_panels(dashboard)
if panel_count > 20:
issues.append(f"Dashboard has {panel_count} panels, exceeding the maximum of 20")
# Check for panel titles and descriptions
for panel in get_all_panels(dashboard):
if not panel.get('title'):
issues.append(f"Panel ID {panel.get('id')} is missing a title")
if not panel.get('description'):
issues.append(f"Panel '{panel.get('title')}' is missing a description")
# Check for contact information
if 'tags' not in dashboard or 'owner' not in dashboard.get('tags', []):
issues.append("Dashboard is missing owner tag")
# Check for performance issues in queries
for panel in get_all_panels(dashboard):
for target in panel.get('targets', []):
query = target.get('expr', '')
if query:
performance_issues = check_query_performance(query)
if performance_issues:
issues.append(f"Performance issues in query for panel '{panel.get('title')}': {performance_issues}")
return issues
def count_panels(dashboard):
"""Count the total number of panels in a dashboard."""
return len(get_all_panels(dashboard))
def get_all_panels(dashboard):
"""Extract all panels from a dashboard, including those in rows."""
panels = []
for panel in dashboard.get('panels', []):
if panel.get('type') == 'row':
panels.extend(panel.get('panels', []))
else:
panels.append(panel)
return panels
def check_query_performance(query):
"""Check a Prometheus query for potential performance issues."""
issues = []
# Check for missing job labels
if '{' in query and 'job=' not in query:
issues.append("Query is missing job label filter")
# Check for high cardinality operations
high_cardinality_ops = ['group_left', 'group_right']
for op in high_cardinality_ops:
if op in query:
issues.append(f"Query uses high cardinality operation '{op}'")
# Check for inefficient rate() usage
if 'rate(' in query and '[5m]' not in query and '[1m]' not in query:
issues.append("Query uses rate() without appropriate time window")
return issues
def test_dashboard_loading(dashboard_id, grafana_url, api_key):
"""Test the loading time of a dashboard."""
headers = {
'Authorization': f'Bearer {api_key}',
'Content-Type': 'application/json'
}
start_time = time.time()
response = requests.get(f"{grafana_url}/api/dashboards/uid/{dashboard_id}", headers=headers)
end_time = time.time()
if response.status_code != 200:
return f"Failed to load dashboard: {response.status_code} {response.text}"
load_time = end_time - start_time
if load_time > 1.0:
return f"Dashboard load time is slow: {load_time:.2f} seconds"
return None
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: dashboard_validator.py <dashboard.json>")
sys.exit(1)
with open(sys.argv[1], 'r') as f:
dashboard_json = f.read()
issues = validate_dashboard(dashboard_json)
if issues:
print("Dashboard validation failed with the following issues:")
for issue in issues:
print(f"- {issue}")
sys.exit(1)
else:
print("Dashboard validation passed!")
sys.exit(0)
Lessons Learned:
Monitoring dashboards require governance and optimization to remain useful.
How to Avoid:
Implement dashboard-as-code for version control and consistency.
Establish clear ownership and review processes for dashboards.
Set standards for dashboard design and query efficiency (a query-timing sketch follows this list).
Regularly audit and clean up unused or inefficient dashboards.
Create purpose-specific dashboards rather than all-in-one dashboards.
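To put numbers behind the query-efficiency standard, candidate queries can be timed against the Prometheus HTTP API before a dashboard change is merged. The sketch below is a rough illustration; the Prometheus URL is a placeholder and wall-clock latency is only a crude proxy for query cost, but it is usually enough to catch regressions in CI.
#!/usr/bin/env python3
# query_benchmark.py - compare wall-clock latency of two PromQL queries (illustrative sketch)
import time
import requests

PROM_URL = "http://prometheus.example.com:9090"  # placeholder

def time_query(expr: str, samples: int = 5) -> float:
    """Return the average latency in seconds of an instant query over several runs."""
    total = 0.0
    for _ in range(samples):
        start = time.monotonic()
        resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=30)
        resp.raise_for_status()
        total += time.monotonic() - start
    return total / samples

if __name__ == "__main__":
    before = 'sum by(instance) (rate(http_requests_total{job="api-server"}[5m]))'
    after = 'sum by(instance) (rate(http_requests_total{job="api-server", handler=~"/api/v1/.*"}[5m]))'
    for label, expr in (("before", before), ("after", after)):
        print(f"{label}: {time_query(expr):.3f}s  {expr}")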
No summary provided
What Happened:
During a routine check, the operations team noticed that several critical dashboards in Grafana were showing gaps in data. Upon investigation, they found that the Prometheus server was experiencing storage corruption issues, with some metrics completely missing and others showing inconsistent values.
Diagnosis Steps:
Examined Prometheus logs for error messages.
Checked disk usage and I/O performance on the Prometheus server.
Verified Prometheus configuration and retention settings.
Analyzed recent changes to the monitoring infrastructure.
Tested querying specific metrics directly from Prometheus API.
Root Cause:
Multiple factors contributed to the storage corruption:
1. The Prometheus instance was running on a node with unstable storage performance.
2. The TSDB (Time Series Database) compaction process was interrupted multiple times due to pod evictions.
3. The retention period was set too high for the allocated storage.
4. No regular backups of Prometheus data were configured.
Fix/Workaround:
• Short-term: Restored from the most recent snapshot and implemented storage improvements:
# Prometheus StatefulSet with improved storage configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus
namespace: monitoring
spec:
serviceName: "prometheus"
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
securityContext:
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
containers:
- name: prometheus
image: prom/prometheus:v2.35.0
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=15d"
- "--storage.tsdb.retention.size=50GB"
- "--storage.tsdb.wal-compression=true"
- "--storage.tsdb.allow-overlapping-blocks=false"
- "--storage.tsdb.max-block-duration=2h"
- "--storage.tsdb.min-block-duration=2h"
- "--web.console.libraries=/etc/prometheus/console_libraries"
- "--web.console.templates=/etc/prometheus/consoles"
- "--web.enable-lifecycle"
ports:
- containerPort: 9090
readinessProbe:
httpGet:
path: /-/ready
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 30
livenessProbe:
httpGet:
path: /-/healthy
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 30
resources:
requests:
cpu: 1
memory: 4Gi
limits:
cpu: 2
memory: 8Gi
volumeMounts:
- name: prometheus-config
mountPath: /etc/prometheus
- name: prometheus-storage
mountPath: /prometheus
volumes:
- name: prometheus-config
configMap:
name: prometheus-config
volumeClaimTemplates:
- metadata:
name: prometheus-storage
spec:
storageClassName: premium-ssd
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 100Gi
• Long-term: Implemented a comprehensive monitoring resilience strategy:
// prometheus_backup.go
package main
import (
"context"
"fmt"
"log"
"os"
"os/exec"
"path/filepath"
"time"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/s3"
"github.com/aws/aws-sdk-go-v2/service/s3/types"
"github.com/robfig/cron/v3"
"gopkg.in/yaml.v3"
)
type BackupConfig struct {
Prometheus struct {
URL string `yaml:"url"`
DataDir string `yaml:"dataDir"`
Username string `yaml:"username"`
Password string `yaml:"password"`
} `yaml:"prometheus"`
S3 struct {
Bucket string `yaml:"bucket"`
Region string `yaml:"region"`
KeyPrefix string `yaml:"keyPrefix"`
} `yaml:"s3"`
Retention struct {
Days int `yaml:"days"`
} `yaml:"retention"`
Schedule string `yaml:"schedule"`
}
func main() {
// Load configuration
configFile, err := os.ReadFile("backup_config.yaml")
if err != nil {
log.Fatalf("Failed to read config file: %v", err)
}
var config BackupConfig
if err := yaml.Unmarshal(configFile, &config); err != nil {
log.Fatalf("Failed to parse config: %v", err)
}
// Create S3 client
// Use the aliased AWS config package to avoid clashing with the local `config` variable
cfg, err := awsconfig.LoadDefaultConfig(context.TODO(), awsconfig.WithRegion(config.S3.Region))
if err != nil {
log.Fatalf("Failed to load AWS config: %v", err)
}
s3Client := s3.NewFromConfig(cfg)
// Set up cron scheduler
c := cron.New()
_, err = c.AddFunc(config.Schedule, func() {
if err := backupPrometheus(config, s3Client); err != nil {
log.Printf("Backup failed: %v", err)
}
})
if err != nil {
log.Fatalf("Failed to schedule backup: %v", err)
}
// Run cleanup on a daily basis
_, err = c.AddFunc("0 0 * * *", func() {
if err := cleanupOldBackups(config, s3Client); err != nil {
log.Printf("Cleanup failed: %v", err)
}
})
if err != nil {
log.Fatalf("Failed to schedule cleanup: %v", err)
}
// Start cron scheduler
c.Start()
// Keep the main thread running
select {}
}
func backupPrometheus(config BackupConfig, s3Client *s3.Client) error {
log.Println("Starting Prometheus backup...")
// Create temporary directory for backup
tempDir, err := os.MkdirTemp("", "prometheus-backup-")
if err != nil {
return fmt.Errorf("failed to create temp directory: %v", err)
}
defer os.RemoveAll(tempDir)
// Take Prometheus snapshot
snapshotDir, err := takePrometheusSnapshot(config.Prometheus.URL, config.Prometheus.Username, config.Prometheus.Password)
if err != nil {
return fmt.Errorf("failed to take Prometheus snapshot: %v", err)
}
// Create backup archive
backupFile := filepath.Join(tempDir, fmt.Sprintf("prometheus-backup-%s.tar.gz", time.Now().Format("20060102-150405")))
if err := createBackupArchive(snapshotDir, backupFile); err != nil {
return fmt.Errorf("failed to create backup archive: %v", err)
}
// Upload to S3
if err := uploadToS3(s3Client, backupFile, config.S3.Bucket, config.S3.KeyPrefix); err != nil {
return fmt.Errorf("failed to upload to S3: %v", err)
}
log.Println("Prometheus backup completed successfully")
return nil
}
func takePrometheusSnapshot(prometheusURL, username, password string) (string, error) {
// Create HTTP client with authentication if provided
var cmd *exec.Cmd
if username != "" && password != "" {
cmd = exec.Command("curl", "-X", "POST", "-u", fmt.Sprintf("%s:%s", username, password), fmt.Sprintf("%s/api/v1/admin/tsdb/snapshot", prometheusURL))
} else {
cmd = exec.Command("curl", "-X", "POST", fmt.Sprintf("%s/api/v1/admin/tsdb/snapshot", prometheusURL))
}
output, err := cmd.CombinedOutput()
if err != nil {
return "", fmt.Errorf("snapshot API call failed: %v, output: %s", err, output)
}
// Parse response to get snapshot directory
// This is simplified - in a real implementation, you would parse the JSON response
snapshotDir := "/prometheus/snapshots/latest"
return snapshotDir, nil
}
func createBackupArchive(sourceDir, targetFile string) error {
cmd := exec.Command("tar", "-czf", targetFile, "-C", filepath.Dir(sourceDir), filepath.Base(sourceDir))
output, err := cmd.CombinedOutput()
if err != nil {
return fmt.Errorf("tar command failed: %v, output: %s", err, output)
}
return nil
}
func uploadToS3(client *s3.Client, filePath, bucket, keyPrefix string) error {
file, err := os.Open(filePath)
if err != nil {
return fmt.Errorf("failed to open file: %v", err)
}
defer file.Close()
key := fmt.Sprintf("%s/%s", keyPrefix, filepath.Base(filePath))
_, err = client.PutObject(context.TODO(), &s3.PutObjectInput{
Bucket: &bucket,
Key: &key,
Body: file,
})
if err != nil {
return fmt.Errorf("failed to upload to S3: %v", err)
}
return nil
}
func cleanupOldBackups(config BackupConfig, client *s3.Client) error {
log.Println("Starting cleanup of old backups...")
// Calculate cutoff date
cutoffDate := time.Now().AddDate(0, 0, -config.Retention.Days)
// List objects in bucket
resp, err := client.ListObjectsV2(context.TODO(), &s3.ListObjectsV2Input{
Bucket: &config.S3.Bucket,
Prefix: &config.S3.KeyPrefix,
})
if err != nil {
return fmt.Errorf("failed to list objects in S3: %v", err)
}
// Identify objects to delete
var objectsToDelete []types.ObjectIdentifier
for _, obj := range resp.Contents {
if obj.LastModified.Before(cutoffDate) {
objectsToDelete = append(objectsToDelete, types.ObjectIdentifier{
Key: obj.Key,
})
}
}
// Delete old objects
if len(objectsToDelete) > 0 {
_, err = client.DeleteObjects(context.TODO(), &s3.DeleteObjectsInput{
Bucket: &config.S3.Bucket,
Delete: &types.Delete{
Objects: objectsToDelete,
Quiet: true,
},
})
if err != nil {
return fmt.Errorf("failed to delete old backups: %v", err)
}
log.Printf("Deleted %d old backups", len(objectsToDelete))
} else {
log.Println("No old backups to delete")
}
return nil
}
• Implemented a Prometheus storage validation tool:
// prometheus_storage_validator.rs
use chrono::{DateTime, Duration, Utc};
use reqwest::Client;
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::error::Error;
use std::fs;
use std::path::{Path, PathBuf};
use std::process::Command;
use structopt::StructOpt;
use tokio::time;
#[derive(Debug, StructOpt)]
#[structopt(name = "prometheus-storage-validator", about = "Validates Prometheus TSDB storage")]
struct Opt {
#[structopt(short, long, help = "Prometheus data directory")]
data_dir: PathBuf,
#[structopt(short, long, default_value = "http://localhost:9090", help = "Prometheus API URL")]
prometheus_url: String,
#[structopt(short, long, help = "Run repair if issues are found")]
repair: bool,
#[structopt(long, default_value = "24h", help = "Time range to validate (e.g. 24h, 7d)")]
time_range: String,
#[structopt(long, help = "Output report to file")]
report_file: Option<PathBuf>,
}
#[derive(Debug, Serialize, Deserialize)]
struct ValidationReport {
timestamp: DateTime<Utc>,
data_dir: PathBuf,
prometheus_url: String,
time_range: String,
blocks_checked: usize,
blocks_with_issues: usize,
issues: Vec<Issue>,
metrics_checked: usize,
metrics_with_issues: usize,
metric_issues: Vec<MetricIssue>,
repairs_performed: Vec<Repair>,
}
#[derive(Debug, Serialize, Deserialize)]
struct Issue {
block_id: String,
issue_type: String,
description: String,
severity: String,
repairable: bool,
}
#[derive(Debug, Serialize, Deserialize)]
struct MetricIssue {
metric_name: String,
issue_type: String,
description: String,
time_range: String,
}
#[derive(Debug, Serialize, Deserialize)]
struct Repair {
block_id: String,
repair_type: String,
description: String,
success: bool,
error: Option<String>,
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
let opt = Opt::from_args();
println!("Starting Prometheus storage validation...");
println!("Data directory: {}", opt.data_dir.display());
println!("Prometheus URL: {}", opt.prometheus_url);
println!("Time range: {}", opt.time_range);
println!("Repair mode: {}", if opt.repair { "enabled" } else { "disabled" });
// Validate data directory
if !opt.data_dir.exists() {
return Err(format!("Data directory does not exist: {}", opt.data_dir.display()).into());
}
// Initialize report
let mut report = ValidationReport {
timestamp: Utc::now(),
data_dir: opt.data_dir.clone(),
prometheus_url: opt.prometheus_url.clone(),
time_range: opt.time_range.clone(),
blocks_checked: 0,
blocks_with_issues: 0,
issues: Vec::new(),
metrics_checked: 0,
metrics_with_issues: 0,
metric_issues: Vec::new(),
repairs_performed: Vec::new(),
};
// Check TSDB blocks
check_tsdb_blocks(&opt.data_dir, &mut report)?;
// Check metrics via API
check_metrics_via_api(&opt.prometheus_url, &opt.time_range, &mut report).await?;
// Perform repairs if requested
if opt.repair && !report.issues.is_empty() {
perform_repairs(&opt.data_dir, &mut report)?;
}
// Print summary
print_summary(&report);
// Save report if requested
if let Some(report_file) = opt.report_file {
save_report(&report, &report_file)?;
}
Ok(())
}
fn check_tsdb_blocks(data_dir: &Path, report: &mut ValidationReport) -> Result<(), Box<dyn Error>> {
println!("Checking TSDB blocks...");
// Find all block directories
let blocks_dir = data_dir.join("blocks");
if !blocks_dir.exists() {
return Err(format!("Blocks directory does not exist: {}", blocks_dir.display()).into());
}
let mut block_dirs = Vec::new();
for entry in fs::read_dir(&blocks_dir)? {
let entry = entry?;
let path = entry.path();
if path.is_dir() {
block_dirs.push(path);
}
}
report.blocks_checked = block_dirs.len();
println!("Found {} blocks to check", block_dirs.len());
// Check each block
for block_dir in block_dirs {
let block_id = block_dir.file_name().unwrap().to_string_lossy().to_string();
println!("Checking block: {}", block_id);
// Check index file
let index_file = block_dir.join("index");
if !index_file.exists() {
report.issues.push(Issue {
block_id: block_id.clone(),
issue_type: "Missing index file".to_string(),
description: format!("Block {} is missing its index file", block_id),
severity: "Critical".to_string(),
repairable: false,
});
report.blocks_with_issues += 1;
continue;
}
// Check meta.json file
let meta_file = block_dir.join("meta.json");
if !meta_file.exists() {
report.issues.push(Issue {
block_id: block_id.clone(),
issue_type: "Missing meta.json file".to_string(),
description: format!("Block {} is missing its meta.json file", block_id),
severity: "Critical".to_string(),
repairable: false,
});
report.blocks_with_issues += 1;
continue;
}
// Check chunks directory
let chunks_dir = block_dir.join("chunks");
if !chunks_dir.exists() {
report.issues.push(Issue {
block_id: block_id.clone(),
issue_type: "Missing chunks directory".to_string(),
description: format!("Block {} is missing its chunks directory", block_id),
severity: "Critical".to_string(),
repairable: false,
});
report.blocks_with_issues += 1;
continue;
}
// Check for tombstones
let tombstones_file = block_dir.join("tombstones");
if tombstones_file.exists() {
// This is not necessarily an issue, but worth noting
report.issues.push(Issue {
block_id: block_id.clone(),
issue_type: "Tombstones present".to_string(),
description: format!("Block {} has tombstones which may indicate deleted series", block_id),
severity: "Info".to_string(),
repairable: false,
});
}
// Run promtool to check the block (the exact subcommand and flags may differ between Prometheus versions)
let output = Command::new("promtool")
.arg("tsdb")
.arg("block")
.arg("verify")
.arg("--repair=false")
.arg(block_dir.to_str().unwrap())
.output();
match output {
Ok(output) => {
if !output.status.success() {
let stderr = String::from_utf8_lossy(&output.stderr);
report.issues.push(Issue {
block_id: block_id.clone(),
issue_type: "Block verification failed".to_string(),
description: format!("Block {} failed verification: {}", block_id, stderr),
severity: "High".to_string(),
repairable: true,
});
report.blocks_with_issues += 1;
}
}
Err(e) => {
report.issues.push(Issue {
block_id: block_id.clone(),
issue_type: "Block verification error".to_string(),
description: format!("Failed to run verification on block {}: {}", block_id, e),
severity: "Medium".to_string(),
repairable: false,
});
report.blocks_with_issues += 1;
}
}
}
Ok(())
}
async fn check_metrics_via_api(prometheus_url: &str, time_range: &str, report: &mut ValidationReport) -> Result<(), Box<dyn Error>> {
println!("Checking metrics via API...");
let client = Client::new();
// Get list of metrics
let metrics_url = format!("{}/api/v1/label/__name__/values", prometheus_url);
let response = client.get(&metrics_url).send().await?;
let metrics_response: serde_json::Value = response.json().await?;
let metrics = match metrics_response["data"].as_array() {
Some(metrics) => metrics.iter().filter_map(|m| m.as_str().map(|s| s.to_string())).collect::<Vec<_>>(),
None => return Err("Failed to parse metrics list".into()),
};
report.metrics_checked = metrics.len();
println!("Found {} metrics to check", metrics.len());
// Parse time range
let duration = parse_duration(time_range)?;
let end_time = Utc::now();
let start_time = end_time - duration;
// Check a sample of metrics (limit to 100 to avoid overloading Prometheus)
let metrics_to_check = if metrics.len() > 100 {
let mut sampled = Vec::with_capacity(100);
let step = metrics.len() / 100;
for i in (0..metrics.len()).step_by(step) {
sampled.push(metrics[i].clone());
if sampled.len() >= 100 {
break;
}
}
sampled
} else {
metrics
};
for metric in metrics_to_check {
println!("Checking metric: {}", metric);
// Query metric data
let query_url = format!(
"{}/api/v1/query_range?query={}&start={}&end={}&step=1h",
prometheus_url,
metric,
start_time.timestamp(),
end_time.timestamp()
);
let response = client.get(&query_url).send().await?;
let query_response: serde_json::Value = response.json().await?;
// Check for errors
if query_response["status"].as_str() != Some("success") {
let error = query_response["error"].as_str().unwrap_or("Unknown error");
report.metric_issues.push(MetricIssue {
metric_name: metric.clone(),
issue_type: "Query error".to_string(),
description: format!("Error querying metric {}: {}", metric, error),
time_range: time_range.to_string(),
});
report.metrics_with_issues += 1;
continue;
}
// Check for empty results
let results = match query_response["data"]["result"].as_array() {
Some(results) => results,
None => {
report.metric_issues.push(MetricIssue {
metric_name: metric.clone(),
issue_type: "Empty result".to_string(),
description: format!("Metric {} returned no data", metric),
time_range: time_range.to_string(),
});
report.metrics_with_issues += 1;
continue;
}
};
if results.is_empty() {
report.metric_issues.push(MetricIssue {
metric_name: metric.clone(),
issue_type: "Empty result".to_string(),
description: format!("Metric {} returned no data", metric),
time_range: time_range.to_string(),
});
report.metrics_with_issues += 1;
continue;
}
// Check for data gaps
for result in results {
let values = match result["values"].as_array() {
Some(values) => values,
None => continue,
};
if values.is_empty() {
continue;
}
// Check for large gaps in timestamps
let mut prev_timestamp = None;
let mut gaps = 0;
for value in values {
let timestamp = match value[0].as_f64() {
Some(ts) => ts,
None => continue,
};
if let Some(prev_ts) = prev_timestamp {
// Check for gaps larger than 2x the step size (assuming 1h step)
if timestamp - prev_ts > 7200.0 {
gaps += 1;
}
}
prev_timestamp = Some(timestamp);
}
if gaps > 0 && gaps > values.len() / 10 {
// If more than 10% of the data points have gaps
report.metric_issues.push(MetricIssue {
metric_name: metric.clone(),
issue_type: "Data gaps".to_string(),
description: format!("Metric {} has {} significant gaps in data", metric, gaps),
time_range: time_range.to_string(),
});
report.metrics_with_issues += 1;
break;
}
}
// Avoid overloading Prometheus
time::sleep(time::Duration::from_millis(100)).await;
}
Ok(())
}
fn perform_repairs(data_dir: &Path, report: &mut ValidationReport) -> Result<(), Box<dyn Error>> {
println!("Performing repairs...");
for issue in &report.issues {
if !issue.repairable {
continue;
}
println!("Attempting to repair block: {}", issue.block_id);
let block_dir = data_dir.join("blocks").join(&issue.block_id);
if !block_dir.exists() {
report.repairs_performed.push(Repair {
block_id: issue.block_id.clone(),
repair_type: "Block not found".to_string(),
description: format!("Block directory {} does not exist", issue.block_id),
success: false,
error: Some("Block directory not found".to_string()),
});
continue;
}
// Run promtool to repair block
let output = Command::new("promtool")
.arg("tsdb")
.arg("block")
.arg("verify")
.arg("--repair=true")
.arg(block_dir.to_str().unwrap())
.output();
match output {
Ok(output) => {
if output.status.success() {
report.repairs_performed.push(Repair {
block_id: issue.block_id.clone(),
repair_type: "Block verification repair".to_string(),
description: format!("Successfully repaired block {}", issue.block_id),
success: true,
error: None,
});
} else {
let stderr = String::from_utf8_lossy(&output.stderr);
report.repairs_performed.push(Repair {
block_id: issue.block_id.clone(),
repair_type: "Block verification repair".to_string(),
description: format!("Failed to repair block {}", issue.block_id),
success: false,
error: Some(stderr.to_string()),
});
}
}
Err(e) => {
report.repairs_performed.push(Repair {
block_id: issue.block_id.clone(),
repair_type: "Block verification repair".to_string(),
description: format!("Failed to run repair on block {}", issue.block_id),
success: false,
error: Some(e.to_string()),
});
}
}
}
Ok(())
}
fn print_summary(report: &ValidationReport) {
println!("\nValidation Summary:");
println!("------------------");
println!("Timestamp: {}", report.timestamp);
println!("Data directory: {}", report.data_dir.display());
println!("Prometheus URL: {}", report.prometheus_url);
println!("Time range: {}", report.time_range);
println!("Blocks checked: {}", report.blocks_checked);
println!("Blocks with issues: {}", report.blocks_with_issues);
println!("Metrics checked: {}", report.metrics_checked);
println!("Metrics with issues: {}", report.metrics_with_issues);
println!("Repairs performed: {}", report.repairs_performed.len());
if !report.issues.is_empty() {
println!("\nBlock Issues:");
for issue in &report.issues {
println!("- [{}] Block {}: {} - {}", issue.severity, issue.block_id, issue.issue_type, issue.description);
}
}
if !report.metric_issues.is_empty() {
println!("\nMetric Issues:");
for issue in &report.metric_issues {
println!("- Metric {}: {} - {}", issue.metric_name, issue.issue_type, issue.description);
}
}
if !report.repairs_performed.is_empty() {
println!("\nRepairs:");
for repair in &report.repairs_performed {
let status = if repair.success { "SUCCESS" } else { "FAILED" };
println!("- [{}] Block {}: {} - {}", status, repair.block_id, repair.repair_type, repair.description);
if let Some(error) = &repair.error {
println!(" Error: {}", error);
}
}
}
}
fn save_report(report: &ValidationReport, file_path: &Path) -> Result<(), Box<dyn Error>> {
let json = serde_json::to_string_pretty(report)?;
fs::write(file_path, json)?;
println!("Report saved to {}", file_path.display());
Ok(())
}
fn parse_duration(duration_str: &str) -> Result<Duration, Box<dyn Error>> {
let mut total_seconds = 0;
let mut current_number = String::new();
for c in duration_str.chars() {
if c.is_digit(10) {
current_number.push(c);
} else {
let number = current_number.parse::<i64>()?;
current_number.clear();
match c {
's' => total_seconds += number,
'm' => total_seconds += number * 60,
'h' => total_seconds += number * 3600,
'd' => total_seconds += number * 86400,
'w' => total_seconds += number * 604800,
_ => return Err(format!("Unknown duration unit: {}", c).into()),
}
}
}
if !current_number.is_empty() {
return Err("Duration string must end with a unit".into());
}
Ok(Duration::seconds(total_seconds))
}
Lessons Learned:
Prometheus storage requires careful management and monitoring to prevent data loss.
How to Avoid:
Implement proper storage configuration with appropriate retention settings.
Use high-quality storage with consistent performance.
Set up regular backups of Prometheus data.
Monitor Prometheus storage metrics and set alerts for potential issues (see the forecast sketch after this list).
Implement a high-availability setup for critical monitoring infrastructure.
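One concrete form of such an alert is a forecast of when the Prometheus volume will fill. The sketch below is illustrative and assumes kubelet volume metrics are being scraped and that the PVC name matches the pattern used earlier in this scenario; it projects utilization three days ahead with predict_linear.
#!/usr/bin/env python3
# storage_forecast.py - estimate Prometheus volume utilization three days out (illustrative sketch)
# Assumes kubelet volume metrics are available; PROM_URL and the PVC pattern are placeholders.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # placeholder
PVC_RE = "prometheus-storage-.*"                 # placeholder PVC name pattern

# Bytes predicted to be in use three days from now (based on the last 6h of growth), as a fraction of capacity.
QUERY = (
    f'predict_linear(kubelet_volume_stats_used_bytes{{persistentvolumeclaim=~"{PVC_RE}"}}[6h], 3 * 86400)'
    f' / kubelet_volume_stats_capacity_bytes{{persistentvolumeclaim=~"{PVC_RE}"}}'
)

def main() -> None:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=15)
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        pvc = sample["metric"].get("persistentvolumeclaim", "unknown")
        projected = float(sample["value"][1])
        status = "OK" if projected < 0.85 else "WARNING: resize volume or lower retention"
        print(f"{pvc}: projected utilization in 3 days = {projected:.0%} ({status})")

if __name__ == "__main__":
    main()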
No summary provided
What Happened:
During a network partition event, the monitoring system generated over 500 alerts in less than 10 minutes. The alerts came from different services and components, making it difficult to identify the root cause. The on-call engineer spent significant time triaging alerts rather than addressing the underlying network issue, extending the outage duration.
Diagnosis Steps:
Analyzed alert patterns and timestamps to identify the first alerts.
Reviewed alert routing and grouping configurations.
Examined alert dependencies and relationships.
Tested alert correlation with simulated failures.
Reviewed on-call response procedures and documentation.
Root Cause:
The investigation revealed multiple issues with the alert configuration:
1. Alert correlation was not properly configured in Alertmanager.
2. Dependency mapping between services was missing.
3. Alert priority and severity were inconsistently defined.
4. Alerting thresholds were too sensitive for dependent services.
5. Silencing mechanisms were not properly utilized during incident response.
Fix/Workaround:
• Short-term: Implemented improved Alertmanager configuration:
# Before: Problematic Alertmanager configuration
global:
resolve_timeout: 5m
route:
group_by: ['alertname']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'team-pager'
receivers:
- name: 'team-pager'
pagerduty_configs:
- service_key: '<secret>'
# After: Improved Alertmanager configuration with correlation
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'team-pager'
routes:
- match:
severity: critical
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 1m
repeat_interval: 1h
receiver: 'team-pager'
continue: true
- match:
severity: warning
group_by: ['alertname', 'cluster', 'service']
group_wait: 2m
group_interval: 5m
repeat_interval: 3h
receiver: 'team-email'
- match_re:
service: ^(network|dns|connectivity).*
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 1m
receiver: 'network-team'
routes:
- match:
severity: critical
receiver: 'network-pager'
inhibit_rules:
- source_match:
severity: 'critical'
alertname: 'NetworkPartition'
target_match:
severity: 'warning'
equal: ['cluster']
- source_match:
severity: 'critical'
alertname: 'NodeNetworkUnavailable'
target_match_re:
alertname: '.*Unavailable'
equal: ['instance', 'cluster']
receivers:
- name: 'team-pager'
pagerduty_configs:
- service_key: '<secret>'
- name: 'team-email'
email_configs:
- to: 'team@example.com'
- name: 'network-team'
email_configs:
- to: 'network-team@example.com'
- name: 'network-pager'
pagerduty_configs:
- service_key: '<network-secret>'
• Implemented alert dependency mapping in Prometheus rules:
groups:
- name: network.rules
rules:
- alert: NetworkPartition
expr: sum(up{job="node-exporter"}) by (cluster) / count(up{job="node-exporter"}) by (cluster) < 0.7
for: 1m
labels:
severity: critical
service: network
annotations:
summary: "Network partition detected in cluster {{ $labels.cluster }}"
description: "{{ $value | humanizePercentage }} of nodes are unreachable, indicating a network partition."
runbook_url: "https://runbooks.example.com/network/partition.md"
- name: application.rules
rules:
- alert: ServiceUnavailable
expr: up{job=~".*-service"} == 0
for: 2m
labels:
severity: warning
service: "{{ $labels.job }}"
annotations:
summary: "Service {{ $labels.job }} is down"
description: "Service {{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes."
runbook_url: "https://runbooks.example.com/services/unavailable.md"
• Long-term: Implemented a comprehensive alert correlation system:
- Created a service dependency graph for intelligent alert correlation (sketched after this list)
- Implemented automated root cause analysis
- Developed alert noise reduction algorithms
- Established clear alert severity definitions and escalation paths
- Implemented regular alert configuration reviews
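• Illustrative sketch of dependency-based suppression (the service graph and firing alerts below are hypothetical, not taken from this incident); it keeps only alerts whose upstream dependencies are healthy, folding symptom alerts under their probable cause:
#!/usr/bin/env python3
# alert_correlation_sketch.py - suppress symptom alerts using a service dependency graph
# The dependency graph and firing-alert set below are hypothetical examples.
from typing import Dict, Set

# service -> upstream services it depends on
DEPENDENCIES: Dict[str, Set[str]] = {
    "checkout-service": {"payment-service", "network"},
    "payment-service": {"network"},
    "api-gateway": {"checkout-service", "network"},
    "network": set(),
}

def root_cause_candidates(alerting: Set[str]) -> Set[str]:
    """Return alerting services none of whose (transitive) dependencies are also alerting."""
    def has_alerting_upstream(service: str, seen: Set[str]) -> bool:
        for dep in DEPENDENCIES.get(service, set()):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting or has_alerting_upstream(dep, seen):
                return True
        return False
    return {s for s in alerting if not has_alerting_upstream(s, set())}

if __name__ == "__main__":
    firing = {"checkout-service", "payment-service", "api-gateway", "network"}
    roots = root_cause_candidates(firing)
    print("Firing alerts:", sorted(firing))
    print("Probable root cause(s):", sorted(roots))           # -> ['network']
    print("Suppressed as symptoms:", sorted(firing - roots))  # -> the downstream services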
Lessons Learned:
Effective alert correlation is critical for rapid incident response.
How to Avoid:
Map service dependencies and use them for alert correlation.
Implement inhibition rules for dependent alerts.
Define clear alert severity levels and escalation paths.
Regularly review and test alert configurations.
Implement automated root cause analysis tools.
No summary provided
What Happened:
During a routine update to the monitoring stack, the operations team deployed a new version of the Prometheus Node Exporter to all production servers. Within hours, system administrators began receiving alerts about high CPU and memory usage across multiple servers. Investigation revealed that the Node Exporter processes were consuming excessive resources, causing performance degradation of critical applications.
Diagnosis Steps:
Identified servers with the highest resource utilization.
Examined Node Exporter process behavior and resource consumption.
Compared configuration between working and failing instances.
Reviewed recent changes to the monitoring stack.
Analyzed Prometheus scrape configurations and metrics.
Root Cause:
The investigation revealed multiple issues with the monitoring agent deployment:
1. The new Node Exporter version had a memory leak when certain collectors were enabled.
2. The deployment included a configuration change that enabled additional resource-intensive collectors.
3. Prometheus scrape intervals were too aggressive for the number of targets.
4. No resource limits were set for the monitoring agents.
5. The monitoring system lacked isolation from the applications it was monitoring.
Fix/Workaround:
• Implemented immediate fixes to restore system stability
• Rolled back to the previous Node Exporter version
• Adjusted collector configurations to disable problematic ones
• Implemented proper resource limits for monitoring agents (see the patch sketch after this list)
• Created a phased deployment strategy for future updates
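• For illustration, resource limits can be applied to an existing agent with a strategic-merge patch via the official Kubernetes Python client; the DaemonSet name, namespace, and limit values below are placeholders, not the values used in this incident:
#!/usr/bin/env python3
# limit_node_exporter.py - apply CPU/memory limits to the node-exporter DaemonSet (illustrative sketch)
# The DaemonSet name, namespace, and limit values are placeholders.
from kubernetes import client, config

def main() -> None:
    config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "node-exporter",
                            "resources": {
                                "requests": {"cpu": "50m", "memory": "64Mi"},
                                "limits": {"cpu": "200m", "memory": "128Mi"},
                            },
                        }
                    ]
                }
            }
        }
    }
    apps.patch_namespaced_daemon_set(name="node-exporter", namespace="monitoring", body=patch)
    print("Patched node-exporter resource limits")

if __name__ == "__main__":
    main()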
Lessons Learned:
Monitoring systems can themselves become a source of production issues if not properly managed.
How to Avoid:
Implement proper resource limits for all monitoring components.
Test monitoring agent updates in staging before production deployment.
Deploy monitoring updates gradually with careful observation.
Maintain separate monitoring for the monitoring system itself.
Create automated rollback procedures for monitoring components.
No summary provided
What Happened:
During a network partition event in a production environment, the monitoring system generated hundreds of alerts within minutes. The alert storm overwhelmed notification channels, flooded chat rooms, and triggered multiple pages to the on-call team. The excessive noise made it difficult to identify the root cause and prioritize response actions. The incident response was delayed as the team struggled to filter through the noise to find actionable information.
Diagnosis Steps:
Analyzed alert patterns and timing.
Reviewed alert configurations and dependencies.
Examined alert grouping and routing rules.
Assessed the alert prioritization mechanism.
Evaluated the incident response workflow.
Root Cause:
The investigation revealed multiple issues with alert management:
1. No proper alert dependency mapping was implemented.
2. Alert thresholds were set too sensitively.
3. Lack of alert grouping and correlation.
4. No alert severity classification or prioritization.
5. Insufficient filtering of secondary and symptom alerts.
Fix/Workaround:
• Implemented immediate improvements to alert management
• Created alert dependency mapping
• Configured proper alert grouping and correlation (a triage sketch follows this list)
• Established alert severity classification
• Developed intelligent alert routing based on context
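• As a companion to the grouping work, active alerts can be pulled from the Alertmanager v2 API and bucketed by service and severity to check that correlation is actually reducing noise; a small triage sketch follows (the Alertmanager URL is a placeholder):
#!/usr/bin/env python3
# alert_triage.py - summarize active alerts by service and severity (illustrative sketch)
from collections import Counter
import requests

ALERTMANAGER_URL = "http://alertmanager.example.com:9093"  # placeholder

def main() -> None:
    # The v2 API returns a list of alert objects, each carrying its label set.
    resp = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts", params={"active": "true"}, timeout=10)
    resp.raise_for_status()
    alerts = resp.json()
    buckets = Counter(
        (a["labels"].get("service", "unknown"), a["labels"].get("severity", "none"))
        for a in alerts
    )
    print(f"{len(alerts)} active alerts")
    for (service, severity), count in buckets.most_common():
        print(f"  {service:<30} {severity:<10} {count}")

if __name__ == "__main__":
    main()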
Lessons Learned:
Effective alert management requires thoughtful design to prevent alert fatigue and enable rapid incident response.
How to Avoid:
Implement alert dependency mapping to reduce symptom alerts.
Configure proper alert grouping and correlation.
Establish clear alert severity classification and prioritization.
Design intelligent routing based on alert context and severity.
Regularly review and optimize alert configurations.
No summary provided
What Happened:
A large e-commerce company was experiencing a significant increase in traffic during a major sales event. As the load increased, several services began to degrade, triggering alerts. However, just as the incident response team began investigating, the monitoring system itself became unresponsive. The Prometheus servers were overwhelmed by the volume of metrics being collected, and the Grafana dashboards became unavailable. The operations team was left without visibility into the state of the infrastructure during a critical incident, significantly hampering their ability to diagnose and resolve the underlying issues.
Diagnosis Steps:
Attempted to access monitoring dashboards and confirmed they were unresponsive.
Checked the status of the monitoring infrastructure components.
Examined logs from the Prometheus and Grafana servers.
Analyzed resource utilization on the monitoring infrastructure.
Reviewed recent changes to monitoring configuration.
Root Cause:
The investigation revealed multiple issues with the monitoring infrastructure:
1. The Prometheus servers were undersized for the volume of metrics being collected.
2. No horizontal scaling was configured for the monitoring components.
3. Retention policies were not properly configured, leading to excessive disk usage.
4. The monitoring system itself lacked adequate monitoring and alerting.
5. There was no fallback monitoring system or redundancy.
Fix/Workaround:
• Implemented immediate improvements to monitoring infrastructure
• Increased resources allocated to Prometheus servers
• Configured horizontal scaling for monitoring components
• Implemented proper retention policies
• Established a secondary, independent monitoring system
• Created alerts specifically for monitoring system health
Lessons Learned:
Monitoring systems are critical infrastructure that require the same level of resilience and scaling considerations as production services.
How to Avoid:
Design monitoring systems with appropriate scaling capabilities.
Implement redundancy for critical monitoring components.
Monitor the monitoring system itself with independent tools.
Regularly test monitoring system performance under load (see the load-test sketch below).
Establish clear retention policies to manage resource usage.
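A first-pass load test of the monitoring stack does not need specialized tooling: firing a batch of concurrent range queries and watching latency is often enough to find the breaking point before a real incident does. The sketch below is a rough illustration; the Prometheus URL and query are placeholders.
#!/usr/bin/env python3
# prometheus_load_test.py - fire concurrent range queries and report latency (illustrative sketch)
import time
from concurrent.futures import ThreadPoolExecutor
import requests

PROM_URL = "http://prometheus.example.com:9090"            # placeholder
QUERY = 'sum by(job) (rate(http_requests_total[5m]))'      # placeholder query
CONCURRENCY = 20
REQUESTS_PER_WORKER = 10

def one_query() -> float:
    """Run a single 6-hour range query and return its wall-clock latency in seconds."""
    end = int(time.time())
    params = {"query": QUERY, "start": end - 6 * 3600, "end": end, "step": "60s"}
    start = time.monotonic()
    resp = requests.get(f"{PROM_URL}/api/v1/query_range", params=params, timeout=60)
    resp.raise_for_status()
    return time.monotonic() - start

def worker(_: int) -> list:
    return [one_query() for _ in range(REQUESTS_PER_WORKER)]

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = [lat for result in pool.map(worker, range(CONCURRENCY)) for lat in result]
    latencies.sort()
    p95 = latencies[int(len(latencies) * 0.95) - 1]
    print(f"{len(latencies)} queries, avg {sum(latencies)/len(latencies):.2f}s, p95 {p95:.2f}s")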