# Monitoring and Logging Scenarios
Summary: Alert fatigue from weeks of non-actionable notifications led to a missed critical alert and a prolonged database outage.
What Happened:
A database failure caused a significant outage, but the team didn't respond promptly because they had become desensitized to alerts after weeks of receiving hundreds of non-actionable notifications.
Diagnosis Steps:
Reviewed alert history in PagerDuty and found the critical alert buried among many others.
Analyzed alert patterns and response times over the past month.
Surveyed team members about alert fatigue and notification practices.
Audited alert configurations in Prometheus.
Root Cause:
Poor alert tuning led to hundreds of low-value notifications. The team had started ignoring alerts due to the high false positive rate.
Fix/Workaround:
• Implemented alert severity levels with different notification channels (see the routing sketch after this list).
• Reduced noisy alerts by adjusting thresholds based on historical data.
• Created runbooks for common alerts to speed up response.
• Implemented alert correlation to group related issues.
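A minimal Alertmanager routing sketch for the severity-based channels and alert grouping described above; the receiver names (pagerduty-critical, slack-low-priority) and the timing values are illustrative, and the receivers themselves are assumed to be defined elsewhere in the file:
```yaml
# alertmanager.yml (fragment): route alerts to different channels by severity
# and group related alerts to cut notification volume
route:
  receiver: slack-low-priority            # default for anything not matched below
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical        # pages the on-call engineer
    - match:
        severity: warning
      receiver: slack-low-priority        # chat notification only, no page
```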
Lessons Learned:
More alerts don't mean better monitoring; quality and actionability matter more than quantity.
How to Avoid:
Follow the RED method (Rate, Errors, Duration) for service monitoring.
Implement alert severity levels (P1-P4) with appropriate response expectations.
Regularly review and tune alerts based on actionability.
Use time-of-day and service impact to determine alert routing.
Create clear ownership for each alert to avoid diffusion of responsibility.
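One way to make severity, ownership, and runbooks concrete is to attach them to each alerting rule as labels and annotations; a sketch with illustrative names (the pg_up expression assumes a postgres_exporter target):
```yaml
# Prometheus alerting rule (fragment): severity, owning team, and runbook on the rule itself
groups:
  - name: database
    rules:
      - alert: PostgresDown
        expr: pg_up == 0              # assumed postgres_exporter metric
        for: 2m
        labels:
          severity: P1                # drives routing/paging in Alertmanager
          team: database-platform     # clear ownership, no diffusion of responsibility
        annotations:
          summary: "PostgreSQL instance {{ $labels.instance }} is down"
          runbook_url: "https://runbooks.example.com/postgres-down"
```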
Summary: A production incident left gaps in monitoring data because the monitoring stack ran on the same infrastructure it was meant to watch.
What Happened:
A production incident occurred, but when the team tried to investigate, they found gaps in monitoring data precisely during the critical period. The monitoring infrastructure had itself been affected by the same issue.
Diagnosis Steps:
Checked Prometheus targets and found many in "down" state.
Reviewed monitoring server logs and discovered disk I/O issues.
Analyzed system metrics from the monitoring servers themselves.
Found that the monitoring stack was deployed on the same infrastructure as the services it monitored.
Root Cause:
The monitoring infrastructure was not isolated from the production environment. When high load caused resource contention, monitoring was one of the first services to fail.
Fix/Workaround:
• Moved monitoring infrastructure to a separate, dedicated cluster.
• Implemented remote write to send metrics to a secondary backup system (see the configuration sketch after this list).
• Added monitoring for the monitoring system itself with separate alerting.
• Increased resource allocation for monitoring components.
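A sketch of the remote_write setup referenced above; the backup endpoint URL is illustrative, and the queue_config values depend on the workload. The local queue also provides a degree of buffering when the remote endpoint is temporarily unreachable:
```yaml
# prometheus.yml (fragment): duplicate samples to an independent backend
remote_write:
  - url: https://metrics-backup.example.com/api/v1/write
    queue_config:
      capacity: 10000            # samples buffered per shard before sending
      max_shards: 30
      max_samples_per_send: 5000
```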
Lessons Learned:
Monitoring systems need to be more resilient than the systems they monitor.
How to Avoid:
Isolate monitoring infrastructure from production environments.
Implement redundant monitoring with different failure domains.
Set up cross-region or cross-provider monitoring backups.
Monitor the monitoring system with independent tools (see the watchdog sketch after this list).
Implement local metric buffering to handle temporary network issues.
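One common way to monitor the monitoring system with an independent tool is a "watchdog" alert that always fires; an external dead man's switch, hosted outside the production failure domain, pages if the heartbeat stops arriving. A minimal sketch of the rule side:
```yaml
# Always-firing heartbeat: its absence means Prometheus or Alertmanager is down
groups:
  - name: meta-monitoring
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Heartbeat alert; an external dead man's switch expects this every few minutes"
```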
Summary: Prometheus servers crashed with OOM errors after high-cardinality labels caused a time series explosion.
What Happened:
Prometheus servers began crashing with OOM errors. The system was trying to track millions of unique time series, consuming all available memory.
Diagnosis Steps:
Checked Prometheus logs and found "storage memory exhausted" errors.
Used the prometheus_tsdb_head_series_created_total metric to track cardinality growth.
Analyzed label usage across metrics with high cardinality.
Found a recent deployment that added unique request IDs as labels to metrics.
Root Cause:
A new service was instrumenting metrics with high-cardinality labels (unique IDs, user IDs, and request IDs), causing an explosion in the number of time series.
Fix/Workaround:
• Removed high-cardinality labels from metrics.
• Implemented recording rules to pre-aggregate data at rule-evaluation time (see the sketch after this list).
• Increased Prometheus memory limits temporarily.
• Added cardinality limits in the client libraries.
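A sketch of the recording rules mentioned above, assuming a counter named app_request_total; the rule keeps a cheap aggregate so dashboards and alerts no longer query the raw, label-heavy series:
```yaml
# recording rule (fragment): pre-aggregate across high-cardinality dimensions
groups:
  - name: request-aggregates
    interval: 30s
    rules:
      - record: job:app_request_total:rate5m
        expr: sum by (job, status) (rate(app_request_total[5m]))
```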
Lessons Learned:
Time series cardinality has a massive impact on monitoring system performance and stability.
How to Avoid:
Educate developers about cardinality best practices.
Implement label validation in CI/CD pipelines.
Monitor cardinality growth with alerts for unusual increases (see the sketch after this list).
Use exemplars instead of labels for high-cardinality data.
Consider using alternative solutions like OpenTelemetry for high-cardinality use cases.
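A sketch of the cardinality-growth alert suggested above, using Prometheus's own prometheus_tsdb_head_series gauge; the threshold is illustrative and should be derived from the normal baseline:
```yaml
groups:
  - name: cardinality
    rules:
      - alert: TimeSeriesCardinalityHigh
        expr: prometheus_tsdb_head_series > 2000000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is tracking {{ $value }} head series; investigate recent label changes"
```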
Summary: A newly deployed application emitted metrics with unbounded label values, overwhelming Prometheus and causing a monitoring blackout.
What Happened:
The Prometheus server began consuming excessive memory and CPU, eventually becoming unresponsive. Alerts stopped firing, and the Prometheus UI was inaccessible. This caused a monitoring blackout for critical production services.
Diagnosis Steps:
Examined Prometheus logs for error messages.
Checked memory and CPU usage patterns.
Used promtool tsdb analyze to examine time series data.
Reviewed recent metric additions and changes.
Analyzed label cardinality with queries like:
```promql
count by (__name__) ({__name__=~".+"})
```
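A companion query that is often more actionable, since it names the biggest offenders directly (here, the ten metric names with the most series):
```promql
topk(10, count by (__name__) ({__name__=~".+"}))
```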
Root Cause:
A newly deployed application was generating metrics with high-cardinality labels (user IDs, session IDs, and request paths) without any filtering or aggregation. This created millions of unique time series, overwhelming Prometheus's storage and query capabilities.
Fix/Workaround:
• Short-term: Restarted Prometheus with stricter resource limits and added metric relabeling to drop the problematic metrics:
```yaml
scrape_configs:
  - job_name: 'problematic-app'
    kubernetes_sd_configs:
      - role: pod
    # Dropping by metric name happens after the scrape, so it belongs in
    # metric_relabel_configs rather than relabel_configs.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'app_request_(user|session)_.*'
        action: drop
```
• Long-term: Redesigned the application metrics to use appropriate aggregation and labeling:
```go
// Before: bad practice with high cardinality
requestCounter := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "app_request_total",
        Help: "Total requests by user and path",
    },
    []string{"user_id", "session_id", "path"}, // High cardinality!
)

// After: better practice with controlled cardinality
requestCounter := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "app_request_total",
        Help: "Total requests by status and path category",
    },
    []string{"status", "path_category"}, // Low cardinality
)

// For user-specific metrics, use a separate system or sampling.
```
• Implemented metric aggregation at the application level.
• Added cardinality monitoring and alerting.
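A sketch of the cardinality alerting added here, based on the automatically generated scrape_samples_scraped metric, so the offending job is identified before Prometheus itself is at risk; the threshold is illustrative:
```yaml
groups:
  - name: scrape-cardinality
    rules:
      - alert: ScrapeTargetTooManySamples
        expr: scrape_samples_scraped > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }}/{{ $labels.instance }} exposes {{ $value }} samples per scrape"
```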
Lessons Learned:
High-cardinality metrics can quickly overwhelm time series databases.
How to Avoid:
Establish metric naming and labeling standards.
Review metrics for cardinality issues before production deployment.
Use recording rules to pre-aggregate high-cardinality metrics.
Implement metric relabeling to control cardinality (see the scrape-limit sketch after this list).
Monitor and alert on time series growth.
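Beyond relabeling, scrape-level limits act as a hard backstop: if a target exceeds them, the scrape fails instead of flooding the TSDB. A sketch with illustrative values (label_limit and the length limits require Prometheus 2.27 or later; the target address is hypothetical):
```yaml
scrape_configs:
  - job_name: 'app'
    sample_limit: 10000             # fail the scrape if a target exposes more samples
    label_limit: 30                 # max labels per series
    label_value_length_limit: 200
    static_configs:
      - targets: ['app:9090']
```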
Summary: A network partition split the Elasticsearch cluster into two independent clusters because quorum settings were misconfigured.
What Happened:
Users reported missing logs and inconsistent search results. The operations team discovered that the Elasticsearch cluster had split into two separate clusters, each believing it was the authoritative source of truth.
Diagnosis Steps:
Checked cluster health with GET _cluster/health.
Examined node status with GET _cat/nodes?v.
Reviewed Elasticsearch logs for communication errors.
Analyzed network connectivity between nodes.
Checked discovery and quorum settings.
Root Cause:
The cluster experienced a network partition that isolated some nodes from others. The discovery configuration had an incorrect minimum_master_nodes setting (set too low), allowing both sides of the partition to elect their own master nodes and form separate clusters.
Fix/Workaround:
• Short-term: Resolved the network partition and restarted the cluster with proper settings.
• Long-term: Updated Elasticsearch configuration with proper discovery settings:
```yaml
# elasticsearch.yml (Elasticsearch 7.x)
discovery.seed_hosts:
  - es-node-1
  - es-node-2
  - es-node-3
  - es-node-4
  - es-node-5
cluster.initial_master_nodes:
  - es-node-1
  - es-node-2
  - es-node-3
# For Elasticsearch 6.x and earlier:
# discovery.zen.minimum_master_nodes: 3
```
• Implemented proper network redundancy.
• Added monitoring for cluster health and split-brain conditions, for example a Watcher alert that fires when fewer than the expected five nodes are visible:
```json
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "http": {
      "request": {
        "url": "http://localhost:9200/_cat/nodes?format=json"
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.length < 5"
    }
  },
  "actions": {
    "notify_ops": {
      "webhook": {
        "url": "https://alerts.example.com/elasticsearch",
        "body": "{\"message\": \"Elasticsearch node count alert: {{ctx.payload.length}} nodes found (expected 5)\"}"
      }
    }
  }
}
```
Lessons Learned:
Proper quorum settings are critical for distributed database stability.
How to Avoid:
Configure proper discovery and quorum settings.
Use an odd number of master-eligible nodes.
Implement redundant network connectivity.
Monitor cluster health and node count (see the alert-rule sketch after this list).
Test failure scenarios regularly.
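A sketch of the cluster-health and node-count monitoring mentioned above, assuming the prometheus-community elasticsearch_exporter is being scraped; the metric names come from that exporter, and the node count matches this five-node cluster:
```yaml
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterNotGreen
        expr: elasticsearch_cluster_health_status{color="green"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch cluster health is not green"
      - alert: ElasticsearchNodeCountLow
        expr: elasticsearch_cluster_health_number_of_nodes < 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Only {{ $value }} Elasticsearch nodes have joined the cluster (expected 5)"
```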