# API Gateway and Service Mesh Scenarios
No summary provided
What Happened:
During an automated certificate rotation, services began experiencing mutual TLS authentication failures. The issue started with intermittent 503 errors and gradually escalated to widespread service disruption across the mesh.
Diagnosis Steps:
Examined Istio proxy logs for authentication errors.
Checked certificate expiration dates and rotation status.
Verified Istio control plane component health.
Analyzed recent configuration changes and updates.
Tested certificate validation manually (see the verification sketch below).
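For the manual validation step, a minimal Go sketch along these lines can check a workload certificate chain against the mesh root CA with crypto/x509. The file names are illustrative and assume the PEMs were copied out of the istio-proxy sidecar; this is not the tooling used during the incident.
// verify_chain.go - hypothetical helper for manual chain validation
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	rootPEM, err := os.ReadFile("root-cert.pem") // mesh root CA, assumed copied locally
	if err != nil {
		log.Fatalf("read root cert: %v", err)
	}
	chainPEM, err := os.ReadFile("cert-chain.pem") // workload chain: leaf first, then intermediates
	if err != nil {
		log.Fatalf("read cert chain: %v", err)
	}

	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(rootPEM) {
		log.Fatal("no valid root certificates found")
	}

	// Parse every certificate in the chain file.
	var certs []*x509.Certificate
	rest := chainPEM
	for {
		var block *pem.Block
		block, rest = pem.Decode(rest)
		if block == nil {
			break
		}
		c, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			log.Fatalf("parse certificate: %v", err)
		}
		certs = append(certs, c)
	}
	if len(certs) == 0 {
		log.Fatal("no certificates found in chain file")
	}

	intermediates := x509.NewCertPool()
	for _, c := range certs[1:] {
		intermediates.AddCert(c)
	}

	// Verify the leaf against the root CA via any intermediates.
	if _, err := certs[0].Verify(x509.VerifyOptions{Roots: roots, Intermediates: intermediates}); err != nil {
		log.Fatalf("chain does NOT verify against the root CA: %v", err)
	}
	fmt.Printf("chain verifies; leaf expires %s\n", certs[0].NotAfter.Format(time.RFC3339))
}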
Root Cause:
The Istio certificate authority (Citadel) was unable to distribute new certificates due to a combination of issues:
1. The Kubernetes secret used for storing the root CA had incorrect permissions
2. A recent Istio upgrade changed the certificate rotation process without updating documentation
3. Custom certificate validation logic in some services rejected the new certificate format
Fix/Workaround:
• Short-term: Restored previous certificates and disabled automatic rotation:
# Patch to disable automatic rotation temporarily
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    certificates:
      - secretName: cacerts
        dnsNames:
          - istio-ca.istio-system.svc
    defaultConfig:
      proxyMetadata:
        ISTIO_META_CERT_ROTATION: "false"
• Long-term: Implemented proper certificate management:
# Proper Istio certificate configuration
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
          - name: PILOT_CERT_PROVIDER
            value: "istiod"
          - name: PILOT_ENABLE_XDS_CACHE
            value: "true"
    istiod:
      k8s:
        overlays:
          - apiVersion: apps/v1
            kind: Deployment
            name: istiod
            patches:
              - path: spec.template.spec.containers.[name:discovery].args[7]
                value: "--caCertTTL=8760h"
              - path: spec.template.spec.containers.[name:discovery].args[8]
                value: "--workloadCertTTL=24h"
  meshConfig:
    defaultConfig:
      proxyMetadata:
        ISTIO_META_CERT_ROTATION: "true"
        ISTIO_META_CERT_ROTATION_GRACE_PERIOD_RATIO: "0.2"
• Created a certificate monitoring solution:
// cert_monitor.go
package main
import (
"context"
"crypto/x509"
"encoding/pem"
"fmt"
"log"
"os"
"time"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
"net/http"
)
var (
certExpiryDays = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "istio_cert_expiry_days",
Help: "Days until certificate expiration",
},
[]string{"namespace", "secret_name", "cert_type"},
)
certRotationSuccess = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "istio_cert_rotation_success_total",
Help: "Total number of successful certificate rotations",
},
[]string{"namespace", "secret_name"},
)
certRotationFailure = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "istio_cert_rotation_failure_total",
Help: "Total number of failed certificate rotations",
},
[]string{"namespace", "secret_name", "reason"},
)
)
func main() {
// Set up Kubernetes client
config, err := rest.InClusterConfig()
if err != nil {
log.Fatalf("Failed to get cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Failed to create Kubernetes client: %v", err)
}
// Start HTTP server for Prometheus metrics
http.Handle("/metrics", promhttp.Handler())
go func() {
log.Fatal(http.ListenAndServe(":8080", nil))
}()
// Monitor certificates
monitorCertificates(clientset)
}
func monitorCertificates(clientset *kubernetes.Clientset) {
for {
// Get all namespaces
namespaces, err := clientset.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list namespaces: %v", err)
time.Sleep(5 * time.Minute)
continue
}
// Check certificates in each namespace
for _, namespace := range namespaces.Items {
ns := namespace.Name
// Get all secrets in the namespace
secrets, err := clientset.CoreV1().Secrets(ns).List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list secrets in namespace %s: %v", ns, err)
continue
}
// Check each secret for certificates
for _, secret := range secrets.Items {
// Check if this is a TLS secret
if secret.Type != "kubernetes.io/tls" && secret.Type != "istio.io/key-and-cert" {
continue
}
// Check certificate data
for key, data := range secret.Data {
if key == "ca.crt" || key == "tls.crt" || key == "cert-chain.pem" || key == "root-cert.pem" {
// Parse certificate
block, _ := pem.Decode(data)
if block == nil {
log.Printf("Failed to decode PEM block from %s in secret %s/%s", key, ns, secret.Name)
certRotationFailure.WithLabelValues(ns, secret.Name, "decode_failure").Inc()
continue
}
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
log.Printf("Failed to parse certificate from %s in secret %s/%s: %v", key, ns, secret.Name, err)
certRotationFailure.WithLabelValues(ns, secret.Name, "parse_failure").Inc()
continue
}
// Calculate days until expiration
expiryDays := time.Until(cert.NotAfter).Hours() / 24
certExpiryDays.WithLabelValues(ns, secret.Name, key).Set(expiryDays)
// Log warning if certificate is expiring soon
if expiryDays < 30 {
log.Printf("WARNING: Certificate %s in secret %s/%s expires in %.1f days", key, ns, secret.Name, expiryDays)
}
// Check if certificate was recently rotated
issuedDays := time.Since(cert.NotBefore).Hours() / 24
if issuedDays < 1 {
log.Printf("Certificate %s in secret %s/%s was recently rotated (%.1f hours ago)", key, ns, secret.Name, time.Since(cert.NotBefore).Hours())
certRotationSuccess.WithLabelValues(ns, secret.Name).Inc()
}
}
}
}
}
// Sleep before next check
time.Sleep(1 * time.Hour)
}
}
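As a sanity check for the gauge math above, a self-contained sketch like the following (assumed, not part of the monitor) generates a short-lived self-signed certificate and runs it through the same PEM-decode / parse / days-until-expiry path, so the calculation can be verified without a cluster.
// expiry_check_example.go - standalone sanity check for the expiry math
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"log"
	"math/big"
	"time"
)

func main() {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		log.Fatal(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example.istio-test"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour), // 24h "workload" cert
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		log.Fatal(err)
	}
	pemBytes := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})

	// Same path as the monitor: PEM decode, x509 parse, days until expiry.
	block, _ := pem.Decode(pemBytes)
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	expiryDays := time.Until(cert.NotAfter).Hours() / 24
	fmt.Printf("days until expiry: %.2f\n", expiryDays) // roughly 1.0 for a 24h cert
}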
• Implemented a certificate rotation testing procedure:
#!/bin/bash
# test_cert_rotation.sh
set -euo pipefail
NAMESPACE=${1:-istio-system}
SECRET_NAME=${2:-istio-ca-secret}
WORKLOAD_NAMESPACE=${3:-default}
WORKLOAD_NAME=${4:-sleep}
echo "Testing certificate rotation for Istio in namespace $NAMESPACE"
# Check istiod status
echo "Checking istiod status..."
kubectl get pods -n $NAMESPACE -l app=istiod
# Check current root certificate
echo "Checking current root certificate..."
kubectl get secret $SECRET_NAME -n $NAMESPACE -o jsonpath='{.data.root-cert\.pem}' | base64 -d | openssl x509 -noout -text | grep "Validity" -A 2
# Check workload certificates
echo "Checking workload certificates..."
POD_NAME=$(kubectl get pod -n $WORKLOAD_NAMESPACE -l app=$WORKLOAD_NAME -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n $WORKLOAD_NAMESPACE $POD_NAME -c istio-proxy -- ls -la /var/run/secrets/istio/
# Get certificate expiry
kubectl exec -n $WORKLOAD_NAMESPACE $POD_NAME -c istio-proxy -- cat /var/run/secrets/istio/cert-chain.pem | openssl x509 -noout -text | grep "Validity" -A 2
# Trigger certificate rotation
echo "Triggering certificate rotation..."
kubectl delete secret $SECRET_NAME -n $NAMESPACE
# Wait for istiod to restart
echo "Waiting for istiod to restart..."
kubectl rollout restart deployment/istiod -n $NAMESPACE
kubectl rollout status deployment/istiod -n $NAMESPACE
# Wait for workload certificates to be rotated
echo "Waiting for workload certificates to be rotated..."
sleep 60
# Verify new certificates
echo "Verifying new certificates..."
kubectl get secret $SECRET_NAME -n $NAMESPACE -o jsonpath='{.data.root-cert\.pem}' | base64 -d | openssl x509 -noout -text | grep "Validity" -A 2
# Verify workload certificates
echo "Verifying workload certificates..."
kubectl exec -n $WORKLOAD_NAMESPACE $POD_NAME -c istio-proxy -- cat /var/run/secrets/istio/cert-chain.pem | openssl x509 -noout -text | grep "Validity" -A 2
# Test connectivity
echo "Testing connectivity..."
kubectl exec -n $WORKLOAD_NAMESPACE $POD_NAME -c sleep -- curl -s httpbin.default:8000/headers | grep "X-Forwarded-Client-Cert"
echo "Certificate rotation test completed successfully"
Lessons Learned:
Certificate management in service meshes requires careful planning and monitoring.
How to Avoid:
Implement certificate monitoring with alerts for upcoming expirations.
Test certificate rotation procedures regularly in non-production environments.
Document certificate management procedures and automate where possible.
Use longer-lived root certificates and shorter-lived workload certificates.
Implement graceful certificate rotation with overlapping validity periods (see the timing sketch below).
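A minimal sketch of what "overlapping validity periods" means in practice: given a certificate's NotBefore/NotAfter and a grace-period ratio (0.2 matches the ISTIO_META_CERT_ROTATION_GRACE_PERIOD_RATIO used above), rotation should start that fraction of the lifetime before expiry, so old and new certificates are both valid during the switchover. The function name is illustrative.
// rotation_window.go - hedged sketch of grace-period rotation timing
package main

import (
	"fmt"
	"time"
)

// rotationDeadline returns the point at which a new certificate should be
// requested: graceRatio of the lifetime before expiry, so both certificates
// overlap while workloads reload.
func rotationDeadline(notBefore, notAfter time.Time, graceRatio float64) time.Time {
	lifetime := notAfter.Sub(notBefore)
	grace := time.Duration(float64(lifetime) * graceRatio)
	return notAfter.Add(-grace)
}

func main() {
	notBefore := time.Now()
	notAfter := notBefore.Add(24 * time.Hour) // 24h workload cert, as configured above
	deadline := rotationDeadline(notBefore, notAfter, 0.2)
	fmt.Printf("rotate at %s (%.1f hours before expiry)\n",
		deadline.Format(time.RFC3339), notAfter.Sub(deadline).Hours())
}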
No summary provided
What Happened:
A production API gateway started experiencing high CPU and memory usage, eventually leading to service degradation. The issue was traced to a single client making thousands of requests per second to a computationally expensive endpoint, despite rate limiting being configured.
Diagnosis Steps:
Analyzed API gateway logs and metrics to identify traffic patterns.
Examined rate limiting configuration and request headers.
Profiled API gateway performance during the incident.
Traced requests from the problematic client through the system.
Reviewed recent configuration changes to the API gateway.
Root Cause:
The client was able to bypass rate limiting by manipulating request headers. The rate limiting plugin was configured to use the X-Forwarded-For header for client identification, but the gateway was not validating or overwriting this header, allowing the client to spoof different IP addresses in each request.
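The gap can be illustrated with a minimal sketch (the trusted CIDRs and header handling are illustrative, not the gateway's actual configuration): the forwarded header is honored only when the directly connected peer is a trusted proxy, and only the right-most untrusted hop is used, so a client cannot spoof its rate-limiting identity.
// client_ip.go - sketch of trusted-proxy-aware client IP extraction
package main

import (
	"fmt"
	"net"
	"net/http"
	"strings"
)

var trustedProxies = mustParseCIDRs([]string{"10.0.0.0/8", "192.168.0.0/16"}) // illustrative

func mustParseCIDRs(cidrs []string) []*net.IPNet {
	var nets []*net.IPNet
	for _, c := range cidrs {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			panic(err)
		}
		nets = append(nets, n)
	}
	return nets
}

func isTrusted(ip net.IP) bool {
	for _, n := range trustedProxies {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

// clientIP returns the address to key rate limiting on. X-Forwarded-For is
// only consulted when the request arrived from a trusted proxy.
func clientIP(r *http.Request) string {
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		host = r.RemoteAddr
	}
	peer := net.ParseIP(host)
	if peer == nil || !isTrusted(peer) {
		return host // untrusted peer: ignore X-Forwarded-For entirely
	}
	hops := strings.Split(r.Header.Get("X-Forwarded-For"), ",")
	for i := len(hops) - 1; i >= 0; i-- {
		ip := net.ParseIP(strings.TrimSpace(hops[i]))
		if ip != nil && !isTrusted(ip) {
			return ip.String() // right-most address not added by our own proxies
		}
	}
	return host
}

func main() {
	r, _ := http.NewRequest("GET", "/expensive", nil)
	r.RemoteAddr = "203.0.113.7:40000"
	r.Header.Set("X-Forwarded-For", "1.2.3.4") // spoofed by the client
	fmt.Println(clientIP(r))                   // 203.0.113.7: the spoofed header is ignored
}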
Fix/Workaround:
• Short-term: Implemented immediate header validation and IP blocking:
# Kong API Gateway configuration patch
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: ip-restriction
plugin: ip-restriction
config:
  deny:
    - 203.0.113.0/24  # Malicious client IP range
• Reconfigured rate limiting to use multiple identifiers:
# Before: Vulnerable rate limiting configuration
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limiting
plugin: rate-limiting
config:
  minute: 60
  hour: 1000
  limit_by: ip
  policy: local
---
# After: Improved rate limiting configuration
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limiting
plugin: rate-limiting
config:
  minute: 60
  hour: 1000
  limit_by: consumer
  policy: redis
  redis_host: redis.infrastructure
  redis_port: 6379
  redis_timeout: 2000
  redis_database: 0
  hide_client_headers: false
• Long-term: Implemented a comprehensive API security strategy:
-- Custom Kong plugin for advanced request validation
-- save as kong/plugins/advanced-request-validator/handler.lua
local BasePlugin = require "kong.plugins.base_plugin"
local iputils = require "resty.iputils"
local jwt_decoder = require "kong.plugins.jwt.jwt_parser"
local AdvancedRequestValidator = BasePlugin:extend()
AdvancedRequestValidator.PRIORITY = 1005
AdvancedRequestValidator.VERSION = "1.0.0"
function AdvancedRequestValidator:new()
AdvancedRequestValidator.super.new(self, "advanced-request-validator")
end
function AdvancedRequestValidator:access(conf)
AdvancedRequestValidator.super.access(self)
local request_headers = kong.request.get_headers()
local client_ip = kong.client.get_forwarded_ip()
local request_path = kong.request.get_path()
local request_method = kong.request.get_method()
-- 1. Validate and sanitize headers
if request_headers["x-forwarded-for"] then
-- Overwrite with trusted value from Kong
kong.service.request.set_header("X-Forwarded-For", client_ip)
end
-- 2. Check for suspicious patterns
local user_agent = request_headers["user-agent"]
if not user_agent or user_agent == "" or string.find(user_agent:lower(), "bot") then
kong.log.warn("Suspicious user agent detected: ", user_agent)
-- Increment counter for this IP
local suspicious_count = kong.ctx.shared.suspicious_count or 0
suspicious_count = suspicious_count + 1
kong.ctx.shared.suspicious_count = suspicious_count
-- If multiple suspicious requests, add to temporary block list
if suspicious_count > conf.suspicious_threshold then
kong.log.err("Adding IP to temporary block list: ", client_ip)
-- This would typically update a distributed cache or database
end
end
-- 3. Validate JWT tokens if present
local auth_header = request_headers["authorization"]
if auth_header and auth_header:find("Bearer") == 1 then
local token = auth_header:sub(8)
local jwt, err = jwt_decoder:new(token)
if err then
kong.log.err("Invalid JWT: ", err)
return kong.response.exit(401, { message = "Invalid authentication credentials" })
end
-- Check token claims
local claims = jwt.claims
if claims.exp and claims.exp < os.time() then
return kong.response.exit(401, { message = "Token expired" })
end
-- Check if token is in deny list
-- This would typically check a distributed cache or database
end
-- 4. Apply additional rate limiting for expensive endpoints
if conf.expensive_endpoints[request_path] and request_method == "POST" then
-- Apply stricter rate limits for expensive operations
-- This would typically use a distributed counter
end
end
return AdvancedRequestValidator
• Implemented a distributed rate limiting solution with Redis:
// rate_limiter.go
package main
import (
"context"
"fmt"
"log"
"net/http"
"strconv"
"time"
"github.com/go-redis/redis/v8"
"github.com/google/uuid"
)
// RateLimiter implements a distributed rate limiter using Redis
type RateLimiter struct {
redisClient *redis.Client
keyPrefix string
windowSize time.Duration
limit int
}
// NewRateLimiter creates a new rate limiter
func NewRateLimiter(redisAddr, keyPrefix string, windowSize time.Duration, limit int) *RateLimiter {
client := redis.NewClient(&redis.Options{
Addr: redisAddr,
Password: "", // no password set
DB: 0, // use default DB
})
return &RateLimiter{
redisClient: client,
keyPrefix: keyPrefix,
windowSize: windowSize,
limit: limit,
}
}
// Allow checks if a request is allowed based on the rate limit
func (rl *RateLimiter) Allow(ctx context.Context, identifier string) (bool, int, error) {
// Create a unique key for this identifier and window
now := time.Now().UnixNano()
windowStart := now - int64(rl.windowSize)
key := fmt.Sprintf("%s:%s", rl.keyPrefix, identifier)
// Use Redis pipeline for efficiency
pipe := rl.redisClient.Pipeline()
// Remove old entries outside the current window
pipe.ZRemRangeByScore(ctx, key, "0", strconv.FormatInt(windowStart, 10))
// Add current request with score as current timestamp
requestID := uuid.New().String()
pipe.ZAdd(ctx, key, &redis.Z{Score: float64(now), Member: requestID})
// Get the count of requests in the current window
countCmd := pipe.ZCard(ctx, key)
// Set expiration on the key to clean up old data
pipe.Expire(ctx, key, rl.windowSize*2)
// Execute the pipeline
_, err := pipe.Exec(ctx)
if err != nil {
return false, 0, err
}
// Get the count of requests in the current window
count, err := countCmd.Result()
if err != nil {
return false, 0, err
}
// Check if the count exceeds the limit
return count <= int64(rl.limit), int(count), nil
}
// RateLimitMiddleware creates a middleware for rate limiting
func RateLimitMiddleware(rl *RateLimiter, identifierFunc func(*http.Request) string) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Get identifier for this request
identifier := identifierFunc(r)
// Check if request is allowed
allowed, count, err := rl.Allow(r.Context(), identifier)
if err != nil {
log.Printf("Rate limiter error: %v", err)
http.Error(w, "Internal Server Error", http.StatusInternalServerError)
return
}
// Set rate limit headers
w.Header().Set("X-RateLimit-Limit", strconv.Itoa(rl.limit))
w.Header().Set("X-RateLimit-Remaining", strconv.Itoa(rl.limit-int(count)))
w.Header().Set("X-RateLimit-Reset", strconv.FormatInt(time.Now().Add(rl.windowSize).Unix(), 10))
if !allowed {
w.Header().Set("Retry-After", strconv.Itoa(int(rl.windowSize.Seconds())))
http.Error(w, "Rate limit exceeded", http.StatusTooManyRequests)
return
}
next.ServeHTTP(w, r)
})
}
}
// GetClientIdentifier returns a function that extracts a client identifier from a request
func GetClientIdentifier(useMultipleFactors bool) func(*http.Request) string {
return func(r *http.Request) string {
if !useMultipleFactors {
// Simple IP-based identification
return r.RemoteAddr
}
// Multi-factor identification
userAgent := r.UserAgent()
authHeader := r.Header.Get("Authorization")
// Extract user ID from JWT if available
userID := "anonymous"
if authHeader != "" {
// In a real implementation, this would parse and validate the JWT
// For this example, we'll just use a placeholder
userID = "user-from-jwt"
}
// Combine factors
return fmt.Sprintf("%s:%s:%s", r.RemoteAddr, userAgent, userID)
}
}
func main() {
// Create a rate limiter with a 1-minute window and 60 requests limit
rateLimiter := NewRateLimiter("localhost:6379", "ratelimit", time.Minute, 60)
// Create a middleware that uses multiple factors for identification
middleware := RateLimitMiddleware(rateLimiter, GetClientIdentifier(true))
// Create a simple HTTP server with rate limiting
http.Handle("/api/", middleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, "Hello, you've requested: %s\n", r.URL.Path)
})))
log.Println("Starting server on :8080")
log.Fatal(http.ListenAndServe(":8080", nil))
}
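A usage sketch for the middleware above, written as a test in the same package. It assumes the rate_limiter.go code is alongside it and that a Redis instance is reachable at localhost:6379 (a hypothetical test setup); the limit of 3 is chosen only to make the 429 easy to trigger.
// rate_limiter_example_test.go - assumes rate_limiter.go and a local Redis
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

func TestRateLimitMiddleware(t *testing.T) {
	rl := NewRateLimiter("localhost:6379", "ratelimit-test", time.Minute, 3)
	handler := RateLimitMiddleware(rl, GetClientIdentifier(false))(
		http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "ok")
		}))

	// The first three requests should pass, the fourth should be limited.
	for i := 1; i <= 4; i++ {
		req := httptest.NewRequest(http.MethodGet, "/api/test", nil)
		req.RemoteAddr = "192.0.2.1:12345" // same client identity on every request
		rec := httptest.NewRecorder()
		handler.ServeHTTP(rec, req)
		if i <= 3 && rec.Code != http.StatusOK {
			t.Fatalf("request %d: expected 200, got %d", i, rec.Code)
		}
		if i == 4 && rec.Code != http.StatusTooManyRequests {
			t.Fatalf("request %d: expected 429, got %d", i, rec.Code)
		}
	}
}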
• Implemented a comprehensive API security monitoring system:
# Prometheus alerting rules for API security
groups:
  - name: api_security
    rules:
      - alert: HighRateLimitViolations
        expr: sum(rate(kong_http_status{status="429"}[5m])) by (service) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate limit violations for {{ $labels.service }}"
          description: "Service {{ $labels.service }} is experiencing high rate limit violations ({{ $value }} per second)"
      - alert: UnusualRequestPatterns
        expr: sum(rate(http_requests_total{status=~"4.."}[5m])) by (service) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual request patterns for {{ $labels.service }}"
          description: "Service {{ $labels.service }} is receiving a high number of 4xx errors ({{ $value }} per second)"
      - alert: PotentialAPIScraping
        expr: sum(rate(http_requests_total[5m])) by (client_ip) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Potential API scraping from {{ $labels.client_ip }}"
          description: "Client IP {{ $labels.client_ip }} is making a high number of requests ({{ $value }} per second)"
      - alert: APIGatewayHighCPU
        expr: avg(rate(container_cpu_usage_seconds_total{container=~"kong.*"}[5m])) by (pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API Gateway pod {{ $labels.pod }} high CPU usage"
          description: "API Gateway pod {{ $labels.pod }} is using {{ $value | humanizePercentage }} of CPU"
      - alert: APIGatewayHighMemory
        expr: avg(container_memory_usage_bytes{container=~"kong.*"} / container_spec_memory_limit_bytes{container=~"kong.*"}) by (pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API Gateway pod {{ $labels.pod }} high memory usage"
          description: "API Gateway pod {{ $labels.pod }} is using {{ $value | humanizePercentage }} of memory"
Lessons Learned:
API gateway security requires defense in depth with multiple validation layers.
How to Avoid:
Never trust client-provided headers for rate limiting or authentication.
Implement multiple identification factors for rate limiting.
Use distributed rate limiting for high-traffic APIs.
Monitor for unusual traffic patterns and implement automatic blocking.
Regularly audit API gateway configurations for security vulnerabilities.
No summary provided
What Happened:
A company's payment processing API suddenly experienced a significant performance degradation, with response times increasing from milliseconds to several seconds. The operations team observed a massive spike in traffic to specific API endpoints. Despite having rate limiting configured in the Kong API gateway, the attacker was able to bypass these controls and overwhelm the backend services.
Diagnosis Steps:
Analyzed API gateway logs to identify traffic patterns.
Examined rate limiting configuration and behavior.
Reviewed client IP addresses and request headers.
Monitored backend service performance metrics.
Analyzed the attack pattern and request signatures.
Root Cause:
The investigation revealed multiple issues with the rate limiting implementation:
1. Rate limiting was configured based solely on client IP addresses
2. The attacker used multiple proxy servers to distribute requests across many source IPs
3. The API gateway was not configured to detect and block distributed attacks
4. Rate limits were set too high for sensitive endpoints
5. The API gateway wasn't validating API keys properly for some endpoints
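Issues 2 and 3 are about fan-out rather than per-client volume: each spoofed IP stays under its own limit while the aggregate overwhelms the backend. A minimal in-memory sketch of the missing detection (the thresholds and the method+path signature are illustrative; a production version would use a shared store) counts distinct source IPs per request signature inside a short window.
// signature_tracker.go - sketch of distributed-attack detection by fan-out
package main

import (
	"fmt"
	"sync"
	"time"
)

type signatureTracker struct {
	mu     sync.Mutex
	window time.Duration
	maxIPs int
	seen   map[string]map[string]time.Time // signature -> ip -> last seen
}

func newSignatureTracker(window time.Duration, maxIPs int) *signatureTracker {
	return &signatureTracker{window: window, maxIPs: maxIPs, seen: map[string]map[string]time.Time{}}
}

// Observe records one request and reports whether the signature now looks
// like a distributed attack (too many distinct IPs within the window).
func (t *signatureTracker) Observe(signature, ip string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	now := time.Now()
	ips, ok := t.seen[signature]
	if !ok {
		ips = map[string]time.Time{}
		t.seen[signature] = ips
	}
	ips[ip] = now
	// Drop IPs that fell out of the window.
	for addr, last := range ips {
		if now.Sub(last) > t.window {
			delete(ips, addr)
		}
	}
	return len(ips) > t.maxIPs
}

func main() {
	tr := newSignatureTracker(time.Minute, 50)
	for i := 0; i < 60; i++ {
		ip := fmt.Sprintf("198.51.100.%d", i) // many proxies, one request pattern
		if tr.Observe("POST /api/v1/payments", ip) {
			fmt.Printf("distributed attack suspected after %d distinct IPs\n", i+1)
			break
		}
	}
}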
Fix/Workaround:
• Short-term: Implemented immediate mitigations:
# Before: Simple IP-based rate limiting in Kong
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limiting
  namespace: api-gateway
config:
  minute: 60
  limit_by: ip
  policy: local
  fault_tolerant: true
  hide_client_headers: false
  redis_ssl: false
  redis_ssl_verify: false
---
# After: Enhanced rate limiting with multiple identifiers
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: advanced-rate-limiting
  namespace: api-gateway
config:
  minute: 30
  hour: 500
  limit_by: credential,header,ip
  header_name: X-Forwarded-For
  path: null
  policy: redis
  fault_tolerant: true
  hide_client_headers: false
  redis_host: redis.rate-limiting
  redis_port: 6379
  redis_password: ${REDIS_PASSWORD}
  redis_timeout: 2000
  redis_database: 0
  redis_ssl: true
  redis_ssl_verify: true
• Implemented request validation to prevent malformed requests:
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: request-validator
  namespace: api-gateway
plugin: request-validator
config:
  body_schema: |
    {
      "type": "object",
      "properties": {
        "payment_id": { "type": "string", "pattern": "^[a-zA-Z0-9-_]+$" },
        "amount": { "type": "number", "minimum": 0.01 },
        "currency": { "type": "string", "enum": ["USD", "EUR", "GBP", "JPY"] },
        "description": { "type": "string", "maxLength": 255 }
      },
      "required": ["payment_id", "amount", "currency"]
    }
  verbose_response: false
  allowed_content_types:
    - application/json
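For reference, the constraints that schema enforces can also be expressed as a small Go check, useful for unit-testing the rules outside the gateway. This is a sketch, not the gateway's implementation; the struct and function names are illustrative.
// payment_validation.go - sketch mirroring the request-validator schema
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
)

type PaymentRequest struct {
	PaymentID   string  `json:"payment_id"`
	Amount      float64 `json:"amount"`
	Currency    string  `json:"currency"`
	Description string  `json:"description"`
}

var (
	paymentIDPattern  = regexp.MustCompile(`^[a-zA-Z0-9-_]+$`)
	allowedCurrencies = map[string]bool{"USD": true, "EUR": true, "GBP": true, "JPY": true}
)

func validatePayment(body []byte) (*PaymentRequest, error) {
	var p PaymentRequest
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, fmt.Errorf("malformed JSON: %w", err)
	}
	switch {
	case !paymentIDPattern.MatchString(p.PaymentID):
		return nil, errors.New("payment_id must match ^[a-zA-Z0-9-_]+$")
	case p.Amount < 0.01:
		return nil, errors.New("amount must be at least 0.01")
	case !allowedCurrencies[p.Currency]:
		return nil, errors.New("currency must be one of USD, EUR, GBP, JPY")
	case len(p.Description) > 255:
		return nil, errors.New("description must be at most 255 characters")
	}
	return &p, nil
}

func main() {
	_, err := validatePayment([]byte(`{"payment_id":"ord-42","amount":0.001,"currency":"USD"}`))
	fmt.Println(err) // amount must be at least 0.01
}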
• Long-term: Implemented a comprehensive API security solution:
// api_security.go - Advanced rate limiting and security service
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"net/http"
"os"
"strconv"
"strings"
"time"
"github.com/go-redis/redis/v8"
"github.com/gorilla/mux"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
// Configuration constants
const (
DefaultRateLimit = 30
DefaultRateLimitWindow = time.Minute
DefaultBurstLimit = 5
DefaultBlockDuration = 1 * time.Hour
SuspiciousThreshold = 3
BlockThreshold = 5
)
// Metrics
var (
requestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "api_requests_total",
Help: "Total number of API requests",
},
[]string{"path", "method", "status"},
)
requestsBlocked = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "api_requests_blocked",
Help: "Total number of blocked API requests",
},
[]string{"path", "method", "reason"},
)
rateLimitExceeded = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "api_rate_limit_exceeded",
Help: "Total number of rate limit exceeded events",
},
[]string{"path", "method", "client_id"},
)
clientBlockedCount = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "api_clients_blocked",
Help: "Number of currently blocked clients",
},
[]string{"reason"},
)
requestLatency = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "api_request_duration_seconds",
Help: "API request latency in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"path", "method"},
)
)
// ClientIdentifier contains all information used to identify a client
type ClientIdentifier struct {
ClientIP string
APIKey string
UserAgent string
XForwardedFor string
SessionID string
AccountID string
RequestSignature string
}
// RateLimitConfig defines rate limiting parameters for an endpoint
type RateLimitConfig struct {
Path string
Method string
Limit int
Window time.Duration
BurstLimit int
BlockDuration time.Duration
SensitiveEndpoint bool
}
// RateLimiter handles rate limiting logic
type RateLimiter struct {
redisClient *redis.Client
configs map[string]RateLimitConfig
defaultConfig RateLimitConfig
}
// NewRateLimiter creates a new rate limiter
func NewRateLimiter(redisAddr, redisPassword string, db int) (*RateLimiter, error) {
client := redis.NewClient(&redis.Options{
Addr: redisAddr,
Password: redisPassword,
DB: db,
})
// Test connection
ctx := context.Background()
_, err := client.Ping(ctx).Result()
if err != nil {
return nil, fmt.Errorf("failed to connect to Redis: %v", err)
}
return &RateLimiter{
redisClient: client,
configs: make(map[string]RateLimitConfig),
defaultConfig: RateLimitConfig{
Limit: DefaultRateLimit,
Window: DefaultRateLimitWindow,
BurstLimit: DefaultBurstLimit,
BlockDuration: DefaultBlockDuration,
},
}, nil
}
// AddConfig adds a rate limit configuration for a specific endpoint
func (rl *RateLimiter) AddConfig(config RateLimitConfig) {
key := fmt.Sprintf("%s:%s", config.Method, config.Path)
rl.configs[key] = config
}
// getConfig returns the rate limit configuration for a specific endpoint
func (rl *RateLimiter) getConfig(method, path string) RateLimitConfig {
key := fmt.Sprintf("%s:%s", method, path)
if config, ok := rl.configs[key]; ok {
return config
}
// Try with path pattern matching (simplified version)
for configKey, config := range rl.configs {
parts := strings.Split(configKey, ":")
if len(parts) != 2 {
continue
}
configMethod := parts[0]
configPath := parts[1]
// Skip if method doesn't match
if configMethod != method && configMethod != "*" {
continue
}
// Check if path matches pattern
if strings.Contains(configPath, "*") {
pattern := strings.Replace(configPath, "*", ".*", -1)
// In a real implementation, use proper regex matching
if strings.HasPrefix(path, strings.TrimSuffix(pattern, ".*")) {
return config
}
}
}
return rl.defaultConfig
}
// generateClientKey creates a composite key for identifying a client
func generateClientKey(identifier ClientIdentifier) string {
// Create a composite key using multiple identifiers
components := []string{
identifier.ClientIP,
identifier.APIKey,
identifier.XForwardedFor,
identifier.AccountID,
}
// Filter out empty components
var filteredComponents []string
for _, component := range components {
if component != "" {
filteredComponents = append(filteredComponents, component)
}
}
// If we have no components, use IP as fallback
if len(filteredComponents) == 0 {
return identifier.ClientIP
}
return strings.Join(filteredComponents, ":")
}
// generateRequestSignature creates a signature for the request to detect patterns
func generateRequestSignature(r *http.Request) string {
// In a real implementation, this would create a hash of request characteristics
// such as headers, query parameters, and payload structure
return fmt.Sprintf("%s:%s", r.Method, r.URL.Path)
}
// CheckRateLimit checks if a request exceeds the rate limit
func (rl *RateLimiter) CheckRateLimit(ctx context.Context, identifier ClientIdentifier, method, path string) (bool, error) {
config := rl.getConfig(method, path)
clientKey := generateClientKey(identifier)
// Check if client is blocked
blockedKey := fmt.Sprintf("blocked:%s", clientKey)
blocked, err := rl.redisClient.Exists(ctx, blockedKey).Result()
if err != nil {
return false, fmt.Errorf("failed to check if client is blocked: %v", err)
}
if blocked > 0 {
// Client is blocked
blockExpiration, err := rl.redisClient.TTL(ctx, blockedKey).Result()
if err != nil {
return false, fmt.Errorf("failed to get block expiration: %v", err)
}
log.Printf("Client %s is blocked for %v", clientKey, blockExpiration)
requestsBlocked.WithLabelValues(path, method, "client_blocked").Inc()
return false, nil
}
// Check rate limit
windowKey := fmt.Sprintf("ratelimit:%s:%s:%s:%d", clientKey, method, path, time.Now().Unix()/int64(config.Window.Seconds()))
// Increment counter
count, err := rl.redisClient.Incr(ctx, windowKey).Result()
if err != nil {
return false, fmt.Errorf("failed to increment rate limit counter: %v", err)
}
// Set expiration if this is a new key
if count == 1 {
rl.redisClient.Expire(ctx, windowKey, config.Window)
}
// Check if rate limit is exceeded
if count > int64(config.Limit) {
// Record rate limit exceeded event
rateLimitExceeded.WithLabelValues(path, method, clientKey).Inc()
// Increment suspicious activity counter
suspiciousKey := fmt.Sprintf("suspicious:%s", clientKey)
suspiciousCount, err := rl.redisClient.Incr(ctx, suspiciousKey).Result()
if err != nil {
log.Printf("Failed to increment suspicious activity counter: %v", err)
} else {
// Set expiration if this is a new key
if suspiciousCount == 1 {
rl.redisClient.Expire(ctx, suspiciousKey, 24*time.Hour)
}
// Check if client should be blocked
if suspiciousCount >= BlockThreshold {
// Block client
rl.redisClient.Set(ctx, blockedKey, "blocked", config.BlockDuration)
log.Printf("Client %s has been blocked for %v due to excessive rate limit violations", clientKey, config.BlockDuration)
clientBlockedCount.WithLabelValues("rate_limit_violations").Inc()
} else if suspiciousCount >= SuspiciousThreshold {
log.Printf("Client %s has suspicious activity (%d violations)", clientKey, suspiciousCount)
}
}
return false, nil
}
// Check for distributed attacks by analyzing patterns across clients
if config.SensitiveEndpoint {
// Track request signature
signatureKey := fmt.Sprintf("signature:%s:%d", identifier.RequestSignature, time.Now().Unix()/60)
signatureCount, err := rl.redisClient.Incr(ctx, signatureKey).Result()
if err != nil {
log.Printf("Failed to track request signature: %v", err)
} else {
// Set expiration if this is a new key
if signatureCount == 1 {
rl.redisClient.Expire(ctx, signatureKey, 10*time.Minute)
}
// Check for distributed attack patterns
if signatureCount > int64(config.Limit*3) {
log.Printf("Potential distributed attack detected with signature %s (%d requests)", identifier.RequestSignature, signatureCount)
requestsBlocked.WithLabelValues(path, method, "distributed_attack").Inc()
return false, nil
}
}
}
return true, nil
}
// RateLimitMiddleware is a middleware that applies rate limiting
func (rl *RateLimiter) RateLimitMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
ctx := r.Context()
// Extract client identifiers
identifier := ClientIdentifier{
ClientIP: r.RemoteAddr,
APIKey: r.Header.Get("X-API-Key"),
UserAgent: r.Header.Get("User-Agent"),
XForwardedFor: r.Header.Get("X-Forwarded-For"),
SessionID: r.Header.Get("X-Session-ID"),
AccountID: r.Header.Get("X-Account-ID"),
RequestSignature: generateRequestSignature(r),
}
// Check rate limit
allowed, err := rl.CheckRateLimit(ctx, identifier, r.Method, r.URL.Path)
if err != nil {
log.Printf("Rate limit check failed: %v", err)
http.Error(w, "Internal Server Error", http.StatusInternalServerError)
requestsTotal.WithLabelValues(r.URL.Path, r.Method, strconv.Itoa(http.StatusInternalServerError)).Inc()
return
}
if !allowed {
http.Error(w, "Rate limit exceeded", http.StatusTooManyRequests)
requestsTotal.WithLabelValues(r.URL.Path, r.Method, strconv.Itoa(http.StatusTooManyRequests)).Inc()
return
}
// Call the next handler
next.ServeHTTP(w, r)
// Record metrics
duration := time.Since(start).Seconds()
requestLatency.WithLabelValues(r.URL.Path, r.Method).Observe(duration)
})
}
func main() {
// Initialize rate limiter
redisAddr := os.Getenv("REDIS_ADDR")
if redisAddr == "" {
redisAddr = "localhost:6379"
}
redisPassword := os.Getenv("REDIS_PASSWORD")
redisDB := 0
if dbStr := os.Getenv("REDIS_DB"); dbStr != "" {
var err error
redisDB, err = strconv.Atoi(dbStr)
if err != nil {
log.Fatalf("Invalid REDIS_DB value: %v", err)
}
}
rateLimiter, err := NewRateLimiter(redisAddr, redisPassword, redisDB)
if err != nil {
log.Fatalf("Failed to initialize rate limiter: %v", err)
}
// Configure rate limits for different endpoints
rateLimiter.AddConfig(RateLimitConfig{
Path: "/api/v1/payments",
Method: "POST",
Limit: 10,
Window: time.Minute,
BurstLimit: 2,
BlockDuration: 2 * time.Hour,
SensitiveEndpoint: true,
})
rateLimiter.AddConfig(RateLimitConfig{
Path: "/api/v1/accounts/*",
Method: "GET",
Limit: 100,
Window: time.Minute,
BurstLimit: 10,
BlockDuration: 1 * time.Hour,
SensitiveEndpoint: true,
})
rateLimiter.AddConfig(RateLimitConfig{
Path: "/api/v1/products",
Method: "GET",
Limit: 300,
Window: time.Minute,
BurstLimit: 30,
BlockDuration: 30 * time.Minute,
SensitiveEndpoint: false,
})
// Create router
router := mux.NewRouter()
// Add metrics endpoint
router.Path("/metrics").Handler(promhttp.Handler())
// Add API endpoints
apiRouter := router.PathPrefix("/api/v1").Subrouter()
apiRouter.Use(rateLimiter.RateLimitMiddleware)
// Example API endpoints
apiRouter.HandleFunc("/payments", func(w http.ResponseWriter, r *http.Request) {
// Process payment
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]string{"status": "success"})
requestsTotal.WithLabelValues(r.URL.Path, r.Method, "200").Inc()
}).Methods("POST")
apiRouter.HandleFunc("/accounts/{id}", func(w http.ResponseWriter, r *http.Request) {
// Get account details
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]string{"account_id": "123", "status": "active"})
requestsTotal.WithLabelValues(r.URL.Path, r.Method, "200").Inc()
}).Methods("GET")
apiRouter.HandleFunc("/products", func(w http.ResponseWriter, r *http.Request) {
// Get products
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode([]map[string]string{
{"id": "1", "name": "Product 1"},
{"id": "2", "name": "Product 2"},
})
requestsTotal.WithLabelValues(r.URL.Path, r.Method, "200").Inc()
}).Methods("GET")
// Start server
port := os.Getenv("PORT")
if port == "" {
port = "8080"
}
log.Printf("Starting server on :%s", port)
log.Fatal(http.ListenAndServe(":"+port, router))
}
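A quick way to exercise the per-endpoint limits configured above is a throwaway client loop, run as a separate program. It assumes the service is listening on localhost:8080 with Redis available, and that HTTP keep-alive reuses a single connection so the composite client key (which includes RemoteAddr) stays stable; the API key header is illustrative.
// payments_client_example.go - throwaway client to trip the payments limit
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	client := &http.Client{}
	for i := 1; i <= 12; i++ {
		req, err := http.NewRequest(http.MethodPost, "http://localhost:8080/api/v1/payments",
			bytes.NewBufferString(`{"amount": 10.0}`))
		if err != nil {
			log.Fatal(err)
		}
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("X-API-Key", "demo-key") // keeps part of the composite key stable
		resp, err := client.Do(req)
		if err != nil {
			log.Fatalf("request %d failed: %v", i, err)
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection is reused
		resp.Body.Close()
		fmt.Printf("request %2d -> %d\n", i, resp.StatusCode)
		if resp.StatusCode == http.StatusTooManyRequests {
			fmt.Println("hit the 10/minute limit configured for POST /api/v1/payments")
			break
		}
	}
}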
• Implemented a comprehensive API security monitoring dashboard:
// api_security_dashboard.tsx
import React, { useState, useEffect } from 'react';
import {
LineChart, Line, BarChart, Bar, PieChart, Pie,
XAxis, YAxis, CartesianGrid, Tooltip, Legend,
ResponsiveContainer, Cell
} from 'recharts';
import {
  Card, CardContent, Typography, Grid,
  Select, MenuItem, FormControl, InputLabel,
  Button, Tabs, Tab, Box, Table, TableBody,
  TableCell, TableContainer, TableHead, TableRow,
  Paper, Chip, TextField
} from '@material-ui/core';
import { DatePicker } from '@material-ui/pickers';
interface ApiMetrics {
timestamp: string;
requestsTotal: number;
requestsBlocked: number;
rateLimitExceeded: number;
averageLatency: number;
p95Latency: number;
p99Latency: number;
clientsBlocked: number;
suspiciousActivities: number;
}
interface EndpointMetrics {
path: string;
method: string;
requestsTotal: number;
requestsBlocked: number;
rateLimitExceeded: number;
averageLatency: number;
errorRate: number;
}
interface BlockedClient {
clientId: string;
blockedSince: string;
blockedUntil: string;
reason: string;
violationCount: number;
endpoints: string[];
}
interface SecurityEvent {
timestamp: string;
eventType: string;
clientId: string;
path: string;
method: string;
description: string;
severity: 'low' | 'medium' | 'high' | 'critical';
}
const COLORS = ['#0088FE', '#00C49F', '#FFBB28', '#FF8042', '#8884D8'];
const ApiSecurityDashboard: React.FC = () => {
const [timeRange, setTimeRange] = useState<string>('24h');
const [startDate, setStartDate] = useState<Date | null>(null);
const [endDate, setEndDate] = useState<Date | null>(null);
const [apiMetrics, setApiMetrics] = useState<ApiMetrics[]>([]);
const [endpointMetrics, setEndpointMetrics] = useState<EndpointMetrics[]>([]);
const [blockedClients, setBlockedClients] = useState<BlockedClient[]>([]);
const [securityEvents, setSecurityEvents] = useState<SecurityEvent[]>([]);
const [tabValue, setTabValue] = useState(0);
useEffect(() => {
// Fetch initial data
fetchMetrics();
}, []);
useEffect(() => {
// Fetch data when filters change
fetchMetrics();
}, [timeRange, startDate, endDate]);
const fetchMetrics = async () => {
// In a real implementation, this would call an API with the selected filters
// For this example, we'll use mock data
// Mock API metrics
const mockApiMetrics: ApiMetrics[] = Array.from({ length: 24 }, (_, i) => {
const date = new Date();
date.setHours(date.getHours() - (23 - i));
return {
timestamp: date.toISOString(),
requestsTotal: 1000 + Math.floor(Math.random() * 500),
requestsBlocked: Math.floor(Math.random() * 50),
rateLimitExceeded: Math.floor(Math.random() * 30),
averageLatency: 50 + Math.random() * 30,
p95Latency: 100 + Math.random() * 50,
p99Latency: 200 + Math.random() * 100,
clientsBlocked: Math.floor(Math.random() * 5),
suspiciousActivities: Math.floor(Math.random() * 10)
};
});
// Mock endpoint metrics
const mockEndpointMetrics: EndpointMetrics[] = [
{
path: "/api/v1/payments",
method: "POST",
requestsTotal: 5432,
requestsBlocked: 123,
rateLimitExceeded: 87,
averageLatency: 78.5,
errorRate: 2.3
},
{
path: "/api/v1/accounts/{id}",
method: "GET",
requestsTotal: 12543,
requestsBlocked: 234,
rateLimitExceeded: 156,
averageLatency: 45.2,
errorRate: 1.8
},
{
path: "/api/v1/products",
method: "GET",
requestsTotal: 28765,
requestsBlocked: 89,
rateLimitExceeded: 45,
averageLatency: 32.1,
errorRate: 0.7
},
{
path: "/api/v1/orders",
method: "POST",
requestsTotal: 3421,
requestsBlocked: 67,
rateLimitExceeded: 42,
averageLatency: 85.3,
errorRate: 1.5
},
{
path: "/api/v1/users/{id}",
method: "GET",
requestsTotal: 8765,
requestsBlocked: 45,
rateLimitExceeded: 23,
averageLatency: 38.7,
errorRate: 0.9
}
];
// Mock blocked clients
const mockBlockedClients: BlockedClient[] = [
{
clientId: "192.168.1.100:api_key_123",
blockedSince: "2023-05-01T10:23:45Z",
blockedUntil: "2023-05-01T12:23:45Z",
reason: "Rate limit exceeded",
violationCount: 12,
endpoints: ["/api/v1/payments", "/api/v1/accounts/{id}"]
},
{
clientId: "192.168.1.101:api_key_456",
blockedSince: "2023-05-01T11:15:22Z",
blockedUntil: "2023-05-01T13:15:22Z",
reason: "Suspicious activity",
violationCount: 8,
endpoints: ["/api/v1/payments"]
},
{
clientId: "192.168.1.102:api_key_789",
blockedSince: "2023-05-01T09:45:12Z",
blockedUntil: "2023-05-01T11:45:12Z",
reason: "Distributed attack",
violationCount: 15,
endpoints: ["/api/v1/accounts/{id}", "/api/v1/users/{id}"]
}
];
// Mock security events
const mockSecurityEvents: SecurityEvent[] = [
{
timestamp: "2023-05-01T10:23:45Z",
eventType: "RATE_LIMIT_EXCEEDED",
clientId: "192.168.1.100:api_key_123",
path: "/api/v1/payments",
method: "POST",
description: "Client exceeded rate limit (10 requests/minute)",
severity: "medium"
},
{
timestamp: "2023-05-01T11:15:22Z",
eventType: "SUSPICIOUS_ACTIVITY",
clientId: "192.168.1.101:api_key_456",
path: "/api/v1/payments",
method: "POST",
description: "Multiple failed payment attempts with invalid data",
severity: "high"
},
{
timestamp: "2023-05-01T09:45:12Z",
eventType: "DISTRIBUTED_ATTACK",
clientId: "192.168.1.102:api_key_789",
path: "/api/v1/accounts/{id}",
method: "GET",
description: "Distributed attack detected from multiple IPs with same request pattern",
severity: "critical"
},
{
timestamp: "2023-05-01T08:32:18Z",
eventType: "INVALID_API_KEY",
clientId: "192.168.1.103",
path: "/api/v1/orders",
method: "POST",
description: "Multiple requests with invalid API keys",
severity: "low"
},
{
timestamp: "2023-05-01T12:05:33Z",
eventType: "PAYLOAD_ATTACK",
clientId: "192.168.1.104:api_key_321",
path: "/api/v1/users/{id}",
method: "PUT",
description: "Potential SQL injection attempt in request payload",
severity: "critical"
}
];
setApiMetrics(mockApiMetrics);
setEndpointMetrics(mockEndpointMetrics);
setBlockedClients(mockBlockedClients);
setSecurityEvents(mockSecurityEvents);
};
const handleTimeRangeChange = (event: React.ChangeEvent<{ value: unknown }>) => {
setTimeRange(event.target.value as string);
};
const handleTabChange = (event: React.ChangeEvent<{}>, newValue: number) => {
setTabValue(newValue);
};
const renderOverviewTab = () => (
<Grid container spacing={3}>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">API Requests</Typography>
<ResponsiveContainer width="100%" height={300}>
<LineChart data={apiMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis
dataKey="timestamp"
tickFormatter={(timestamp) => new Date(timestamp).toLocaleTimeString()}
/>
<YAxis />
<Tooltip
labelFormatter={(timestamp) => new Date(timestamp).toLocaleString()}
/>
<Legend />
<Line type="monotone" dataKey="requestsTotal" name="Total Requests" stroke="#8884d8" />
<Line type="monotone" dataKey="requestsBlocked" name="Blocked Requests" stroke="#ff8042" />
</LineChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">API Latency</Typography>
<ResponsiveContainer width="100%" height={300}>
<LineChart data={apiMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis
dataKey="timestamp"
tickFormatter={(timestamp) => new Date(timestamp).toLocaleTimeString()}
/>
<YAxis />
<Tooltip
labelFormatter={(timestamp) => new Date(timestamp).toLocaleString()}
/>
<Legend />
<Line type="monotone" dataKey="averageLatency" name="Avg Latency (ms)" stroke="#8884d8" />
<Line type="monotone" dataKey="p95Latency" name="P95 Latency (ms)" stroke="#82ca9d" />
<Line type="monotone" dataKey="p99Latency" name="P99 Latency (ms)" stroke="#ff8042" />
</LineChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">Rate Limiting</Typography>
<ResponsiveContainer width="100%" height={300}>
<LineChart data={apiMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis
dataKey="timestamp"
tickFormatter={(timestamp) => new Date(timestamp).toLocaleTimeString()}
/>
<YAxis />
<Tooltip
labelFormatter={(timestamp) => new Date(timestamp).toLocaleString()}
/>
<Legend />
<Line type="monotone" dataKey="rateLimitExceeded" name="Rate Limit Exceeded" stroke="#8884d8" />
<Line type="monotone" dataKey="clientsBlocked" name="Clients Blocked" stroke="#ff8042" />
</LineChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">Top Endpoints by Traffic</Typography>
<ResponsiveContainer width="100%" height={300}>
<BarChart data={endpointMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="path" />
<YAxis />
<Tooltip />
<Legend />
<Bar dataKey="requestsTotal" name="Total Requests" fill="#8884d8" />
</BarChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
</Grid>
);
const renderEndpointsTab = () => (
<Grid container spacing={3}>
<Grid item xs={12}>
<TableContainer component={Paper}>
<Table>
<TableHead>
<TableRow>
<TableCell>Endpoint</TableCell>
<TableCell>Method</TableCell>
<TableCell align="right">Total Requests</TableCell>
<TableCell align="right">Blocked Requests</TableCell>
<TableCell align="right">Rate Limit Exceeded</TableCell>
<TableCell align="right">Avg Latency (ms)</TableCell>
<TableCell align="right">Error Rate (%)</TableCell>
</TableRow>
</TableHead>
<TableBody>
{endpointMetrics.map((endpoint) => (
<TableRow key={`${endpoint.method}-${endpoint.path}`}>
<TableCell>{endpoint.path}</TableCell>
<TableCell>
<Chip
label={endpoint.method}
color={
endpoint.method === "GET" ? "primary" :
endpoint.method === "POST" ? "secondary" :
"default"
}
size="small"
/>
</TableCell>
<TableCell align="right">{endpoint.requestsTotal.toLocaleString()}</TableCell>
<TableCell align="right">{endpoint.requestsBlocked.toLocaleString()}</TableCell>
<TableCell align="right">{endpoint.rateLimitExceeded.toLocaleString()}</TableCell>
<TableCell align="right">{endpoint.averageLatency.toFixed(1)}</TableCell>
<TableCell align="right">{endpoint.errorRate.toFixed(1)}%</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</TableContainer>
</Grid>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">Blocked Requests by Endpoint</Typography>
<ResponsiveContainer width="100%" height={300}>
<BarChart data={endpointMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="path" />
<YAxis />
<Tooltip />
<Legend />
<Bar dataKey="requestsBlocked" name="Blocked Requests" fill="#ff8042" />
</BarChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">Rate Limit Exceeded by Endpoint</Typography>
<ResponsiveContainer width="100%" height={300}>
<BarChart data={endpointMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="path" />
<YAxis />
<Tooltip />
<Legend />
<Bar dataKey="rateLimitExceeded" name="Rate Limit Exceeded" fill="#8884d8" />
</BarChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
</Grid>
);
const renderSecurityTab = () => (
<Grid container spacing={3}>
<Grid item xs={12}>
<Typography variant="h6">Blocked Clients</Typography>
<TableContainer component={Paper}>
<Table>
<TableHead>
<TableRow>
<TableCell>Client ID</TableCell>
<TableCell>Blocked Since</TableCell>
<TableCell>Blocked Until</TableCell>
<TableCell>Reason</TableCell>
<TableCell align="right">Violation Count</TableCell>
<TableCell>Affected Endpoints</TableCell>
</TableRow>
</TableHead>
<TableBody>
{blockedClients.map((client) => (
<TableRow key={client.clientId}>
<TableCell>{client.clientId}</TableCell>
<TableCell>{new Date(client.blockedSince).toLocaleString()}</TableCell>
<TableCell>{new Date(client.blockedUntil).toLocaleString()}</TableCell>
<TableCell>{client.reason}</TableCell>
<TableCell align="right">{client.violationCount}</TableCell>
<TableCell>
{client.endpoints.map((endpoint) => (
<Chip
key={endpoint}
label={endpoint}
size="small"
style={{ margin: 2 }}
/>
))}
</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</TableContainer>
</Grid>
<Grid item xs={12} style={{ marginTop: 20 }}>
<Typography variant="h6">Security Events</Typography>
<TableContainer component={Paper}>
<Table>
<TableHead>
<TableRow>
<TableCell>Timestamp</TableCell>
<TableCell>Event Type</TableCell>
<TableCell>Client ID</TableCell>
<TableCell>Endpoint</TableCell>
<TableCell>Method</TableCell>
<TableCell>Description</TableCell>
<TableCell>Severity</TableCell>
</TableRow>
</TableHead>
<TableBody>
{securityEvents.map((event, index) => (
<TableRow key={index}>
<TableCell>{new Date(event.timestamp).toLocaleString()}</TableCell>
<TableCell>{event.eventType}</TableCell>
<TableCell>{event.clientId}</TableCell>
<TableCell>{event.path}</TableCell>
<TableCell>
<Chip
label={event.method}
color={
event.method === "GET" ? "primary" :
event.method === "POST" ? "secondary" :
"default"
}
size="small"
/>
</TableCell>
<TableCell>{event.description}</TableCell>
<TableCell>
<Chip
label={event.severity}
color={
event.severity === "low" ? "default" :
event.severity === "medium" ? "primary" :
event.severity === "high" ? "secondary" :
"error"
}
size="small"
/>
</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</TableContainer>
</Grid>
</Grid>
);
return (
<div>
<Typography variant="h4" gutterBottom>
API Security Dashboard
</Typography>
<Grid container spacing={3} style={{ marginBottom: 20 }}>
<Grid item xs={12} md={3}>
<FormControl fullWidth>
<InputLabel>Time Range</InputLabel>
<Select value={timeRange} onChange={handleTimeRangeChange}>
<MenuItem value="1h">Last Hour</MenuItem>
<MenuItem value="6h">Last 6 Hours</MenuItem>
<MenuItem value="24h">Last 24 Hours</MenuItem>
<MenuItem value="7d">Last 7 Days</MenuItem>
<MenuItem value="30d">Last 30 Days</MenuItem>
<MenuItem value="custom">Custom Range</MenuItem>
</Select>
</FormControl>
</Grid>
{timeRange === 'custom' && (
<>
<Grid item xs={12} md={3}>
<DatePicker
label="Start Date"
value={startDate}
onChange={setStartDate}
renderInput={(props) => <TextField {...props} fullWidth />}
/>
</Grid>
<Grid item xs={12} md={3}>
<DatePicker
label="End Date"
value={endDate}
onChange={setEndDate}
renderInput={(props) => <TextField {...props} fullWidth />}
/>
</Grid>
</>
)}
<Grid item xs={12} md={3}>
<Button variant="contained" color="primary" fullWidth onClick={fetchMetrics}>
Refresh Data
</Button>
</Grid>
</Grid>
<Tabs value={tabValue} onChange={handleTabChange} aria-label="api security tabs">
<Tab label="Overview" />
<Tab label="Endpoints" />
<Tab label="Security" />
</Tabs>
<Box mt={3}>
{tabValue === 0 && renderOverviewTab()}
{tabValue === 1 && renderEndpointsTab()}
{tabValue === 2 && renderSecurityTab()}
</Box>
</div>
);
};
export default ApiSecurityDashboard;
Lessons Learned:
Effective API security requires a multi-layered approach beyond simple rate limiting.
How to Avoid:
Implement rate limiting based on multiple client identifiers.
Use distributed rate limiting with a shared data store.
Configure different rate limits for different endpoints based on sensitivity.
Implement request validation to prevent malformed requests.
Monitor for distributed attack patterns across multiple clients.
No summary provided
What Happened:
During a marketing campaign, the company's backend services experienced severe performance degradation despite having rate limiting configured in the API gateway. The incident began when response times for all API endpoints increased dramatically, eventually leading to 503 errors for many users. The operations team initially suspected a DDoS attack but later discovered that legitimate traffic from specific clients was overwhelming the backend services.
Diagnosis Steps:
Analyzed API gateway logs to identify traffic patterns.
Examined rate limiting configurations across all routes.
Reviewed client request headers and authentication methods.
Monitored backend service resource utilization.
Tested rate limiting with different client configurations.
Root Cause:
The investigation revealed multiple issues:
1. Rate limiting was configured based only on client IP addresses
2. Many users were accessing the API through corporate proxies, appearing as a single IP
3. The API gateway was not configured to use API keys or tokens for rate limiting
4. Some internal services were whitelisted from rate limiting entirely
5. The rate limiting plugin configuration had inconsistencies across different routes
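The emergency fix below keys limits on the client's credential first and only falls back to the source IP, so users behind one corporate proxy no longer share a single bucket. A minimal sketch of that key derivation (the header name and key scheme are illustrative, not the gateway's configuration):
// limit_key.go - sketch of credential-first rate-limit key derivation
package main

import (
	"fmt"
	"net"
	"net/http"
)

func rateLimitKey(r *http.Request) string {
	if key := r.Header.Get("apikey"); key != "" {
		return "key:" + key // authenticated clients get their own bucket
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		host = r.RemoteAddr
	}
	return "ip:" + host // anonymous traffic is still limited per source IP
}

func main() {
	a, _ := http.NewRequest("GET", "/api/v1/users", nil)
	a.RemoteAddr = "198.51.100.10:55000" // corporate proxy address
	a.Header.Set("apikey", "mobile-app-key")

	b, _ := http.NewRequest("GET", "/api/v1/users", nil)
	b.RemoteAddr = "198.51.100.10:55001" // same proxy, different user, no key

	fmt.Println(rateLimitKey(a)) // key:mobile-app-key
	fmt.Println(rateLimitKey(b)) // ip:198.51.100.10
}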
Fix/Workaround:
• Short-term: Implemented emergency rate limiting based on both IP and authentication tokens
• Adjusted backend service scaling parameters to handle the increased load
• Created a comprehensive Kong API Gateway configuration with proper rate limiting:
# kong.yaml - Proper rate limiting configuration
_format_version: "2.1"
_transform: true
services:
  - name: user-service
    url: http://user-service.internal:8080
    plugins:
      - name: rate-limiting
        config:
          minute: 60
          hour: 1000
          day: 10000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
          identifier: consumer
          sync_rate: -1
          namespace: user-service
    routes:
      - name: user-api
        paths:
          - /api/v1/users
        strip_path: false
        preserve_host: true
        protocols:
          - http
          - https
  - name: product-service
    url: http://product-service.internal:8080
    plugins:
      - name: rate-limiting
        config:
          minute: 120
          hour: 2000
          day: 20000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
          identifier: consumer
          sync_rate: -1
          namespace: product-service
    routes:
      - name: product-api
        paths:
          - /api/v1/products
        strip_path: false
        preserve_host: true
        protocols:
          - http
          - https
consumers:
  - username: mobile-app
    custom_id: mobile-app-client
    plugins:
      - name: rate-limiting
        config:
          minute: 30
          hour: 500
          day: 5000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
  - username: web-app
    custom_id: web-app-client
    plugins:
      - name: rate-limiting
        config:
          minute: 60
          hour: 1000
          day: 10000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
  - username: partner-api
    custom_id: partner-api-client
    plugins:
      - name: rate-limiting
        config:
          minute: 120
          hour: 2000
          day: 20000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
  - username: internal-service
    custom_id: internal-service-client
    plugins:
      - name: rate-limiting
        config:
          minute: 600
          hour: 10000
          day: 100000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
plugins:
  - name: key-auth
    config:
      key_names:
        - apikey
      hide_credentials: true
  - name: cors
    config:
      origins:
        - "*"
      methods:
        - GET
        - POST
        - PUT
        - DELETE
        - OPTIONS
      headers:
        - Accept
        - Accept-Version
        - Content-Length
        - Content-MD5
        - Content-Type
        - Date
        - X-Auth-Token
      exposed_headers:
        - X-Auth-Token
      credentials: true
      max_age: 3600
      preflight_continue: false
  - name: prometheus
    config:
      status_code_metrics: true
      latency_metrics: true
      upstream_health_metrics: true
      bandwidth_metrics: true
  - name: request-transformer
    config:
      add:
        headers:
          - X-Request-ID:$(uuid)
• Implemented a custom rate limiting plugin in Lua for more advanced scenarios:
-- advanced-rate-limiting.lua
local redis = require "resty.redis"
local cjson = require "cjson"
local timestamp = require "kong.tools.timestamp"
local kong = kong
local AdvancedRateLimiting = {}
AdvancedRateLimiting.PRIORITY = 901
AdvancedRateLimiting.VERSION = "1.0.0"
local EMPTY = {}
local EXPIRATIONS = {
second = 1,
minute = 60,
hour = 3600,
day = 86400,
month = 2592000,
year = 31536000,
}
local function get_identifier(conf)
local identifier
-- Use consumer id if available
if conf.identifier == "consumer" then
identifier = (kong.client.get_consumer() or EMPTY).id
if not identifier and conf.fallback_to_ip then
identifier = kong.client.get_forwarded_ip()
end
-- Use credential id if available
elseif conf.identifier == "credential" then
local credential = kong.client.get_credential()
identifier = credential and credential.id
if not identifier and conf.fallback_to_ip then
identifier = kong.client.get_forwarded_ip()
end
-- Use custom header if specified
elseif conf.identifier == "header" then
identifier = kong.request.get_header(conf.header_name)
if not identifier and conf.fallback_to_ip then
identifier = kong.client.get_forwarded_ip()
end
-- Default to IP address
else
identifier = kong.client.get_forwarded_ip()
end
return identifier
end
local function get_usage(conf, identifier, current_timestamp, limits)
local usage = {}
local stop_on_error = conf.fault_tolerant ~= true
-- Connect to Redis
local red = redis:new()
red:set_timeout(conf.redis_timeout)
local ok, err = red:connect(conf.redis_host, conf.redis_port)
if not ok then
kong.log.err("failed to connect to Redis: ", err)
return nil, err
end
if conf.redis_password and conf.redis_password ~= "" then
local ok, err = red:auth(conf.redis_password)
if not ok then
kong.log.err("failed to authenticate with Redis: ", err)
return nil, err
end
end
if conf.redis_database ~= 0 then
local ok, err = red:select(conf.redis_database)
if not ok then
kong.log.err("failed to select Redis database: ", err)
return nil, err
end
end
-- Check each limit
for period, limit in pairs(limits) do
local cache_key = "ratelimit:" .. identifier .. ":" .. period .. ":" .. conf.namespace
local current_usage, err = red:get(cache_key)
if err then
kong.log.err("failed to get current usage: ", err)
if stop_on_error then
return nil, err
end
usage[period] = {limit = limit, remaining = 0}
end
-- If no usage found, initialize it
if not current_usage then
current_usage = 0
end
-- Calculate remaining
local remaining = math.max(0, limit - tonumber(current_usage))
-- Add to usage table
usage[period] = {
limit = limit,
remaining = remaining,
usage = tonumber(current_usage),
}
end
-- Put Redis connection back to pool
local ok, err = red:set_keepalive(10000, 100)
if not ok then
kong.log.err("failed to set Redis keepalive: ", err)
end
return usage
end
local function increment_usage(conf, identifier, current_timestamp, limits, delta)
local stop_on_error = conf.fault_tolerant ~= true
-- Connect to Redis
local red = redis:new()
red:set_timeout(conf.redis_timeout)
local ok, err = red:connect(conf.redis_host, conf.redis_port)
if not ok then
kong.log.err("failed to connect to Redis: ", err)
return nil, err
end
if conf.redis_password and conf.redis_password ~= "" then
local ok, err = red:auth(conf.redis_password)
if not ok then
kong.log.err("failed to authenticate with Redis: ", err)
return nil, err
end
end
if conf.redis_database ~= 0 then
local ok, err = red:select(conf.redis_database)
if not ok then
kong.log.err("failed to select Redis database: ", err)
return nil, err
end
end
-- Start Redis pipeline
red:init_pipeline()
-- Increment each limit
for period, limit in pairs(limits) do
local cache_key = "ratelimit:" .. identifier .. ":" .. period .. ":" .. conf.namespace
local expiration = EXPIRATIONS[period]
red:incrby(cache_key, delta)
red:expire(cache_key, expiration)
end
-- Execute pipeline
local _, err = red:commit_pipeline()
if err then
kong.log.err("failed to commit Redis pipeline: ", err)
if stop_on_error then
return nil, err
end
end
-- Put Redis connection back to pool
local ok, err = red:set_keepalive(10000, 100)
if not ok then
kong.log.err("failed to set Redis keepalive: ", err)
end
return true
end
function AdvanacedRateLimiting:access(conf)
-- Get current timestamp
local current_timestamp = timestamp.get_utc()
-- Get identifier based on configuration
local identifier = get_identifier(conf)
if not identifier then
kong.log.err("cannot identify the client, rate limiting skipped")
return
end
-- Get request path and method
local path = kong.request.get_path()
local method = kong.request.get_method()
-- Get request size
local request_size = tonumber(kong.request.get_header("content-length")) or 0
-- Calculate rate limiting weight based on request size
local weight = 1
if conf.weight_by_size and request_size > 0 then
weight = math.ceil(request_size / 1024) -- 1 unit per KB
end
-- Apply path-specific limits if configured
local limits = {}
local path_matched = false
if conf.path_limits then
for _, path_limit in ipairs(conf.path_limits) do
if path:match(path_limit.path) and (path_limit.method == "*" or path_limit.method == method) then
limits = path_limit.limits
path_matched = true
break
end
end
end
-- Fall back to default limits if no path match
if not path_matched then
limits = {
second = conf.second,
minute = conf.minute,
hour = conf.hour,
day = conf.day,
month = conf.month,
year = conf.year,
}
end
-- Remove empty limits
for k, v in pairs(limits) do
if not v or v == 0 then
limits[k] = nil
end
end
-- Check if any limits are defined
if not next(limits) then
return
end
-- Get current usage
local usage, err = get_usage(conf, identifier, current_timestamp, limits)
if err then
if conf.fault_tolerant then
kong.log.err("error getting usage: ", err)
return
else
return kong.response.error(500, "Internal Server Error")
end
end
-- Check if any limit is exceeded
local stop_now = false
for period, limit in pairs(limits) do
if usage[period].remaining <= 0 then
stop_now = true
break
end
end
-- If limit exceeded, return 429
if stop_now then
-- Add headers
if not conf.hide_client_headers then
for period, limit in pairs(limits) do
kong.response.set_header("X-RateLimit-Limit-" .. period, limit)
kong.response.set_header("X-RateLimit-Remaining-" .. period, usage[period].remaining)
end
if conf.retry_after_jitter_max > 0 then
local retry_after = math.random(1, conf.retry_after_jitter_max)
kong.response.set_header("Retry-After", retry_after)
end
end
return kong.response.error(429, "API rate limit exceeded")
end
-- Increment usage
local ok, err = increment_usage(conf, identifier, current_timestamp, limits, weight)
if not ok then
if conf.fault_tolerant then
kong.log.err("error incrementing usage: ", err)
else
return kong.response.error(500, "Internal Server Error")
end
end
-- Add headers
if not conf.hide_client_headers then
for period, limit in pairs(limits) do
kong.response.set_header("X-RateLimit-Limit-" .. period, limit)
kong.response.set_header("X-RateLimit-Remaining-" .. period, math.max(0, usage[period].remaining - weight))
end
end
end
return AdvanacedRateLimiting
• Long-term: Implemented a comprehensive API management strategy:
- Created a multi-layer rate limiting approach (global, service, route, consumer)
- Implemented token-based authentication for all API clients
- Deployed distributed rate limiting with Redis cluster
- Added circuit breakers to prevent cascading failures (see the sketch after this list)
- Implemented real-time monitoring and alerting for API traffic patterns
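The circuit-breaker item above can be illustrated with a small sketch. This is a minimal counting breaker wrapped around an HTTP client, not the gateway plugin that was actually deployed; the thresholds, the backend URL, and the 5xx-counts-as-failure rule are illustrative assumptions.
```go
// circuitbreaker.go - minimal sketch of a counting circuit breaker for calls
// to an upstream service. Thresholds and the backend URL are illustrative.
package main

import (
    "errors"
    "fmt"
    "net/http"
    "sync"
    "time"
)

var ErrCircuitOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
    mu          sync.Mutex
    failures    int           // consecutive failures seen so far
    maxFailures int           // failures needed to open the circuit
    openUntil   time.Time     // requests are rejected until this time
    cooldown    time.Duration // how long the circuit stays open
}

func NewCircuitBreaker(maxFailures int, cooldown time.Duration) *CircuitBreaker {
    return &CircuitBreaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Do forwards the request unless the circuit is open, counting transport
// errors and 5xx responses as failures.
func (cb *CircuitBreaker) Do(client *http.Client, req *http.Request) (*http.Response, error) {
    cb.mu.Lock()
    if time.Now().Before(cb.openUntil) {
        cb.mu.Unlock()
        return nil, ErrCircuitOpen
    }
    cb.mu.Unlock()

    resp, err := client.Do(req)

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil || resp.StatusCode >= 500 {
        cb.failures++
        if cb.failures >= cb.maxFailures {
            cb.openUntil = time.Now().Add(cb.cooldown)
            cb.failures = 0
        }
    } else {
        cb.failures = 0
    }
    return resp, err
}

func main() {
    cb := NewCircuitBreaker(5, 30*time.Second)
    req, _ := http.NewRequest(http.MethodGet, "http://backend.internal/health", nil) // placeholder URL
    resp, err := cb.Do(http.DefaultClient, req)
    if err != nil {
        fmt.Println("request failed or circuit open:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("upstream status:", resp.Status)
}
```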
Lessons Learned:
Effective API rate limiting requires a multi-dimensional approach beyond simple IP-based throttling.
How to Avoid:
Implement rate limiting based on multiple identifiers (IP, token, consumer).
Use distributed rate limiting with proper storage backends.
Test rate limiting with realistic traffic patterns including proxy scenarios.
Monitor and alert on unusual traffic patterns before they cause issues.
Implement circuit breakers to protect backend services.
No summary provided
What Happened:
A company launched a major marketing campaign that drove significant traffic to their APIs. Despite capacity planning for the increased load, several critical services became unresponsive. Users reported timeouts and error responses, while backend services showed minimal resource utilization. The issue persisted despite scaling up backend services, suggesting a bottleneck elsewhere in the system.
Diagnosis Steps:
Analyzed API gateway logs and metrics during the incident.
Reviewed recent configuration changes to the API gateway.
Examined rate limiting policies across different services.
Tested API calls with different authentication credentials.
Compared gateway configuration across environments.
Root Cause:
The investigation revealed multiple issues with the API gateway configuration: 1. Global rate limiting was configured too aggressively at 100 requests per minute per IP 2. Rate limiting was applied at the wrong level (IP-based instead of token-based) 3. The rate limiting plugin was configured with a "redis" policy but the Redis cluster was undersized 4. Marketing campaign traffic was not exempted from rate limiting 5. Rate limiting headers were not being returned to clients, preventing proper backoff
Fix/Workaround:
• Short-term: Implemented immediate configuration fixes in Kong:
# Before: Problematic Kong rate limiting configuration
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
name: global-rate-limiting
namespace: api-gateway
config:
minute: 100
limit_by: ip
policy: redis
redis_host: redis-master
redis_port: 6379
redis_timeout: 2000
redis_database: 0
hide_client_headers: true
plugin: rate-limiting
# After: Improved Kong rate limiting configuration
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
name: global-rate-limiting
namespace: api-gateway
config:
minute: 300
limit_by: credential
policy: redis
redis_host: redis-master
redis_port: 6379
redis_timeout: 2000
redis_database: 0
hide_client_headers: false
plugin: rate-limiting
---
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
name: marketing-rate-limiting
namespace: api-gateway
config:
minute: 1000
limit_by: credential
policy: redis
redis_host: redis-master
redis_port: 6379
redis_timeout: 2000
redis_database: 0
hide_client_headers: false
plugin: rate-limiting
• Implemented service-specific rate limiting with proper consumer segmentation:
# Service-specific rate limiting
apiVersion: configuration.konghq.com/v1
kind: KongIngress
metadata:
name: payment-service-config
config:
plugins:
- name: payment-rate-limiting
config:
minute: 200
limit_by: credential
policy: redis
---
apiVersion: configuration.konghq.com/v1
kind: KongConsumer
metadata:
name: marketing-api-consumer
annotations:
kubernetes.io/ingress.class: kong
username: marketing-api
credentials:
- marketing-api-key
---
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
name: consumer-specific-rate-limiting
config:
minute: 1000
limit_by: credential
policy: redis
redis_host: redis-master
redis_port: 6379
redis_timeout: 2000
redis_database: 0
hide_client_headers: false
plugin: rate-limiting
• Implemented a custom rate limiting plugin in Lua for advanced use cases:
-- custom-rate-limiting.lua
local redis = require "resty.redis"
local timestamp = require "kong.tools.timestamp"
local policy_cluster = require "kong.plugins.rate-limiting.policies.cluster"
local kong = kong
local ngx = ngx
local max = math.max
local floor = math.floor
local EMPTY = {}
local EXPIRATION = 60 * 60 -- 1 hour in seconds
local CustomRateLimiting = {}
CustomRateLimiting.PRIORITY = 901
CustomRateLimiting.VERSION = "1.0.0"
local function get_identifier(conf)
local identifier
if conf.limit_by == "credential" then
identifier = (kong.client.get_credential() or EMPTY).id
elseif conf.limit_by == "consumer" then
identifier = (kong.client.get_consumer() or EMPTY).id
elseif conf.limit_by == "ip" then
identifier = kong.client.get_forwarded_ip()
elseif conf.limit_by == "service" then
identifier = (kong.router.get_service() or EMPTY).id
elseif conf.limit_by == "header" then
identifier = kong.request.get_header(conf.header_name)
elseif conf.limit_by == "path" then
identifier = kong.request.get_path()
end
return identifier or kong.client.get_forwarded_ip()
end
local function get_usage(conf, identifier, current_timestamp, limits)
local usage = {}
local stop
-- Custom business logic for rate limiting
if conf.business_tier == "premium" then
-- Premium tier gets higher limits
for k, v in pairs(limits) do
limits[k] = v * 2
end
elseif conf.business_tier == "marketing" then
-- Marketing campaigns get even higher limits
for k, v in pairs(limits) do
limits[k] = v * 5
end
end
-- Use the policy defined in the configuration
if conf.policy == "redis" then
local red = redis:new()
red:set_timeout(conf.redis_timeout)
local ok, err = red:connect(conf.redis_host, conf.redis_port)
if not ok then
kong.log.err("failed to connect to Redis: ", err)
return nil, nil, err
end
if conf.redis_password and conf.redis_password ~= "" then
local ok, err = red:auth(conf.redis_password)
if not ok then
kong.log.err("failed to authenticate with Redis: ", err)
return nil, nil, err
end
end
if conf.redis_database ~= 0 then
local ok, err = red:select(conf.redis_database)
if not ok then
kong.log.err("failed to change Redis database: ", err)
return nil, nil, err
end
end
local keys = {}
for period, limit in pairs(limits) do
table.insert(keys, "ratelimit:" .. identifier .. ":" .. period .. ":" .. conf.service_id)
end
red:init_pipeline()
for _, key in ipairs(keys) do
red:get(key)
end
local counts, err = red:commit_pipeline()
if not counts then
kong.log.err("failed to get counts from Redis: ", err)
return nil, nil, err
end
local periods = {}
for period in pairs(limits) do
table.insert(periods, period)
end
for i, count in ipairs(counts) do
local period = periods[i]
if count == ngx.null then
count = 0
end
usage[period] = tonumber(count)
if usage[period] and limits[period] and usage[period] >= limits[period] then
stop = true
end
end
-- Add current request to counts
red:init_pipeline()
for period, limit in pairs(limits) do
local key = "ratelimit:" .. identifier .. ":" .. period .. ":" .. conf.service_id
red:incr(key)
red:expire(key, EXPIRATION)
end
local _, err = red:commit_pipeline()
if err then
kong.log.err("failed to increment counts in Redis: ", err)
return nil, nil, err
end
local ok, err = red:set_keepalive(10000, 100)
if not ok then
kong.log.err("failed to set Redis keepalive: ", err)
end
else
-- Fall back to local policy
return policy_cluster.usage(conf, identifier, current_timestamp, limits)
end
return usage, stop
end
function CustomRateLimiting:access(conf)
local current_timestamp = timestamp.get_utc()
-- Get the identification of the consumer
local identifier = get_identifier(conf)
if not identifier then
kong.log.err("cannot identify the consumer, rate limiting skipped")
return
end
-- Load and parse consumer metadata for custom limits
local consumer = kong.client.get_consumer()
local custom_limits = {}
if consumer then
-- Guard against a missing consumer row before reading its metadata
local row = kong.db.consumers:select({ id = consumer.id })
local metadata = row and row.meta
if metadata and metadata.custom_rate_limits then
for k, v in pairs(metadata.custom_rate_limits) do
custom_limits[k] = v
end
end
end
-- Build the limits table based on conf
local limits = {}
if conf.second and conf.second > 0 then
limits.second = custom_limits.second or conf.second
end
if conf.minute and conf.minute > 0 then
limits.minute = custom_limits.minute or conf.minute
end
if conf.hour and conf.hour > 0 then
limits.hour = custom_limits.hour or conf.hour
end
if conf.day and conf.day > 0 then
limits.day = custom_limits.day or conf.day
end
if conf.month and conf.month > 0 then
limits.month = custom_limits.month or conf.month
end
if conf.year and conf.year > 0 then
limits.year = custom_limits.year or conf.year
end
-- Check if any of the limits is set
if not next(limits) then
kong.log.err("no limit is specified, rate limiting skipped")
return
end
-- Get the usage of the consumer
local usage, stop, err = get_usage(conf, identifier, current_timestamp, limits)
if err then
kong.log.err("failed to get usage: ", err)
return
end
-- If the consumer exceeded any of the limits, reject the request
if stop then
return kong.response.exit(429, { message = "API rate limit exceeded" })
end
-- Append the X-RateLimit-* headers if not disabled
if not conf.hide_client_headers then
for k, v in pairs(usage) do
kong.response.set_header("X-RateLimit-" .. k .. "-Limit", limits[k])
kong.response.set_header("X-RateLimit-" .. k .. "-Remaining", math.max(0, limits[k] - usage[k]))
end
end
kong.ctx.plugin.rate_limit = {
limit = limits,
usage = usage,
}
end
return CustomRateLimiting
• Long-term: Implemented a comprehensive API gateway management strategy:
- Created a centralized rate limiting configuration management system
- Implemented dynamic rate limiting based on service health metrics (see the sketch after this list)
- Developed a rate limiting testing framework
- Established clear procedures for rate limiting policy changes
- Implemented monitoring and alerting for rate limiting events
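A minimal sketch of the dynamic rate limiting idea, assuming the upstream error rate is already available from gateway or Prometheus metrics; the thresholds and scaling factors are illustrative, not the values used in production.
```go
// dynamiclimit.go - sketch of scaling a configured rate limit by the observed
// upstream error rate. In practice the error rate would come from gateway or
// Prometheus metrics; the thresholds here are illustrative.
package main

import "fmt"

// effectiveLimit shrinks the per-minute limit as the upstream error rate
// climbs, so the gateway sheds load before the backend collapses.
func effectiveLimit(baseLimit int, errorRate float64) int {
    switch {
    case errorRate >= 0.50: // upstream clearly unhealthy: clamp hard
        return baseLimit / 10
    case errorRate >= 0.10: // degraded: halve the limit
        return baseLimit / 2
    default: // healthy: use the configured limit
        return baseLimit
    }
}

func main() {
    for _, rate := range []float64{0.01, 0.15, 0.60} {
        fmt.Printf("error rate %.0f%% -> effective limit %d req/min\n", rate*100, effectiveLimit(300, rate))
    }
}
```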
Lessons Learned:
API gateway rate limiting requires careful configuration to balance protection and availability.
How to Avoid:
Implement rate limiting based on authentication tokens, not IP addresses.
Configure appropriate limits based on service capacity and user tiers.
Test rate limiting policies under load before deployment.
Return rate limiting headers to clients for proper backoff implementation (see the client-side sketch below).
Monitor rate limiting metrics and adjust policies as needed.
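To illustrate the client-side half of this advice, here is a hedged sketch of a caller that honors the Retry-After and X-RateLimit-Remaining-minute headers returned by the gateway configuration above; the endpoint URL and retry budget are placeholders.
```go
// backoff_client.go - sketch of a client that backs off using the Retry-After
// and X-RateLimit-Remaining-minute headers returned by the gateway.
// The endpoint URL and retry budget are placeholders.
package main

import (
    "fmt"
    "net/http"
    "strconv"
    "time"
)

// doWithBackoff retries on 429 responses, sleeping for the server-suggested
// interval when one is provided and falling back to exponential backoff.
func doWithBackoff(client *http.Client, url string, maxRetries int) (*http.Response, error) {
    for attempt := 0; ; attempt++ {
        resp, err := client.Get(url)
        if err != nil {
            return nil, err
        }
        if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
            return resp, nil
        }
        wait := time.Duration(1<<uint(attempt)) * time.Second
        if s := resp.Header.Get("Retry-After"); s != "" {
            if secs, convErr := strconv.Atoi(s); convErr == nil {
                wait = time.Duration(secs) * time.Second
            }
        }
        fmt.Printf("rate limited (remaining=%s), retrying in %s\n",
            resp.Header.Get("X-RateLimit-Remaining-minute"), wait)
        resp.Body.Close()
        time.Sleep(wait)
    }
}

func main() {
    resp, err := doWithBackoff(http.DefaultClient, "https://api.example.com/v1/orders", 5) // placeholder endpoint
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("final status:", resp.Status)
}
```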
No summary provided
What Happened:
During a scheduled maintenance window, the operations team initiated an automated certificate rotation for the Istio service mesh. Shortly after the rotation began, services started experiencing connection failures. Within minutes, the failure cascaded across the entire mesh, resulting in a complete production outage. The incident affected all services using mutual TLS for communication, which included critical business applications.
Diagnosis Steps:
Analyzed Istio control plane logs to understand the certificate rotation process.
Examined Envoy proxy logs from affected workloads.
Reviewed certificate issuance and distribution metrics.
Checked Kubernetes events and pod status across the cluster.
Monitored network traffic patterns between services.
Root Cause:
The investigation revealed multiple issues with the certificate rotation process: 1. The certificate rotation was triggered while some nodes were undergoing maintenance 2. The Istio control plane had insufficient resources to handle the certificate generation load 3. A race condition in the certificate distribution process caused some workloads to receive incomplete certificate chains 4. The certificate validation in Envoy proxies was too strict, rejecting certificates with minor issues 5. There was no graceful fallback mechanism when certificate validation failed
Fix/Workaround:
• Short-term: Implemented immediate fixes to restore service:
# Rollback to previous certificates
kubectl rollout restart deployment istiod -n istio-system
# Force reload of proxies with previous certificates
kubectl get pods --all-namespaces --no-headers -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.containerStatuses[*].ready | grep -v "true" | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n
• Created a more robust certificate rotation script:
#!/bin/bash
# safe_cert_rotation.sh - Safely rotate Istio certificates with validation
set -e
# Configuration
NAMESPACE="istio-system"
ISTIOD_DEPLOYMENT="istiod"
CERT_VALIDITY_DAYS=30
MAX_UNAVAILABLE_PERCENT=10
ROTATION_TIMEOUT=1800 # 30 minutes
VALIDATION_INTERVAL=10
ROLLBACK_ON_FAILURE=true
# Check prerequisites
if ! command -v kubectl &> /dev/null; then
echo "kubectl not found. Please install kubectl."
exit 1
fi
if ! command -v jq &> /dev/null; then
echo "jq not found. Please install jq."
exit 1
fi
# Verify cluster connectivity
echo "Verifying cluster connectivity..."
kubectl get nodes &> /dev/null || { echo "Cannot connect to Kubernetes cluster"; exit 1; }
# Check Istio control plane health
echo "Checking Istio control plane health..."
ISTIOD_READY=$(kubectl get deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE -o jsonpath='{.status.readyReplicas}')
ISTIOD_TOTAL=$(kubectl get deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE -o jsonpath='{.status.replicas}')
if [ "$ISTIOD_READY" != "$ISTIOD_TOTAL" ]; then
echo "Warning: Istio control plane is not fully ready ($ISTIOD_READY/$ISTIOD_TOTAL replicas ready)"
read -p "Continue anyway? (y/n) " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
exit 1
fi
fi
# Check for ongoing node maintenance
NODES_NOT_READY=$(kubectl get nodes -o jsonpath='{.items[?(@.status.conditions[?(@.type=="Ready")].status!="True")].metadata.name}')
if [ ! -z "$NODES_NOT_READY" ]; then
echo "Warning: Some nodes are not ready: $NODES_NOT_READY"
read -p "Continue anyway? (y/n) " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
exit 1
fi
fi
# Backup current certificates
echo "Backing up current certificates..."
BACKUP_DIR="istio-certs-backup-$(date +%Y%m%d-%H%M%S)"
mkdir -p $BACKUP_DIR
kubectl get secret -n $NAMESPACE -l istio.io/cert-management=true -o json > "$BACKUP_DIR/cert-secrets.json"
kubectl get configmap -n $NAMESPACE -l istio.io/cert-management=true -o json > "$BACKUP_DIR/cert-configmaps.json"
echo "Certificates backed up to $BACKUP_DIR"
# Scale up Istio control plane for rotation
echo "Scaling up Istio control plane for certificate rotation..."
ORIGINAL_REPLICAS=$ISTIOD_READY
ROTATION_REPLICAS=$((ORIGINAL_REPLICAS + 2))
kubectl scale deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --replicas=$ROTATION_REPLICAS
echo "Waiting for control plane scale up..."
kubectl rollout status deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --timeout=300s
# Start certificate rotation
echo "Initiating certificate rotation..."
kubectl delete secret cacerts -n $NAMESPACE || true
# Generate new root certificate with longer validity
cat > ca.conf << EOF
[ req ]
default_bits = 4096
prompt = no
default_md = sha256
req_extensions = req_ext
distinguished_name = dn
[ dn ]
O = Example Organization
CN = Example Root CA
[ req_ext ]
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = istiod.istio-system.svc
[ v3_ca ]
basicConstraints = critical, CA:TRUE
keyUsage = critical, digitalSignature, keyEncipherment, keyCertSign
EOF
openssl genrsa -out root-key.pem 4096
openssl req -new -key root-key.pem -config ca.conf -out root-cert.csr
openssl x509 -req -days $CERT_VALIDITY_DAYS -in root-cert.csr -signkey root-key.pem -out root-cert.pem -extensions v3_ca -extfile ca.conf
# Generate intermediate certificates
openssl genrsa -out ca-key.pem 4096
openssl req -new -key ca-key.pem -out ca-cert.csr -config ca.conf
openssl x509 -req -days $CERT_VALIDITY_DAYS -in ca-cert.csr -CA root-cert.pem -CAkey root-key.pem -CAcreateserial -out ca-cert.pem -extensions v3_ca -extfile ca.conf
# Create chain certificate
cat ca-cert.pem root-cert.pem > cert-chain.pem
# Create Kubernetes secret
kubectl create secret generic cacerts -n $NAMESPACE \
--from-file=ca-cert.pem \
--from-file=ca-key.pem \
--from-file=root-cert.pem \
--from-file=cert-chain.pem
# Restart Istio control plane to pick up new certificates
echo "Restarting Istio control plane with new certificates..."
kubectl rollout restart deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE
kubectl rollout status deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --timeout=300s
# Monitor certificate distribution
echo "Monitoring certificate distribution..."
start_time=$(date +%s)
end_time=$((start_time + ROTATION_TIMEOUT))
success=false
while [ $(date +%s) -lt $end_time ]; do
# Check certificate distribution progress
total_pods=$(kubectl get pods --all-namespaces -l istio.io/rev -o json | jq '.items | length')
updated_pods=$(kubectl get pods --all-namespaces -l istio.io/rev -o json | jq '[.items[] | select(.metadata.annotations["istio.io/cert-update-status"] == "updated")] | length')
if [ "$total_pods" -eq 0 ]; then
echo "No Istio-injected pods found. Is Istio properly installed?"
break
fi
percent_complete=$((updated_pods * 100 / total_pods))
echo "Certificate rotation progress: $percent_complete% ($updated_pods/$total_pods pods updated)"
if [ "$percent_complete" -eq 100 ]; then
echo "Certificate rotation completed successfully!"
success=true
break
fi
# Check for failures
failed_pods=$(kubectl get pods --all-namespaces -o json | jq '[.items[] | select(.status.containerStatuses != null) | select(.status.containerStatuses[].ready == false)] | length')
failed_percent=$((failed_pods * 100 / total_pods))
if [ "$failed_percent" -gt "$MAX_UNAVAILABLE_PERCENT" ]; then
echo "Error: Too many pods are failing ($failed_percent% > $MAX_UNAVAILABLE_PERCENT%)"
if [ "$ROLLBACK_ON_FAILURE" = true ]; then
echo "Initiating rollback..."
break
fi
fi
sleep $VALIDATION_INTERVAL
done
# Validate service mesh health
echo "Validating service mesh health..."
if ! $success; then
echo "Certificate rotation did not complete within the timeout period or failed"
if [ "$ROLLBACK_ON_FAILURE" = true ]; then
echo "Rolling back to previous certificates..."
kubectl delete secret cacerts -n $NAMESPACE
kubectl create -f "$BACKUP_DIR/cert-secrets.json"
kubectl rollout restart deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE
kubectl rollout status deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --timeout=300s
echo "Rollback completed. Restoring original replica count..."
kubectl scale deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --replicas=$ORIGINAL_REPLICAS
exit 1
fi
fi
# Scale down Istio control plane to original size
echo "Scaling down Istio control plane to original size..."
kubectl scale deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --replicas=$ORIGINAL_REPLICAS
kubectl rollout status deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --timeout=300s
echo "Certificate rotation completed successfully"
exit 0
• Implemented a Go-based certificate validation tool:
// certvalidator/main.go
package main
import (
"context"
"crypto/x509"
"encoding/pem"
"flag"
"fmt"
"log"
"strings"
"sync"
"time"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/clientcmd"
)
type CertInfo struct {
Subject string
Issuer string
NotBefore time.Time
NotAfter time.Time
IsCA bool
DNSNames []string
KeyUsage x509.KeyUsage
ExtKeyUsage []x509.ExtKeyUsage
}
func main() {
var kubeconfig string
var namespace string
var allNamespaces bool
var verbose bool
var threshold int
flag.StringVar(&kubeconfig, "kubeconfig", "", "Path to kubeconfig file")
flag.StringVar(&namespace, "namespace", "istio-system", "Namespace to check")
flag.BoolVar(&allNamespaces, "all-namespaces", false, "Check all namespaces")
flag.BoolVar(&verbose, "verbose", false, "Verbose output")
flag.IntVar(&threshold, "threshold", 30, "Warning threshold for certificate expiration in days")
flag.Parse()
// Create Kubernetes client
var config *rest.Config
var err error
if kubeconfig == "" {
log.Println("Using in-cluster configuration")
config, err = rest.InClusterConfig()
} else {
log.Printf("Using configuration from %s", kubeconfig)
config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
}
if err != nil {
log.Fatalf("Error building kubeconfig: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Error creating Kubernetes client: %v", err)
}
// Get namespaces to check
var namespaces []string
if allNamespaces {
nsList, err := clientset.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Fatalf("Error listing namespaces: %v", err)
}
for _, ns := range nsList.Items {
namespaces = append(namespaces, ns.Name)
}
} else {
namespaces = []string{namespace}
}
// Check certificates in each namespace
var wg sync.WaitGroup
for _, ns := range namespaces {
wg.Add(1)
go func(namespace string) {
defer wg.Done()
checkNamespace(clientset, namespace, threshold, verbose)
}(ns)
}
wg.Wait()
}
func checkNamespace(clientset *kubernetes.Clientset, namespace string, threshold int, verbose bool) {
log.Printf("Checking certificates in namespace %s", namespace)
// Check secrets
secrets, err := clientset.CoreV1().Secrets(namespace).List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Error listing secrets in namespace %s: %v", namespace, err)
return
}
for _, secret := range secrets.Items {
// Skip non-TLS secrets
if !strings.Contains(string(secret.Type), "tls") && !strings.Contains(string(secret.Type), "TLS") {
continue
}
log.Printf("Checking secret %s/%s", namespace, secret.Name)
// Check each certificate in the secret
for key, data := range secret.Data {
if !strings.Contains(key, "crt") && !strings.Contains(key, "cert") && !strings.Contains(key, "ca.pem") {
continue
}
certInfo, err := parseCertificate(data)
if err != nil {
log.Printf("Error parsing certificate in %s/%s[%s]: %v", namespace, secret.Name, key, err)
continue
}
// Check certificate validity
now := time.Now()
if now.Before(certInfo.NotBefore) {
log.Printf("WARNING: Certificate in %s/%s[%s] is not yet valid (valid from %s)",
namespace, secret.Name, key, certInfo.NotBefore)
}
if now.After(certInfo.NotAfter) {
log.Printf("ERROR: Certificate in %s/%s[%s] has expired (valid until %s)",
namespace, secret.Name, key, certInfo.NotAfter)
}
daysUntilExpiration := int(certInfo.NotAfter.Sub(now).Hours() / 24)
if daysUntilExpiration < threshold {
log.Printf("WARNING: Certificate in %s/%s[%s] will expire in %d days (on %s)",
namespace, secret.Name, key, daysUntilExpiration, certInfo.NotAfter)
}
if verbose {
log.Printf("Certificate details for %s/%s[%s]:", namespace, secret.Name, key)
log.Printf(" Subject: %s", certInfo.Subject)
log.Printf(" Issuer: %s", certInfo.Issuer)
log.Printf(" Valid from: %s to %s", certInfo.NotBefore, certInfo.NotAfter)
log.Printf(" Is CA: %t", certInfo.IsCA)
log.Printf(" DNS names: %v", certInfo.DNSNames)
}
}
}
// Check configmaps for certificates
configmaps, err := clientset.CoreV1().ConfigMaps(namespace).List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Error listing configmaps in namespace %s: %v", namespace, err)
return
}
for _, configmap := range configmaps.Items {
for key, data := range configmap.Data {
if !strings.Contains(key, "crt") && !strings.Contains(key, "cert") && !strings.Contains(key, "ca.pem") {
continue
}
certInfo, err := parseCertificate([]byte(data))
if err != nil {
log.Printf("Error parsing certificate in configmap %s/%s[%s]: %v",
namespace, configmap.Name, key, err)
continue
}
// Check certificate validity
now := time.Now()
if now.Before(certInfo.NotBefore) {
log.Printf("WARNING: Certificate in configmap %s/%s[%s] is not yet valid (valid from %s)",
namespace, configmap.Name, key, certInfo.NotBefore)
}
if now.After(certInfo.NotAfter) {
log.Printf("ERROR: Certificate in configmap %s/%s[%s] has expired (valid until %s)",
namespace, configmap.Name, key, certInfo.NotAfter)
}
daysUntilExpiration := int(certInfo.NotAfter.Sub(now).Hours() / 24)
if daysUntilExpiration < threshold {
log.Printf("WARNING: Certificate in configmap %s/%s[%s] will expire in %d days (on %s)",
namespace, configmap.Name, key, daysUntilExpiration, certInfo.NotAfter)
}
if verbose {
log.Printf("Certificate details for configmap %s/%s[%s]:", namespace, configmap.Name, key)
log.Printf(" Subject: %s", certInfo.Subject)
log.Printf(" Issuer: %s", certInfo.Issuer)
log.Printf(" Valid from: %s to %s", certInfo.NotBefore, certInfo.NotAfter)
log.Printf(" Is CA: %t", certInfo.IsCA)
log.Printf(" DNS names: %v", certInfo.DNSNames)
}
}
}
}
func parseCertificate(data []byte) (*CertInfo, error) {
block, _ := pem.Decode(data)
if block == nil {
return nil, fmt.Errorf("failed to decode PEM block")
}
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
return nil, err
}
return &CertInfo{
Subject: cert.Subject.String(),
Issuer: cert.Issuer.String(),
NotBefore: cert.NotBefore,
NotAfter: cert.NotAfter,
IsCA: cert.IsCA,
DNSNames: cert.DNSNames,
KeyUsage: cert.KeyUsage,
ExtKeyUsage: cert.ExtKeyUsage,
}, nil
}
• Updated Istio configuration for more resilient certificate handling:
# istio-operator.yaml - Updated configuration for certificate handling
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
namespace: istio-system
name: istio-control-plane
spec:
profile: default
components:
pilot:
k8s:
resources:
requests:
cpu: 500m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
hpaSpec:
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
targetAverageUtilization: 80
ingressGateways:
- name: istio-ingressgateway
enabled: true
meshConfig:
defaultConfig:
proxyMetadata:
ISTIO_META_DNS_CAPTURE: "true"
ISTIO_META_DNS_AUTO_ALLOCATE: "true"
enablePrometheusMerge: true
enableTracing: true
accessLogFile: "/dev/stdout"
accessLogFormat: |
[%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%" %RESPONSE_CODE% %RESPONSE_FLAGS% %RESPONSE_CODE_DETAILS% %CONNECTION_TERMINATION_DETAILS% "%UPSTREAM_TRANSPORT_FAILURE_REASON%" %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% "%REQ(X-FORWARDED-FOR)%" "%REQ(USER-AGENT)%" "%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%" %UPSTREAM_CLUSTER% %UPSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_REMOTE_ADDRESS% %REQUESTED_SERVER_NAME% %ROUTE_NAME%
rootNamespace: istio-system
trustDomain: cluster.local
caCertificatesPersistenceEnabled: true
certificateRotationPeriod: 720h # 30 days
certificateRotationGracePeriod: 168h # 7 days
defaultServiceExportTo:
- "*"
defaultVirtualServiceExportTo:
- "*"
defaultDestinationRuleExportTo:
- "*"
values:
global:
proxy:
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 2000m
memory: 1024Mi
holdApplicationUntilProxyStarts: true
tracer:
zipkin:
address: zipkin.istio-system:9411
proxy_init:
resources:
limits:
cpu: 2000m
memory: 1024Mi
requests:
cpu: 10m
memory: 10Mi
logging:
level: "default:info"
pilotCertProvider: istiod
jwtPolicy: third-party-jwt
caAddress: istiod.istio-system.svc:15012
mountMtlsCerts: true
pilot:
env:
PILOT_CERT_PROVIDER: istiod
PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_OUTBOUND: "true"
PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_INBOUND: "true"
PILOT_ENABLE_CERTIFICATE_ROTATION_GRACE_PERIOD: "true"
PILOT_CERTIFICATE_ROTATION_GRACE_PERIOD_PERCENT: "20"
PILOT_ENABLE_CERTIFICATE_ROTATION_FAILURE_RECOVERY: "true"
PILOT_CERTIFICATE_ROTATION_FAILURE_RETRY_DELAY: "1m"
PILOT_CERTIFICATE_ROTATION_MAX_RETRIES: "10"
• Long-term: Implemented a comprehensive certificate management strategy:
- Created a certificate rotation runbook with pre-flight checks
- Implemented automated certificate monitoring with alerting
- Developed a certificate rotation testing framework
- Established clear incident response procedures for certificate issues
- Implemented certificate rotation simulation in chaos testing
Lessons Learned:
Certificate rotation in service meshes requires careful planning and robust fallback mechanisms.
How to Avoid:
Implement proper resource allocation for certificate management components.
Create a gradual certificate rotation strategy with validation at each step.
Test certificate rotation procedures in non-production environments.
Implement monitoring for certificate-related metrics and alerts.
Establish clear rollback procedures for certificate rotation failures.
No summary provided
What Happened:
At 2:00 AM, monitoring systems detected a sudden spike in connection failures across multiple services in a production Kubernetes cluster using Istio service mesh. Users reported widespread "connection refused" errors, and internal services were unable to communicate with each other. The incident coincided with the scheduled expiration of TLS certificates used for mTLS communication within the service mesh. The automated certificate rotation process had failed silently several days earlier, but the issue only became apparent when the certificates actually expired.
Diagnosis Steps:
Analyzed Istio proxy logs to identify TLS handshake failures.
Checked certificate expiration dates using OpenSSL commands.
Reviewed certificate issuance and rotation automation logs.
Examined Istio control plane components for errors.
Verified certificate authority (CA) functionality.
Root Cause:
The investigation revealed multiple issues with the certificate management: 1. The certificate rotation job had failed due to an API permission change 2. No alerting was configured for failed certificate rotation attempts 3. Certificate expiration monitoring was missing 4. The rotation job failure was logged but not escalated 5. Certificate lifetimes were too short (7 days) with no buffer period
Fix/Workaround:
• Implemented immediate manual certificate rotation to restore service
• Created a comprehensive certificate management strategy
• Added monitoring and alerting for certificate expiration
• Extended certificate lifetimes with appropriate overlap
• Implemented automated testing of the rotation process
Lessons Learned:
Certificate management in service meshes requires robust automation, monitoring, and failure detection.
How to Avoid:
Implement certificate expiration monitoring with alerts at multiple thresholds.
Configure longer certificate lifetimes with appropriate overlap periods.
Test certificate rotation processes regularly in non-production environments.
Create alerting for failed rotation attempts, not just expiration events (see the sketch below).
Document and practice manual certificate rotation procedures.
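As a sketch of alerting on failed rotation attempts rather than only on expiration, the following program lists the Jobs spawned by a rotation CronJob and flags failures so a silent failure is escalated instead of merely logged. The istio-system namespace and the app=cert-rotation label are assumptions about how the rotation job is deployed.
```go
// rotation_job_monitor.go - sketch that flags failed certificate-rotation Jobs
// so a silent failure is escalated instead of just logged. The namespace and
// the app=cert-rotation label are assumptions about the rotation CronJob.
package main

import (
    "context"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Fatalf("failed to load in-cluster config: %v", err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatalf("failed to create Kubernetes client: %v", err)
    }

    // List the Jobs spawned by the rotation CronJob and count failures.
    jobs, err := clientset.BatchV1().Jobs("istio-system").List(context.TODO(), metav1.ListOptions{
        LabelSelector: "app=cert-rotation", // assumed label on the rotation CronJob
    })
    if err != nil {
        log.Fatalf("failed to list rotation jobs: %v", err)
    }

    failed := 0
    for _, job := range jobs.Items {
        if job.Status.Failed > 0 {
            failed++
            log.Printf("ALERT: rotation job %s has %d failed pods", job.Name, job.Status.Failed)
        }
    }
    if failed == 0 {
        log.Printf("all %d recent rotation jobs completed successfully", len(jobs.Items))
    }
    // In practice this result would be exported as a metric or pushed to an
    // alerting webhook rather than only logged.
}
```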
No summary provided
What Happened:
During a marketing campaign launch, a company's public API experienced a sudden traffic surge. Despite having rate limiting configured in the Kong API Gateway, the traffic overwhelmed backend services, causing cascading failures across the platform. The operations team had to implement emergency measures to restore service, including temporarily blocking certain client IPs and scaling up backend services. Post-incident analysis revealed that the rate limiting configuration was ineffective under the specific traffic patterns experienced.
Diagnosis Steps:
Analyzed API gateway logs for traffic patterns and rate limiting behavior.
Examined backend service metrics during the incident.
Reviewed rate limiting configuration in the API gateway.
Tested rate limiting effectiveness under various traffic patterns.
Compared global and route-specific rate limiting settings.
Root Cause:
The investigation revealed multiple issues with the rate limiting configuration: 1. Rate limits were configured per route but not globally across routes 2. The rate limiting window was too large (1 minute), allowing traffic bursts 3. Rate limiting was based on client IP, but traffic came through a load balancer with IP masking 4. No rate limiting was applied to authenticated vs. unauthenticated requests 5. The rate limiting plugin was configured with "continue on error" mode
Fix/Workaround:
• Implemented immediate fixes to protect backend services
• Reconfigured rate limiting with appropriate granularity
• Added global and route-specific limits with proper windows
• Implemented advanced identification beyond client IP (see the sketch below)
• Created tiered rate limiting based on client importance
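A minimal sketch of identification beyond the client IP, assuming the load balancer appends to X-Forwarded-For and that an X-API-Key header carries the credential; both header names are illustrative rather than the exact scheme used in production.
```go
// limit_key.go - sketch of deriving a rate-limit key when traffic arrives
// through a load balancer that masks the client IP. Header names and the
// trusted-proxy assumption are illustrative.
package main

import (
    "fmt"
    "net/http"
    "strings"
)

// rateLimitKey prefers the API credential, then the original client address
// from X-Forwarded-For, and only then the direct remote address.
// Note: the left-most X-Forwarded-For entry is client-supplied and spoofable;
// a production setup should only trust entries appended by known proxies.
func rateLimitKey(r *http.Request) string {
    if key := r.Header.Get("X-API-Key"); key != "" {
        return "credential:" + key
    }
    if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
        parts := strings.Split(xff, ",")
        return "ip:" + strings.TrimSpace(parts[0])
    }
    return "ip:" + r.RemoteAddr
}

func main() {
    req, _ := http.NewRequest(http.MethodGet, "https://api.example.com/v1/orders", nil) // placeholder endpoint
    req.Header.Set("X-Forwarded-For", "203.0.113.10, 10.0.0.5")
    req.RemoteAddr = "10.0.0.5:52114"
    fmt.Println(rateLimitKey(req)) // ip:203.0.113.10
}
```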
Lessons Learned:
API gateway rate limiting requires careful configuration and testing under realistic traffic patterns.
How to Avoid:
Implement multi-layered rate limiting (global, service, route).
Test rate limiting under various traffic patterns, including bursts.
Configure appropriate identification methods beyond client IP.
Create tiered rate limiting based on client authentication and importance.
Monitor rate limiting effectiveness and adjust based on traffic patterns.
No summary provided
What Happened:
A company implemented JWT-based authentication for their APIs using Kong API Gateway. After several weeks in production, the security team discovered that certain protected endpoints were accessible without valid authentication. Investigation revealed that the JWT validation configuration in the API gateway was incorrectly implemented, allowing requests with malformed or expired tokens to pass through to backend services. This created a significant security vulnerability that potentially exposed sensitive data.
Diagnosis Steps:
Analyzed API gateway logs for authentication patterns.
Tested endpoints with various token configurations.
Reviewed JWT validation plugin configuration.
Examined token issuance and validation flow.
Verified claims validation and signature verification settings.
Root Cause:
The investigation revealed multiple issues with the JWT validation: 1. The JWT signature verification was misconfigured with incorrect public keys 2. Token expiration validation was not properly enforced 3. Required claims validation was incomplete 4. The plugin configuration was inconsistently applied across routes 5. Error handling allowed certain invalid tokens to pass through
Fix/Workaround:
• Implemented immediate fixes to secure all endpoints
• Corrected JWT validation configuration with proper signature verification
• Enforced token expiration and claims validation
• Standardized plugin configuration across all routes
• Improved error handling for invalid tokens
Lessons Learned:
API gateway authentication requires careful configuration and comprehensive testing.
How to Avoid:
Implement comprehensive security testing for API gateway configurations.
Create automated validation tests for authentication mechanisms.
Standardize authentication plugin configuration across routes.
Regularly audit authentication logs for unusual patterns.
Establish clear ownership and review processes for security configurations.
```yaml
# Example of proper Kong JWT plugin configuration
plugins:
- name: jwt
config:
# Properly configured claims validation
claims_to_verify:
- exp
- nbf
# Multiple signature verification algorithms
algorithms:
- RS256
- ES256
# Proper key configuration
key_claim_name: kid
secret_is_base64: false
# Comprehensive validation settings
run_on_preflight: true
maximum_expiration: 86400
# Proper error handling
uri_param_names:
- jwt
cookie_names: []
header_names:
- Authorization
# Enforce token format
token_format:
bearer: true
base64: false
```
```go
// Example Go code for proper JWT token validation
package main
import (
"context"
"fmt"
"net/http"
"strings"
"github.com/golang-jwt/jwt/v4"
)
// Define custom claims with proper validation fields
type CustomClaims struct {
Permissions []string `json:"permissions"`
jwt.RegisteredClaims
}
// JWT validation middleware with comprehensive checks
func JWTMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Extract token from Authorization header
authHeader := r.Header.Get("Authorization")
if authHeader == "" {
http.Error(w, "Authorization header required", http.StatusUnauthorized)
return
}
// Validate Bearer format
bearerPrefix := "Bearer "
if !strings.HasPrefix(authHeader, bearerPrefix) {
http.Error(w, "Invalid authorization format", http.StatusUnauthorized)
return
}
// Extract token
tokenString := strings.TrimPrefix(authHeader, bearerPrefix)
// Parse and validate token with custom claims
token, err := jwt.ParseWithClaims(tokenString, &CustomClaims{}, func(token *jwt.Token) (interface{}, error) {
// Validate signing algorithm
if _, ok := token.Method.(*jwt.SigningMethodRSA); !ok {
return nil, fmt.Errorf("unexpected signing method: %v", token.Header["alg"])
}
// Get key ID from token header
kid, ok := token.Header["kid"].(string)
if !ok {
return nil, fmt.Errorf("key ID not found in token")
}
// Retrieve public key based on key ID (implementation depends on key management)
publicKey, err := getPublicKey(kid)
if err != nil {
return nil, err
}
return publicKey, nil
})
// Handle validation errors with specific error messages
if err != nil {
switch {
case strings.Contains(err.Error(), "token is expired"):
http.Error(w, "Token expired", http.StatusUnauthorized)
case strings.Contains(err.Error(), "signature is invalid"):
http.Error(w, "Invalid token signature", http.StatusUnauthorized)
default:
http.Error(w, "Invalid token: "+err.Error(), http.StatusUnauthorized)
}
return
}
// Validate token claims
if claims, ok := token.Claims.(*CustomClaims); ok && token.Valid {
// Additional custom validation
if !hasRequiredPermissions(claims.Permissions, r.URL.Path, r.Method) {
http.Error(w, "Insufficient permissions", http.StatusForbidden)
return
}
// Set claims in request context for downstream handlers
ctx := setClaimsContext(r.Context(), claims)
next.ServeHTTP(w, r.WithContext(ctx))
} else {
http.Error(w, "Invalid token claims", http.StatusUnauthorized)
}
})
}
// Helper functions (implementation details omitted)
func getPublicKey(kid string) (interface{}, error) {
// Implementation would retrieve the correct public key based on key ID
return nil, nil
}
func hasRequiredPermissions(permissions []string, path, method string) bool {
// Implementation would check if the token has the required permissions
return true
}
func setClaimsContext(ctx context.Context, claims *CustomClaims) context.Context {
// Implementation would attach the claims to the request context (e.g. context.WithValue)
return ctx
}
```
No summary provided
What Happened:
A financial services company using Istio service mesh for their microservices architecture experienced widespread authentication failures during peak business hours. Services began reporting TLS handshake errors, and inter-service communication broke down across multiple critical applications. The incident caused a partial outage affecting customer-facing services. Investigation revealed that the Istio-managed mTLS certificates had expired, and the automatic rotation mechanism had silently failed weeks earlier without triggering alerts.
Diagnosis Steps:
Analyzed service mesh proxy logs for error patterns.
Examined certificate expiration dates across the mesh.
Reviewed certificate issuance and rotation configurations.
Checked certificate authority (CA) status and health.
Investigated recent changes to the service mesh configuration.
Root Cause:
The investigation revealed multiple issues with certificate management: 1. The Istio certificate authority (istiod) had insufficient permissions to write to the certificate storage location 2. Certificate rotation logs were being discarded due to a misconfigured log level 3. No monitoring was in place for certificate expiration or rotation failures 4. A recent security hardening change had modified the certificate storage permissions 5. The certificate rotation failure predated the expiration by several weeks
Fix/Workaround:
• Implemented immediate fixes to restore service
• Manually rotated all expired certificates
• Corrected permissions for certificate storage locations
• Configured proper logging for certificate operations
• Implemented certificate expiration monitoring and alerting
# Istio Certificate Monitoring Configuration
# File: istio-cert-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: istio-cert-expiry-alerts
namespace: istio-system
spec:
groups:
- name: istio-cert-expiry
rules:
# Alert when workload certificates are nearing expiration
- alert: IstioWorkloadCertExpiringSoon
expr: |
(
max by(source_workload, source_namespace) (
envoy_server_ssl_socket_factory_context_ssl_context_days_until_first_cert_expires{
reporter="source"
}
) < 7
)
for: 1h
labels:
severity: warning
team: platform
annotations:
summary: "Istio workload certificate expiring soon"
description: "Workload {{ $labels.source_workload }} in namespace {{ $labels.source_namespace }} has a certificate that will expire in {{ $value }} days."
# Critical alert for imminent expiration
- alert: IstioWorkloadCertExpiringCritical
expr: |
(
max by(source_workload, source_namespace) (
envoy_server_ssl_socket_factory_context_ssl_context_days_until_first_cert_expires{
reporter="source"
}
) < 2
)
for: 10m
labels:
severity: critical
team: platform
annotations:
summary: "Istio workload certificate critically close to expiration"
description: "CRITICAL: Workload {{ $labels.source_workload }} in namespace {{ $labels.source_namespace }} has a certificate that will expire in {{ $value }} days."
# Alert when istiod certificates are nearing expiration
- alert: IstiodCertExpiringSoon
expr: |
(
max by(job) (
citadel_server_cert_expiry_seconds / 86400 < 30
)
)
for: 1h
labels:
severity: warning
team: platform
annotations:
summary: "Istiod certificate expiring soon"
description: "Istiod certificate will expire in {{ $value }} days."
# Alert for certificate rotation failures
- alert: IstioCertRotationFailure
expr: |
increase(citadel_server_csr_sign_error_count[1h]) > 0
for: 15m
labels:
severity: critical
team: platform
annotations:
summary: "Istio certificate rotation failures detected"
description: "Istio certificate rotation has been failing for the last 15 minutes."
---
apiVersion: v1
kind: ConfigMap
metadata:
name: istio-cert-checker
namespace: istio-system
data:
cert-checker.sh: |
#!/bin/bash
# Script to check Istio certificate health
# Check istiod certificate
echo "Checking istiod certificate..."
ISTIOD_CERT_EXPIRY=$(kubectl exec -n istio-system deployment/istiod -- sh -c "openssl x509 -in /etc/certs/cert-chain.pem -noout -dates | grep notAfter | cut -d= -f2")
ISTIOD_EXPIRY_SECONDS=$(date -d "$ISTIOD_CERT_EXPIRY" +%s)
NOW_SECONDS=$(date +%s)
DAYS_REMAINING=$(( ($ISTIOD_EXPIRY_SECONDS - $NOW_SECONDS) / 86400 ))
echo "Istiod certificate expires in $DAYS_REMAINING days"
if [ $DAYS_REMAINING -lt 30 ]; then
echo "WARNING: Istiod certificate expiring soon!"
fi
# Check workload certificates (sample of pods)
echo "Checking workload certificates..."
NAMESPACES=$(kubectl get namespace -l istio-injection=enabled -o jsonpath='{.items[*].metadata.name}')
for NS in $NAMESPACES; do
PODS=$(kubectl get pods -n $NS -o jsonpath='{.items[*].metadata.name}')
for POD in $PODS; do
if kubectl exec -n $NS $POD -c istio-proxy -- ls /etc/certs/cert-chain.pem > /dev/null 2>&1; then
CERT_EXPIRY=$(kubectl exec -n $NS $POD -c istio-proxy -- sh -c "openssl x509 -in /etc/certs/cert-chain.pem -noout -dates | grep notAfter | cut -d= -f2")
EXPIRY_SECONDS=$(date -d "$CERT_EXPIRY" +%s)
POD_DAYS_REMAINING=$(( ($EXPIRY_SECONDS - $NOW_SECONDS) / 86400 ))
echo "Pod $POD in namespace $NS: Certificate expires in $POD_DAYS_REMAINING days"
if [ $POD_DAYS_REMAINING -lt 7 ]; then
echo "WARNING: Certificate for $POD in $NS expiring soon!"
fi
fi
done
done
# Check certificate rotation logs
echo "Checking certificate rotation logs..."
ROTATION_ERRORS=$(kubectl logs -n istio-system -l app=istiod --tail=1000 | grep -c "Failed to sign CSR")
if [ $ROTATION_ERRORS -gt 0 ]; then
echo "ERROR: Detected $ROTATION_ERRORS certificate rotation failures in recent logs!"
else
echo "No recent certificate rotation errors detected"
fi
# Check CA permissions
echo "Checking CA permissions..."
kubectl exec -n istio-system deployment/istiod -- ls -la /var/run/secrets/istio-dns
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: istio-cert-checker
namespace: istio-system
spec:
schedule: "0 */6 * * *" # Run every 6 hours
jobTemplate:
spec:
template:
spec:
serviceAccountName: istio-cert-checker
containers:
- name: cert-checker
image: istio/kubectl:latest
command:
- /bin/bash
- /scripts/cert-checker.sh
volumeMounts:
- name: scripts
mountPath: /scripts
volumes:
- name: scripts
configMap:
name: istio-cert-checker
defaultMode: 0755
restartPolicy: OnFailure
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: istio-cert-checker
namespace: istio-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: istio-cert-checker
rules:
- apiGroups: [""]
resources: ["pods", "namespaces"]
verbs: ["get", "list"]
- apiGroups: [""]
resources: ["pods/exec"]
verbs: ["create"]
- apiGroups: [""]
resources: ["pods/log"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: istio-cert-checker
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: istio-cert-checker
subjects:
- kind: ServiceAccount
name: istio-cert-checker
namespace: istio-system
// Go implementation of certificate rotation verification
// File: cert-rotation-verifier.go
package main
import (
"context"
"crypto/x509"
"encoding/pem"
"fmt"
"log"
"os"
"strings"
"time"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
)
const (
// Minimum acceptable certificate lifetime
minCertLifetime = 7 * 24 * time.Hour
// Critical certificate lifetime threshold
criticalCertLifetime = 2 * 24 * time.Hour
)
func main() {
// Create Kubernetes client
config, err := rest.InClusterConfig()
if err != nil {
log.Fatalf("Failed to create in-cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Failed to create Kubernetes client: %v", err)
}
// Get all namespaces with Istio injection enabled
namespaces, err := clientset.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{
LabelSelector: "istio-injection=enabled",
})
if err != nil {
log.Fatalf("Failed to list namespaces: %v", err)
}
// Track certificate issues
var issues []string
// Check istiod certificates
istiodIssues := checkIstiodCertificates(clientset)
issues = append(issues, istiodIssues...)
// Check workload certificates
for _, ns := range namespaces.Items {
namespace := ns.Name
pods, err := clientset.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list pods in namespace %s: %v", namespace, err)
continue
}
for _, pod := range pods.Items {
podIssues := checkPodCertificates(clientset, pod.Name, namespace)
issues = append(issues, podIssues...)
}
}
// Check certificate rotation logs
rotationIssues := checkCertificateRotationLogs(clientset)
issues = append(issues, rotationIssues...)
// Report issues
if len(issues) > 0 {
log.Printf("Found %d certificate issues:", len(issues))
for i, issue := range issues {
log.Printf("%d. %s", i+1, issue)
}
os.Exit(1)
} else {
log.Println("No certificate issues found.")
}
}
func checkIstiodCertificates(clientset *kubernetes.Clientset) []string {
var issues []string
// Get istiod pods
pods, err := clientset.CoreV1().Pods("istio-system").List(context.TODO(), metav1.ListOptions{
LabelSelector: "app=istiod",
})
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to list istiod pods: %v", err))
return issues
}
for _, pod := range pods.Items {
// Get certificate from istiod pod
certBytes, err := execInPod(clientset, pod.Name, "istio-system", "cat /etc/certs/cert-chain.pem")
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to get certificate from istiod pod %s: %v", pod.Name, err))
continue
}
// Parse certificate
block, _ := pem.Decode(certBytes)
if block == nil {
issues = append(issues, fmt.Sprintf("Failed to decode PEM certificate from istiod pod %s", pod.Name))
continue
}
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to parse certificate from istiod pod %s: %v", pod.Name, err))
continue
}
// Check expiration
timeUntilExpiry := time.Until(cert.NotAfter)
if timeUntilExpiry < criticalCertLifetime {
issues = append(issues, fmt.Sprintf("CRITICAL: Istiod certificate in pod %s will expire in %.1f hours",
pod.Name, timeUntilExpiry.Hours()))
} else if timeUntilExpiry < minCertLifetime {
issues = append(issues, fmt.Sprintf("WARNING: Istiod certificate in pod %s will expire in %.1f days",
pod.Name, timeUntilExpiry.Hours()/24))
}
}
return issues
}
func checkPodCertificates(clientset *kubernetes.Clientset, podName, namespace string) []string {
var issues []string
// Check if pod has istio-proxy container
pod, err := clientset.CoreV1().Pods(namespace).Get(context.TODO(), podName, metav1.GetOptions{})
if err != nil {
return issues
}
hasIstioProxy := false
for _, container := range pod.Spec.Containers {
if container.Name == "istio-proxy" {
hasIstioProxy = true
break
}
}
if !hasIstioProxy {
return issues
}
// Get certificate from istio-proxy container
certBytes, err := execInPod(clientset, podName, namespace, "cat /etc/certs/cert-chain.pem")
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to get certificate from pod %s/%s: %v", namespace, podName, err))
return issues
}
// Parse certificate
block, _ := pem.Decode(certBytes)
if block == nil {
issues = append(issues, fmt.Sprintf("Failed to decode PEM certificate from pod %s/%s", namespace, podName))
return issues
}
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to parse certificate from pod %s/%s: %v", namespace, podName, err))
return issues
}
// Check expiration
timeUntilExpiry := time.Until(cert.NotAfter)
if timeUntilExpiry < criticalCertLifetime {
issues = append(issues, fmt.Sprintf("CRITICAL: Certificate in pod %s/%s will expire in %.1f hours",
namespace, podName, timeUntilExpiry.Hours()))
} else if timeUntilExpiry < minCertLifetime {
issues = append(issues, fmt.Sprintf("WARNING: Certificate in pod %s/%s will expire in %.1f days",
namespace, podName, timeUntilExpiry.Hours()/24))
}
return issues
}
func checkCertificateRotationLogs(clientset *kubernetes.Clientset) []string {
var issues []string
// Get istiod pods
pods, err := clientset.CoreV1().Pods("istio-system").List(context.TODO(), metav1.ListOptions{
LabelSelector: "app=istiod",
})
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to list istiod pods: %v", err))
return issues
}
for _, pod := range pods.Items {
// Get logs from istiod pod
logs, err := getPodLogs(clientset, pod.Name, "istio-system")
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to get logs from istiod pod %s: %v", pod.Name, err))
continue
}
// Check for certificate rotation errors
if containsString(logs, "Failed to sign CSR") {
issues = append(issues, fmt.Sprintf("Certificate rotation failures detected in istiod pod %s", pod.Name))
}
if containsString(logs, "Error rotating certificate") {
issues = append(issues, fmt.Sprintf("Certificate rotation errors detected in istiod pod %s", pod.Name))
}
if containsString(logs, "permission denied") && containsString(logs, "certificate") {
issues = append(issues, fmt.Sprintf("Certificate permission issues detected in istiod pod %s", pod.Name))
}
}
return issues
}
// Helper functions
func execInPod(clientset *kubernetes.Clientset, podName, namespace, command string) ([]byte, error) {
// Implementation omitted for brevity
// This would use the Kubernetes API to execute a command in a pod
return []byte{}, nil
}
func getPodLogs(clientset *kubernetes.Clientset, podName, namespace string) (string, error) {
// Implementation omitted for brevity
// This would use the Kubernetes API to get logs from a pod
return "", nil
}
func containsString(s, substr string) bool {
return strings.Contains(s, substr)
}
Lessons Learned:
Certificate management in service meshes requires proactive monitoring and alerting to prevent outages due to expiration.
How to Avoid:
Implement certificate expiration monitoring and alerting.
Configure proper logging for certificate rotation operations.
Regularly audit certificate management permissions.
Create automated tests to verify certificate rotation functionality.
Establish clear incident response procedures for certificate-related issues.
No summary provided
What Happened:
A large financial services company used Istio as their service mesh for securing service-to-service communication in their Kubernetes environment. All internal communication was encrypted using mTLS with certificates managed by the mesh's certificate authority. During a weekend, multiple services began experiencing connection failures, and by Monday morning, the entire platform was effectively down. Investigation revealed that the root certificates used by the service mesh had expired, and the automatic rotation mechanism had silently failed weeks earlier. The incident caused a complete production outage requiring manual intervention to restore service.
Diagnosis Steps:
Analyzed connection errors in service logs.
Examined certificate expiration dates across the mesh.
Reviewed certificate rotation logs and configuration.
Checked the status of the certificate authority components.
Tested certificate issuance in a controlled environment.
Root Cause:
The investigation revealed multiple issues with the certificate management: 1. The certificate rotation job had been failing silently for weeks 2. Monitoring for certificate expiration was not implemented 3. The certificate authority's storage was corrupted due to a previous incident 4. The mesh was configured with a short certificate lifetime but no safety margin 5. There was no documented procedure for manual certificate rotation
Fix/Workaround:
• Implemented immediate fix to restore service
• Generated new root certificates with extended validity
• Forced rotation of all service certificates
• Implemented monitoring for certificate expiration
• Created runbooks for manual certificate rotation
Lessons Learned:
Certificate management in service meshes requires robust monitoring, alerting, and fallback procedures.
How to Avoid:
Implement monitoring for certificate expiration with adequate warning time.
Configure certificate rotation with appropriate safety margins.
Test certificate rotation procedures regularly (see the sketch below).
Create documented procedures for manual certificate rotation.
Implement alerting for certificate rotation failures.
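As one way to exercise rotation procedures regularly, here is a hedged sketch of a scheduled check that fails when the mesh root certificate has less than a safety margin of validity left. The cacerts secret and root-cert.pem key follow the Istio plug-in CA convention, and the 30-day margin is an assumption to be tuned per environment.
```go
// root_cert_check.go - sketch of a scheduled check that fails when the mesh
// root certificate in the cacerts secret is close to expiring. The secret and
// key names follow the Istio plug-in CA convention; the margin is an assumption.
package main

import (
    "context"
    "crypto/x509"
    "encoding/pem"
    "log"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

const safetyMargin = 30 * 24 * time.Hour // assumed 30-day margin

func main() {
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Fatalf("failed to load in-cluster config: %v", err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatalf("failed to create Kubernetes client: %v", err)
    }

    // Read the plug-in CA secret and parse the root certificate.
    secret, err := clientset.CoreV1().Secrets("istio-system").Get(context.TODO(), "cacerts", metav1.GetOptions{})
    if err != nil {
        log.Fatalf("failed to read cacerts secret: %v", err)
    }
    block, _ := pem.Decode(secret.Data["root-cert.pem"])
    if block == nil {
        log.Fatal("root-cert.pem is missing or not valid PEM")
    }
    cert, err := x509.ParseCertificate(block.Bytes)
    if err != nil {
        log.Fatalf("failed to parse root certificate: %v", err)
    }

    // Fail the check (non-zero exit) if the remaining validity is too short.
    remaining := time.Until(cert.NotAfter)
    if remaining < safetyMargin {
        log.Fatalf("root certificate expires in %.1f days, below the %.0f-day safety margin",
            remaining.Hours()/24, safetyMargin.Hours()/24)
    }
    log.Printf("root certificate healthy: %.1f days of validity remaining", remaining.Hours()/24)
}
```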