# API Gateway and Service Mesh Scenarios
No summary provided
What Happened:
During an automated certificate rotation, services began experiencing mutual TLS authentication failures. The issue started with intermittent 503 errors and gradually escalated to widespread service disruption across the mesh.
Diagnosis Steps:
Examined Istio proxy logs for authentication errors.
Checked certificate expiration dates and rotation status.
Verified Istio control plane component health.
Analyzed recent configuration changes and updates.
Tested certificate validation manually (see the verification sketch below).
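For the manual validation step, a minimal Go sketch along these lines can check a workload certificate chain against the mesh root CA with crypto/x509. The file names are illustrative and assume the PEMs were copied out of the istio-proxy sidecar; this is not the tooling used during the incident.
// verify_chain.go - hypothetical helper for manual chain validation
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	rootPEM, err := os.ReadFile("root-cert.pem") // mesh root CA, assumed copied locally
	if err != nil {
		log.Fatalf("read root cert: %v", err)
	}
	chainPEM, err := os.ReadFile("cert-chain.pem") // workload chain: leaf first, then intermediates
	if err != nil {
		log.Fatalf("read cert chain: %v", err)
	}

	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(rootPEM) {
		log.Fatal("no valid root certificates found")
	}

	// Parse every certificate in the chain file.
	var certs []*x509.Certificate
	rest := chainPEM
	for {
		var block *pem.Block
		block, rest = pem.Decode(rest)
		if block == nil {
			break
		}
		c, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			log.Fatalf("parse certificate: %v", err)
		}
		certs = append(certs, c)
	}
	if len(certs) == 0 {
		log.Fatal("no certificates found in chain file")
	}

	intermediates := x509.NewCertPool()
	for _, c := range certs[1:] {
		intermediates.AddCert(c)
	}

	// Verify the leaf against the root CA via any intermediates.
	if _, err := certs[0].Verify(x509.VerifyOptions{Roots: roots, Intermediates: intermediates}); err != nil {
		log.Fatalf("chain does NOT verify against the root CA: %v", err)
	}
	fmt.Printf("chain verifies; leaf expires %s\n", certs[0].NotAfter.Format(time.RFC3339))
}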
Root Cause:
The Istio certificate authority (Citadel) was unable to distribute new certificates due to a combination of issues:
1. The Kubernetes secret used for storing the root CA had incorrect permissions
2. A recent Istio upgrade changed the certificate rotation process without updating documentation
3. Custom certificate validation logic in some services rejected the new certificate format
Fix/Workaround:
• Short-term: Restored previous certificates and disabled automatic rotation:
# Patch to disable automatic rotation temporarily
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    certificates:
      - secretName: cacerts
        dnsNames:
          - istio-ca.istio-system.svc
    defaultConfig:
      proxyMetadata:
        ISTIO_META_CERT_ROTATION: "false"
• Long-term: Implemented proper certificate management:
# Proper Istio certificate configuration
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
          - name: PILOT_CERT_PROVIDER
            value: "istiod"
          - name: PILOT_ENABLE_XDS_CACHE
            value: "true"
    istiod:
      k8s:
        overlays:
          - apiVersion: apps/v1
            kind: Deployment
            name: istiod
            patches:
              - path: spec.template.spec.containers.[name:discovery].args[7]
                value: "--caCertTTL=8760h"
              - path: spec.template.spec.containers.[name:discovery].args[8]
                value: "--workloadCertTTL=24h"
  meshConfig:
    defaultConfig:
      proxyMetadata:
        ISTIO_META_CERT_ROTATION: "true"
        ISTIO_META_CERT_ROTATION_GRACE_PERIOD_RATIO: "0.2"
• Created a certificate monitoring solution:
// cert_monitor.go
package main
import (
"context"
"crypto/x509"
"encoding/pem"
"fmt"
"log"
"os"
"time"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
"net/http"
)
var (
certExpiryDays = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "istio_cert_expiry_days",
Help: "Days until certificate expiration",
},
[]string{"namespace", "secret_name", "cert_type"},
)
certRotationSuccess = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "istio_cert_rotation_success_total",
Help: "Total number of successful certificate rotations",
},
[]string{"namespace", "secret_name"},
)
certRotationFailure = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "istio_cert_rotation_failure_total",
Help: "Total number of failed certificate rotations",
},
[]string{"namespace", "secret_name", "reason"},
)
)
func main() {
// Set up Kubernetes client
config, err := rest.InClusterConfig()
if err != nil {
log.Fatalf("Failed to get cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Failed to create Kubernetes client: %v", err)
}
// Start HTTP server for Prometheus metrics
http.Handle("/metrics", promhttp.Handler())
go func() {
log.Fatal(http.ListenAndServe(":8080", nil))
}()
// Monitor certificates
monitorCertificates(clientset)
}
func monitorCertificates(clientset *kubernetes.Clientset) {
for {
// Get all namespaces
namespaces, err := clientset.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list namespaces: %v", err)
time.Sleep(5 * time.Minute)
continue
}
// Check certificates in each namespace
for _, namespace := range namespaces.Items {
ns := namespace.Name
// Get all secrets in the namespace
secrets, err := clientset.CoreV1().Secrets(ns).List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list secrets in namespace %s: %v", ns, err)
continue
}
// Check each secret for certificates
for _, secret := range secrets.Items {
// Check if this is a TLS secret
if secret.Type != "kubernetes.io/tls" && secret.Type != "istio.io/key-and-cert" {
continue
}
// Check certificate data
for key, data := range secret.Data {
if key == "ca.crt" || key == "tls.crt" || key == "cert-chain.pem" || key == "root-cert.pem" {
// Parse certificate
block, _ := pem.Decode(data)
if block == nil {
log.Printf("Failed to decode PEM block from %s in secret %s/%s", key, ns, secret.Name)
certRotationFailure.WithLabelValues(ns, secret.Name, "decode_failure").Inc()
continue
}
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
log.Printf("Failed to parse certificate from %s in secret %s/%s: %v", key, ns, secret.Name, err)
certRotationFailure.WithLabelValues(ns, secret.Name, "parse_failure").Inc()
continue
}
// Calculate days until expiration
expiryDays := time.Until(cert.NotAfter).Hours() / 24
certExpiryDays.WithLabelValues(ns, secret.Name, key).Set(expiryDays)
// Log warning if certificate is expiring soon
if expiryDays < 30 {
log.Printf("WARNING: Certificate %s in secret %s/%s expires in %.1f days", key, ns, secret.Name, expiryDays)
}
// Check if certificate was recently rotated
issuedDays := time.Since(cert.NotBefore).Hours() / 24
if issuedDays < 1 {
log.Printf("Certificate %s in secret %s/%s was recently rotated (%.1f hours ago)", key, ns, secret.Name, time.Since(cert.NotBefore).Hours())
certRotationSuccess.WithLabelValues(ns, secret.Name).Inc()
}
}
}
}
}
// Sleep before next check
time.Sleep(1 * time.Hour)
}
}
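As a sanity check for the gauge math above, a self-contained sketch like the following (assumed, not part of the monitor) generates a short-lived self-signed certificate and runs it through the same PEM-decode / parse / days-until-expiry path, so the calculation can be verified without a cluster.
// expiry_check_example.go - standalone sanity check for the expiry math
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"log"
	"math/big"
	"time"
)

func main() {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		log.Fatal(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example.istio-test"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour), // 24h "workload" cert
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		log.Fatal(err)
	}
	pemBytes := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})

	// Same path as the monitor: PEM decode, x509 parse, days until expiry.
	block, _ := pem.Decode(pemBytes)
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	expiryDays := time.Until(cert.NotAfter).Hours() / 24
	fmt.Printf("days until expiry: %.2f\n", expiryDays) // roughly 1.0 for a 24h cert
}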
• Implemented a certificate rotation testing procedure:
#!/bin/bash
# test_cert_rotation.sh
set -euo pipefail
NAMESPACE=${1:-istio-system}
SECRET_NAME=${2:-istio-ca-secret}
WORKLOAD_NAMESPACE=${3:-default}
WORKLOAD_NAME=${4:-sleep}
echo "Testing certificate rotation for Istio in namespace $NAMESPACE"
# Check istiod status
echo "Checking istiod status..."
kubectl get pods -n $NAMESPACE -l app=istiod
# Check current root certificate
echo "Checking current root certificate..."
kubectl get secret $SECRET_NAME -n $NAMESPACE -o jsonpath='{.data.root-cert\.pem}' | base64 -d | openssl x509 -noout -text | grep "Validity" -A 2
# Check workload certificates
echo "Checking workload certificates..."
POD_NAME=$(kubectl get pod -n $WORKLOAD_NAMESPACE -l app=$WORKLOAD_NAME -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n $WORKLOAD_NAMESPACE $POD_NAME -c istio-proxy -- ls -la /var/run/secrets/istio/
# Get certificate expiry
kubectl exec -n $WORKLOAD_NAMESPACE $POD_NAME -c istio-proxy -- cat /var/run/secrets/istio/cert-chain.pem | openssl x509 -noout -text | grep "Validity" -A 2
# Trigger certificate rotation
echo "Triggering certificate rotation..."
kubectl delete secret $SECRET_NAME -n $NAMESPACE
# Wait for istiod to restart
echo "Waiting for istiod to restart..."
kubectl rollout restart deployment/istiod -n $NAMESPACE
kubectl rollout status deployment/istiod -n $NAMESPACE
# Wait for workload certificates to be rotated
echo "Waiting for workload certificates to be rotated..."
sleep 60
# Verify new certificates
echo "Verifying new certificates..."
kubectl get secret $SECRET_NAME -n $NAMESPACE -o jsonpath='{.data.root-cert\.pem}' | base64 -d | openssl x509 -noout -text | grep "Validity" -A 2
# Verify workload certificates
echo "Verifying workload certificates..."
kubectl exec -n $WORKLOAD_NAMESPACE $POD_NAME -c istio-proxy -- cat /var/run/secrets/istio/cert-chain.pem | openssl x509 -noout -text | grep "Validity" -A 2
# Test connectivity
echo "Testing connectivity..."
kubectl exec -n $WORKLOAD_NAMESPACE $POD_NAME -c sleep -- curl -s httpbin.default:8000/headers | grep "X-Forwarded-Client-Cert"
echo "Certificate rotation test completed successfully"
Lessons Learned:
Certificate management in service meshes requires careful planning and monitoring.
How to Avoid:
Implement certificate monitoring with alerts for upcoming expirations.
Test certificate rotation procedures regularly in non-production environments.
Document certificate management procedures and automate where possible.
Use longer-lived root certificates and shorter-lived workload certificates.
Implement graceful certificate rotation with overlapping validity periods (see the timing sketch below).
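A minimal sketch of what "overlapping validity periods" means in practice: given a certificate's NotBefore/NotAfter and a grace-period ratio (0.2 matches the ISTIO_META_CERT_ROTATION_GRACE_PERIOD_RATIO used above), rotation should start that fraction of the lifetime before expiry, so old and new certificates are both valid during the switchover. The function name is illustrative.
// rotation_window.go - hedged sketch of grace-period rotation timing
package main

import (
	"fmt"
	"time"
)

// rotationDeadline returns the point at which a new certificate should be
// requested: graceRatio of the lifetime before expiry, so both certificates
// overlap while workloads reload.
func rotationDeadline(notBefore, notAfter time.Time, graceRatio float64) time.Time {
	lifetime := notAfter.Sub(notBefore)
	grace := time.Duration(float64(lifetime) * graceRatio)
	return notAfter.Add(-grace)
}

func main() {
	notBefore := time.Now()
	notAfter := notBefore.Add(24 * time.Hour) // 24h workload cert, as configured above
	deadline := rotationDeadline(notBefore, notAfter, 0.2)
	fmt.Printf("rotate at %s (%.1f hours before expiry)\n",
		deadline.Format(time.RFC3339), notAfter.Sub(deadline).Hours())
}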
No summary provided
What Happened:
A production API gateway started experiencing high CPU and memory usage, eventually leading to service degradation. The issue was traced to a single client making thousands of requests per second to a computationally expensive endpoint, despite rate limiting being configured.
Diagnosis Steps:
Analyzed API gateway logs and metrics to identify traffic patterns.
Examined rate limiting configuration and request headers.
Profiled API gateway performance during the incident.
Traced requests from the problematic client through the system.
Reviewed recent configuration changes to the API gateway.
Root Cause:
The client was able to bypass rate limiting by manipulating request headers. The rate limiting plugin was configured to use the X-Forwarded-For header for client identification, but the gateway was not validating or overwriting this header, allowing the client to spoof different IP addresses in each request.
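The gap can be illustrated with a minimal sketch (the trusted CIDRs and header handling are illustrative, not the gateway's actual configuration): the forwarded header is honored only when the directly connected peer is a trusted proxy, and only the right-most untrusted hop is used, so a client cannot spoof its rate-limiting identity.
// client_ip.go - sketch of trusted-proxy-aware client IP extraction
package main

import (
	"fmt"
	"net"
	"net/http"
	"strings"
)

var trustedProxies = mustParseCIDRs([]string{"10.0.0.0/8", "192.168.0.0/16"}) // illustrative

func mustParseCIDRs(cidrs []string) []*net.IPNet {
	var nets []*net.IPNet
	for _, c := range cidrs {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			panic(err)
		}
		nets = append(nets, n)
	}
	return nets
}

func isTrusted(ip net.IP) bool {
	for _, n := range trustedProxies {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

// clientIP returns the address to key rate limiting on. X-Forwarded-For is
// only consulted when the request arrived from a trusted proxy.
func clientIP(r *http.Request) string {
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		host = r.RemoteAddr
	}
	peer := net.ParseIP(host)
	if peer == nil || !isTrusted(peer) {
		return host // untrusted peer: ignore X-Forwarded-For entirely
	}
	hops := strings.Split(r.Header.Get("X-Forwarded-For"), ",")
	for i := len(hops) - 1; i >= 0; i-- {
		ip := net.ParseIP(strings.TrimSpace(hops[i]))
		if ip != nil && !isTrusted(ip) {
			return ip.String() // right-most address not added by our own proxies
		}
	}
	return host
}

func main() {
	r, _ := http.NewRequest("GET", "/expensive", nil)
	r.RemoteAddr = "203.0.113.7:40000"
	r.Header.Set("X-Forwarded-For", "1.2.3.4") // spoofed by the client
	fmt.Println(clientIP(r))                   // 203.0.113.7: the spoofed header is ignored
}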
Fix/Workaround:
• Short-term: Implemented immediate header validation and IP blocking:
# Kong API Gateway configuration patch
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: ip-restriction
plugin: ip-restriction
config:
  deny:
    - 203.0.113.0/24  # Malicious client IP range
• Reconfigured rate limiting to use multiple identifiers:
# Before: Vulnerable rate limiting configuration
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limiting
plugin: rate-limiting
config:
  minute: 60
  hour: 1000
  limit_by: ip
  policy: local
---
# After: Improved rate limiting configuration
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limiting
plugin: rate-limiting
config:
  minute: 60
  hour: 1000
  limit_by: consumer
  policy: redis
  redis_host: redis.infrastructure
  redis_port: 6379
  redis_timeout: 2000
  redis_database: 0
  hide_client_headers: false
• Long-term: Implemented a comprehensive API security strategy:
-- Custom Kong plugin for advanced request validation
-- save as kong/plugins/advanced-request-validator/handler.lua
local BasePlugin = require "kong.plugins.base_plugin"
local iputils = require "resty.iputils"
local jwt_decoder = require "kong.plugins.jwt.jwt_parser"
local AdvancedRequestValidator = BasePlugin:extend()
AdvancedRequestValidator.PRIORITY = 1005
AdvancedRequestValidator.VERSION = "1.0.0"
function AdvancedRequestValidator:new()
AdvancedRequestValidator.super.new(self, "advanced-request-validator")
end
function AdvancedRequestValidator:access(conf)
AdvancedRequestValidator.super.access(self)
local request_headers = kong.request.get_headers()
local client_ip = kong.client.get_forwarded_ip()
local request_path = kong.request.get_path()
local request_method = kong.request.get_method()
-- 1. Validate and sanitize headers
if request_headers["x-forwarded-for"] then
-- Overwrite with trusted value from Kong
kong.service.request.set_header("X-Forwarded-For", client_ip)
end
-- 2. Check for suspicious patterns
local user_agent = request_headers["user-agent"]
if not user_agent or user_agent == "" or string.find(user_agent:lower(), "bot") then
kong.log.warn("Suspicious user agent detected: ", user_agent)
-- Increment counter for this IP
local suspicious_count = kong.ctx.shared.suspicious_count or 0
suspicious_count = suspicious_count + 1
kong.ctx.shared.suspicious_count = suspicious_count
-- If multiple suspicious requests, add to temporary block list
if suspicious_count > conf.suspicious_threshold then
kong.log.err("Adding IP to temporary block list: ", client_ip)
-- This would typically update a distributed cache or database
end
end
-- 3. Validate JWT tokens if present
local auth_header = request_headers["authorization"]
if auth_header and auth_header:find("Bearer") == 1 then
local token = auth_header:sub(8)
local jwt, err = jwt_decoder:new(token)
if err then
kong.log.err("Invalid JWT: ", err)
return kong.response.exit(401, { message = "Invalid authentication credentials" })
end
-- Check token claims
local claims = jwt.claims
if claims.exp and claims.exp < os.time() then
return kong.response.exit(401, { message = "Token expired" })
end
-- Check if token is in deny list
-- This would typically check a distributed cache or database
end
-- 4. Apply additional rate limiting for expensive endpoints
if conf.expensive_endpoints[request_path] and request_method == "POST" then
-- Apply stricter rate limits for expensive operations
-- This would typically use a distributed counter
end
end
return AdvancedRequestValidator
• Implemented a distributed rate limiting solution with Redis:
// rate_limiter.go
package main
import (
"context"
"fmt"
"log"
"net/http"
"strconv"
"time"
"github.com/go-redis/redis/v8"
"github.com/google/uuid"
)
// RateLimiter implements a distributed rate limiter using Redis
type RateLimiter struct {
redisClient *redis.Client
keyPrefix string
windowSize time.Duration
limit int
}
// NewRateLimiter creates a new rate limiter
func NewRateLimiter(redisAddr, keyPrefix string, windowSize time.Duration, limit int) *RateLimiter {
client := redis.NewClient(&redis.Options{
Addr: redisAddr,
Password: "", // no password set
DB: 0, // use default DB
})
return &RateLimiter{
redisClient: client,
keyPrefix: keyPrefix,
windowSize: windowSize,
limit: limit,
}
}
// Allow checks if a request is allowed based on the rate limit
func (rl *RateLimiter) Allow(ctx context.Context, identifier string) (bool, int, error) {
// Create a unique key for this identifier and window
now := time.Now().UnixNano()
windowStart := now - int64(rl.windowSize)
key := fmt.Sprintf("%s:%s", rl.keyPrefix, identifier)
// Use Redis pipeline for efficiency
pipe := rl.redisClient.Pipeline()
// Remove old entries outside the current window
pipe.ZRemRangeByScore(ctx, key, "0", strconv.FormatInt(windowStart, 10))
// Add current request with score as current timestamp
requestID := uuid.New().String()
pipe.ZAdd(ctx, key, &redis.Z{Score: float64(now), Member: requestID})
// Get the count of requests in the current window
countCmd := pipe.ZCard(ctx, key)
// Set expiration on the key to clean up old data
pipe.Expire(ctx, key, rl.windowSize*2)
// Execute the pipeline
_, err := pipe.Exec(ctx)
if err != nil {
return false, 0, err
}
// Get the count of requests in the current window
count, err := countCmd.Result()
if err != nil {
return false, 0, err
}
// Check if the count exceeds the limit
return count <= int64(rl.limit), int(count), nil
}
// RateLimitMiddleware creates a middleware for rate limiting
func RateLimitMiddleware(rl *RateLimiter, identifierFunc func(*http.Request) string) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Get identifier for this request
identifier := identifierFunc(r)
// Check if request is allowed
allowed, count, err := rl.Allow(r.Context(), identifier)
if err != nil {
log.Printf("Rate limiter error: %v", err)
http.Error(w, "Internal Server Error", http.StatusInternalServerError)
return
}
// Set rate limit headers
w.Header().Set("X-RateLimit-Limit", strconv.Itoa(rl.limit))
w.Header().Set("X-RateLimit-Remaining", strconv.Itoa(rl.limit-int(count)))
w.Header().Set("X-RateLimit-Reset", strconv.FormatInt(time.Now().Add(rl.windowSize).Unix(), 10))
if !allowed {
w.Header().Set("Retry-After", strconv.Itoa(int(rl.windowSize.Seconds())))
http.Error(w, "Rate limit exceeded", http.StatusTooManyRequests)
return
}
next.ServeHTTP(w, r)
})
}
}
// GetClientIdentifier returns a function that extracts a client identifier from a request
func GetClientIdentifier(useMultipleFactors bool) func(*http.Request) string {
return func(r *http.Request) string {
if !useMultipleFactors {
// Simple IP-based identification
return r.RemoteAddr
}
// Multi-factor identification
userAgent := r.UserAgent()
authHeader := r.Header.Get("Authorization")
// Extract user ID from JWT if available
userID := "anonymous"
if authHeader != "" {
// In a real implementation, this would parse and validate the JWT
// For this example, we'll just use a placeholder
userID = "user-from-jwt"
}
// Combine factors
return fmt.Sprintf("%s:%s:%s", r.RemoteAddr, userAgent, userID)
}
}
func main() {
// Create a rate limiter with a 1-minute window and 60 requests limit
rateLimiter := NewRateLimiter("localhost:6379", "ratelimit", time.Minute, 60)
// Create a middleware that uses multiple factors for identification
middleware := RateLimitMiddleware(rateLimiter, GetClientIdentifier(true))
// Create a simple HTTP server with rate limiting
http.Handle("/api/", middleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, "Hello, you've requested: %s\n", r.URL.Path)
})))
log.Println("Starting server on :8080")
log.Fatal(http.ListenAndServe(":8080", nil))
}
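A usage sketch for the middleware above, written as a test in the same package. It assumes the rate_limiter.go code is alongside it and that a Redis instance is reachable at localhost:6379 (a hypothetical test setup); the limit of 3 is chosen only to make the 429 easy to trigger.
// rate_limiter_example_test.go - assumes rate_limiter.go and a local Redis
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

func TestRateLimitMiddleware(t *testing.T) {
	rl := NewRateLimiter("localhost:6379", "ratelimit-test", time.Minute, 3)
	handler := RateLimitMiddleware(rl, GetClientIdentifier(false))(
		http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "ok")
		}))

	// The first three requests should pass, the fourth should be limited.
	for i := 1; i <= 4; i++ {
		req := httptest.NewRequest(http.MethodGet, "/api/test", nil)
		req.RemoteAddr = "192.0.2.1:12345" // same client identity on every request
		rec := httptest.NewRecorder()
		handler.ServeHTTP(rec, req)
		if i <= 3 && rec.Code != http.StatusOK {
			t.Fatalf("request %d: expected 200, got %d", i, rec.Code)
		}
		if i == 4 && rec.Code != http.StatusTooManyRequests {
			t.Fatalf("request %d: expected 429, got %d", i, rec.Code)
		}
	}
}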
• Implemented a comprehensive API security monitoring system:
# Prometheus alerting rules for API security
groups:
  - name: api_security
    rules:
      - alert: HighRateLimitViolations
        expr: sum(rate(kong_http_status{status="429"}[5m])) by (service) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate limit violations for {{ $labels.service }}"
          description: "Service {{ $labels.service }} is experiencing high rate limit violations ({{ $value }} per second)"
      - alert: UnusualRequestPatterns
        expr: sum(rate(http_requests_total{status=~"4.."}[5m])) by (service) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual request patterns for {{ $labels.service }}"
          description: "Service {{ $labels.service }} is receiving a high number of 4xx errors ({{ $value }} per second)"
      - alert: PotentialAPIScraping
        expr: sum(rate(http_requests_total[5m])) by (client_ip) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Potential API scraping from {{ $labels.client_ip }}"
          description: "Client IP {{ $labels.client_ip }} is making a high number of requests ({{ $value }} per second)"
      - alert: APIGatewayHighCPU
        expr: avg(rate(container_cpu_usage_seconds_total{container=~"kong.*"}[5m])) by (pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API Gateway pod {{ $labels.pod }} high CPU usage"
          description: "API Gateway pod {{ $labels.pod }} is using {{ $value | humanizePercentage }} of CPU"
      - alert: APIGatewayHighMemory
        expr: avg(container_memory_usage_bytes{container=~"kong.*"} / container_spec_memory_limit_bytes{container=~"kong.*"}) by (pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API Gateway pod {{ $labels.pod }} high memory usage"
          description: "API Gateway pod {{ $labels.pod }} is using {{ $value | humanizePercentage }} of memory"
Lessons Learned:
API gateway security requires defense in depth with multiple validation layers.
How to Avoid:
Never trust client-provided headers for rate limiting or authentication.
Implement multiple identification factors for rate limiting.
Use distributed rate limiting for high-traffic APIs.
Monitor for unusual traffic patterns and implement automatic blocking.
Regularly audit API gateway configurations for security vulnerabilities.
No summary provided
What Happened:
A company's payment processing API suddenly experienced a significant performance degradation, with response times increasing from milliseconds to several seconds. The operations team observed a massive spike in traffic to specific API endpoints. Despite having rate limiting configured in the Kong API gateway, the attacker was able to bypass these controls and overwhelm the backend services.
Diagnosis Steps:
Analyzed API gateway logs to identify traffic patterns.
Examined rate limiting configuration and behavior.
Reviewed client IP addresses and request headers.
Monitored backend service performance metrics.
Analyzed the attack pattern and request signatures.
Root Cause:
The investigation revealed multiple issues with the rate limiting implementation:
1. Rate limiting was configured based solely on client IP addresses
2. The attacker used multiple proxy servers to distribute requests across many source IPs
3. The API gateway was not configured to detect and block distributed attacks
4. Rate limits were set too high for sensitive endpoints
5. The API gateway wasn't validating API keys properly for some endpoints
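Issues 2 and 3 are about fan-out rather than per-client volume: each spoofed IP stays under its own limit while the aggregate overwhelms the backend. A minimal in-memory sketch of the missing detection (the thresholds and the method+path signature are illustrative; a production version would use a shared store) counts distinct source IPs per request signature inside a short window.
// signature_tracker.go - sketch of distributed-attack detection by fan-out
package main

import (
	"fmt"
	"sync"
	"time"
)

type signatureTracker struct {
	mu     sync.Mutex
	window time.Duration
	maxIPs int
	seen   map[string]map[string]time.Time // signature -> ip -> last seen
}

func newSignatureTracker(window time.Duration, maxIPs int) *signatureTracker {
	return &signatureTracker{window: window, maxIPs: maxIPs, seen: map[string]map[string]time.Time{}}
}

// Observe records one request and reports whether the signature now looks
// like a distributed attack (too many distinct IPs within the window).
func (t *signatureTracker) Observe(signature, ip string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	now := time.Now()
	ips, ok := t.seen[signature]
	if !ok {
		ips = map[string]time.Time{}
		t.seen[signature] = ips
	}
	ips[ip] = now
	// Drop IPs that fell out of the window.
	for addr, last := range ips {
		if now.Sub(last) > t.window {
			delete(ips, addr)
		}
	}
	return len(ips) > t.maxIPs
}

func main() {
	tr := newSignatureTracker(time.Minute, 50)
	for i := 0; i < 60; i++ {
		ip := fmt.Sprintf("198.51.100.%d", i) // many proxies, one request pattern
		if tr.Observe("POST /api/v1/payments", ip) {
			fmt.Printf("distributed attack suspected after %d distinct IPs\n", i+1)
			break
		}
	}
}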
Fix/Workaround:
• Short-term: Implemented immediate mitigations:
# Before: Simple IP-based rate limiting in Kong
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limiting
  namespace: api-gateway
config:
  minute: 60
  limit_by: ip
  policy: local
  fault_tolerant: true
  hide_client_headers: false
  redis_ssl: false
  redis_ssl_verify: false
---
# After: Enhanced rate limiting with multiple identifiers
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: advanced-rate-limiting
  namespace: api-gateway
config:
  minute: 30
  hour: 500
  limit_by: credential,header,ip
  header_name: X-Forwarded-For
  path: null
  policy: redis
  fault_tolerant: true
  hide_client_headers: false
  redis_host: redis.rate-limiting
  redis_port: 6379
  redis_password: ${REDIS_PASSWORD}
  redis_timeout: 2000
  redis_database: 0
  redis_ssl: true
  redis_ssl_verify: true
• Implemented request validation to prevent malformed requests:
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: request-validator
  namespace: api-gateway
plugin: request-validator
config:
  body_schema: |
    {
      "type": "object",
      "properties": {
        "payment_id": { "type": "string", "pattern": "^[a-zA-Z0-9-_]+$" },
        "amount": { "type": "number", "minimum": 0.01 },
        "currency": { "type": "string", "enum": ["USD", "EUR", "GBP", "JPY"] },
        "description": { "type": "string", "maxLength": 255 }
      },
      "required": ["payment_id", "amount", "currency"]
    }
  verbose_response: false
  allowed_content_types:
    - application/json
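For reference, the constraints that schema enforces can also be expressed as a small Go check, useful for unit-testing the rules outside the gateway. This is a sketch, not the gateway's implementation; the struct and function names are illustrative.
// payment_validation.go - sketch mirroring the request-validator schema
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
)

type PaymentRequest struct {
	PaymentID   string  `json:"payment_id"`
	Amount      float64 `json:"amount"`
	Currency    string  `json:"currency"`
	Description string  `json:"description"`
}

var (
	paymentIDPattern  = regexp.MustCompile(`^[a-zA-Z0-9-_]+$`)
	allowedCurrencies = map[string]bool{"USD": true, "EUR": true, "GBP": true, "JPY": true}
)

func validatePayment(body []byte) (*PaymentRequest, error) {
	var p PaymentRequest
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, fmt.Errorf("malformed JSON: %w", err)
	}
	switch {
	case !paymentIDPattern.MatchString(p.PaymentID):
		return nil, errors.New("payment_id must match ^[a-zA-Z0-9-_]+$")
	case p.Amount < 0.01:
		return nil, errors.New("amount must be at least 0.01")
	case !allowedCurrencies[p.Currency]:
		return nil, errors.New("currency must be one of USD, EUR, GBP, JPY")
	case len(p.Description) > 255:
		return nil, errors.New("description must be at most 255 characters")
	}
	return &p, nil
}

func main() {
	_, err := validatePayment([]byte(`{"payment_id":"ord-42","amount":0.001,"currency":"USD"}`))
	fmt.Println(err) // amount must be at least 0.01
}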
• Long-term: Implemented a comprehensive API security solution:
// api_security.go - Advanced rate limiting and security service
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"net/http"
"os"
"strconv"
"strings"
"time"
"github.com/go-redis/redis/v8"
"github.com/gorilla/mux"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
// Configuration constants
const (
DefaultRateLimit = 30
DefaultRateLimitWindow = time.Minute
DefaultBurstLimit = 5
DefaultBlockDuration = 1 * time.Hour
SuspiciousThreshold = 3
BlockThreshold = 5
)
// Metrics
var (
requestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "api_requests_total",
Help: "Total number of API requests",
},
[]string{"path", "method", "status"},
)
requestsBlocked = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "api_requests_blocked",
Help: "Total number of blocked API requests",
},
[]string{"path", "method", "reason"},
)
rateLimitExceeded = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "api_rate_limit_exceeded",
Help: "Total number of rate limit exceeded events",
},
[]string{"path", "method", "client_id"},
)
clientBlockedCount = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "api_clients_blocked",
Help: "Number of currently blocked clients",
},
[]string{"reason"},
)
requestLatency = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "api_request_duration_seconds",
Help: "API request latency in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"path", "method"},
)
)
// ClientIdentifier contains all information used to identify a client
type ClientIdentifier struct {
ClientIP string
APIKey string
UserAgent string
XForwardedFor string
SessionID string
AccountID string
RequestSignature string
}
// RateLimitConfig defines rate limiting parameters for an endpoint
type RateLimitConfig struct {
Path string
Method string
Limit int
Window time.Duration
BurstLimit int
BlockDuration time.Duration
SensitiveEndpoint bool
}
// RateLimiter handles rate limiting logic
type RateLimiter struct {
redisClient *redis.Client
configs map[string]RateLimitConfig
defaultConfig RateLimitConfig
}
// NewRateLimiter creates a new rate limiter
func NewRateLimiter(redisAddr, redisPassword string, db int) (*RateLimiter, error) {
client := redis.NewClient(&redis.Options{
Addr: redisAddr,
Password: redisPassword,
DB: db,
})
// Test connection
ctx := context.Background()
_, err := client.Ping(ctx).Result()
if err != nil {
return nil, fmt.Errorf("failed to connect to Redis: %v", err)
}
return &RateLimiter{
redisClient: client,
configs: make(map[string]RateLimitConfig),
defaultConfig: RateLimitConfig{
Limit: DefaultRateLimit,
Window: DefaultRateLimitWindow,
BurstLimit: DefaultBurstLimit,
BlockDuration: DefaultBlockDuration,
},
}, nil
}
// AddConfig adds a rate limit configuration for a specific endpoint
func (rl *RateLimiter) AddConfig(config RateLimitConfig) {
key := fmt.Sprintf("%s:%s", config.Method, config.Path)
rl.configs[key] = config
}
// getConfig returns the rate limit configuration for a specific endpoint
func (rl *RateLimiter) getConfig(method, path string) RateLimitConfig {
key := fmt.Sprintf("%s:%s", method, path)
if config, ok := rl.configs[key]; ok {
return config
}
// Try with path pattern matching (simplified version)
for configKey, config := range rl.configs {
parts := strings.Split(configKey, ":")
if len(parts) != 2 {
continue
}
configMethod := parts[0]
configPath := parts[1]
// Skip if method doesn't match
if configMethod != method && configMethod != "*" {
continue
}
// Check if path matches pattern
if strings.Contains(configPath, "*") {
pattern := strings.Replace(configPath, "*", ".*", -1)
// In a real implementation, use proper regex matching
if strings.HasPrefix(path, strings.TrimSuffix(pattern, ".*")) {
return config
}
}
}
return rl.defaultConfig
}
// generateClientKey creates a composite key for identifying a client
func generateClientKey(identifier ClientIdentifier) string {
// Create a composite key using multiple identifiers
components := []string{
identifier.ClientIP,
identifier.APIKey,
identifier.XForwardedFor,
identifier.AccountID,
}
// Filter out empty components
var filteredComponents []string
for _, component := range components {
if component != "" {
filteredComponents = append(filteredComponents, component)
}
}
// If we have no components, use IP as fallback
if len(filteredComponents) == 0 {
return identifier.ClientIP
}
return strings.Join(filteredComponents, ":")
}
// generateRequestSignature creates a signature for the request to detect patterns
func generateRequestSignature(r *http.Request) string {
// In a real implementation, this would create a hash of request characteristics
// such as headers, query parameters, and payload structure
return fmt.Sprintf("%s:%s", r.Method, r.URL.Path)
}
// CheckRateLimit checks if a request exceeds the rate limit
func (rl *RateLimiter) CheckRateLimit(ctx context.Context, identifier ClientIdentifier, method, path string) (bool, error) {
config := rl.getConfig(method, path)
clientKey := generateClientKey(identifier)
// Check if client is blocked
blockedKey := fmt.Sprintf("blocked:%s", clientKey)
blocked, err := rl.redisClient.Exists(ctx, blockedKey).Result()
if err != nil {
return false, fmt.Errorf("failed to check if client is blocked: %v", err)
}
if blocked > 0 {
// Client is blocked
blockExpiration, err := rl.redisClient.TTL(ctx, blockedKey).Result()
if err != nil {
return false, fmt.Errorf("failed to get block expiration: %v", err)
}
log.Printf("Client %s is blocked for %v", clientKey, blockExpiration)
requestsBlocked.WithLabelValues(path, method, "client_blocked").Inc()
return false, nil
}
// Check rate limit
windowKey := fmt.Sprintf("ratelimit:%s:%s:%s:%d", clientKey, method, path, time.Now().Unix()/int64(config.Window.Seconds()))
// Increment counter
count, err := rl.redisClient.Incr(ctx, windowKey).Result()
if err != nil {
return false, fmt.Errorf("failed to increment rate limit counter: %v", err)
}
// Set expiration if this is a new key
if count == 1 {
rl.redisClient.Expire(ctx, windowKey, config.Window)
}
// Check if rate limit is exceeded
if count > int64(config.Limit) {
// Record rate limit exceeded event
rateLimitExceeded.WithLabelValues(path, method, clientKey).Inc()
// Increment suspicious activity counter
suspiciousKey := fmt.Sprintf("suspicious:%s", clientKey)
suspiciousCount, err := rl.redisClient.Incr(ctx, suspiciousKey).Result()
if err != nil {
log.Printf("Failed to increment suspicious activity counter: %v", err)
} else {
// Set expiration if this is a new key
if suspiciousCount == 1 {
rl.redisClient.Expire(ctx, suspiciousKey, 24*time.Hour)
}
// Check if client should be blocked
if suspiciousCount >= BlockThreshold {
// Block client
rl.redisClient.Set(ctx, blockedKey, "blocked", config.BlockDuration)
log.Printf("Client %s has been blocked for %v due to excessive rate limit violations", clientKey, config.BlockDuration)
clientBlockedCount.WithLabelValues("rate_limit_violations").Inc()
} else if suspiciousCount >= SuspiciousThreshold {
log.Printf("Client %s has suspicious activity (%d violations)", clientKey, suspiciousCount)
}
}
return false, nil
}
// Check for distributed attacks by analyzing patterns across clients
if config.SensitiveEndpoint {
// Track request signature
signatureKey := fmt.Sprintf("signature:%s:%d", identifier.RequestSignature, time.Now().Unix()/60)
signatureCount, err := rl.redisClient.Incr(ctx, signatureKey).Result()
if err != nil {
log.Printf("Failed to track request signature: %v", err)
} else {
// Set expiration if this is a new key
if signatureCount == 1 {
rl.redisClient.Expire(ctx, signatureKey, 10*time.Minute)
}
// Check for distributed attack patterns
if signatureCount > int64(config.Limit*3) {
log.Printf("Potential distributed attack detected with signature %s (%d requests)", identifier.RequestSignature, signatureCount)
requestsBlocked.WithLabelValues(path, method, "distributed_attack").Inc()
return false, nil
}
}
}
return true, nil
}
// RateLimitMiddleware is a middleware that applies rate limiting
func (rl *RateLimiter) RateLimitMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
ctx := r.Context()
// Extract client identifiers
identifier := ClientIdentifier{
ClientIP: r.RemoteAddr,
APIKey: r.Header.Get("X-API-Key"),
UserAgent: r.Header.Get("User-Agent"),
XForwardedFor: r.Header.Get("X-Forwarded-For"),
SessionID: r.Header.Get("X-Session-ID"),
AccountID: r.Header.Get("X-Account-ID"),
RequestSignature: generateRequestSignature(r),
}
// Check rate limit
allowed, err := rl.CheckRateLimit(ctx, identifier, r.Method, r.URL.Path)
if err != nil {
log.Printf("Rate limit check failed: %v", err)
http.Error(w, "Internal Server Error", http.StatusInternalServerError)
requestsTotal.WithLabelValues(r.URL.Path, r.Method, strconv.Itoa(http.StatusInternalServerError)).Inc()
return
}
if !allowed {
http.Error(w, "Rate limit exceeded", http.StatusTooManyRequests)
requestsTotal.WithLabelValues(r.URL.Path, r.Method, strconv.Itoa(http.StatusTooManyRequests)).Inc()
return
}
// Call the next handler
next.ServeHTTP(w, r)
// Record metrics
duration := time.Since(start).Seconds()
requestLatency.WithLabelValues(r.URL.Path, r.Method).Observe(duration)
})
}
func main() {
// Initialize rate limiter
redisAddr := os.Getenv("REDIS_ADDR")
if redisAddr == "" {
redisAddr = "localhost:6379"
}
redisPassword := os.Getenv("REDIS_PASSWORD")
redisDB := 0
if dbStr := os.Getenv("REDIS_DB"); dbStr != "" {
var err error
redisDB, err = strconv.Atoi(dbStr)
if err != nil {
log.Fatalf("Invalid REDIS_DB value: %v", err)
}
}
rateLimiter, err := NewRateLimiter(redisAddr, redisPassword, redisDB)
if err != nil {
log.Fatalf("Failed to initialize rate limiter: %v", err)
}
// Configure rate limits for different endpoints
rateLimiter.AddConfig(RateLimitConfig{
Path: "/api/v1/payments",
Method: "POST",
Limit: 10,
Window: time.Minute,
BurstLimit: 2,
BlockDuration: 2 * time.Hour,
SensitiveEndpoint: true,
})
rateLimiter.AddConfig(RateLimitConfig{
Path: "/api/v1/accounts/*",
Method: "GET",
Limit: 100,
Window: time.Minute,
BurstLimit: 10,
BlockDuration: 1 * time.Hour,
SensitiveEndpoint: true,
})
rateLimiter.AddConfig(RateLimitConfig{
Path: "/api/v1/products",
Method: "GET",
Limit: 300,
Window: time.Minute,
BurstLimit: 30,
BlockDuration: 30 * time.Minute,
SensitiveEndpoint: false,
})
// Create router
router := mux.NewRouter()
// Add metrics endpoint
router.Path("/metrics").Handler(promhttp.Handler())
// Add API endpoints
apiRouter := router.PathPrefix("/api/v1").Subrouter()
apiRouter.Use(rateLimiter.RateLimitMiddleware)
// Example API endpoints
apiRouter.HandleFunc("/payments", func(w http.ResponseWriter, r *http.Request) {
// Process payment
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]string{"status": "success"})
requestsTotal.WithLabelValues(r.URL.Path, r.Method, "200").Inc()
}).Methods("POST")
apiRouter.HandleFunc("/accounts/{id}", func(w http.ResponseWriter, r *http.Request) {
// Get account details
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]string{"account_id": "123", "status": "active"})
requestsTotal.WithLabelValues(r.URL.Path, r.Method, "200").Inc()
}).Methods("GET")
apiRouter.HandleFunc("/products", func(w http.ResponseWriter, r *http.Request) {
// Get products
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode([]map[string]string{
{"id": "1", "name": "Product 1"},
{"id": "2", "name": "Product 2"},
})
requestsTotal.WithLabelValues(r.URL.Path, r.Method, "200").Inc()
}).Methods("GET")
// Start server
port := os.Getenv("PORT")
if port == "" {
port = "8080"
}
log.Printf("Starting server on :%s", port)
log.Fatal(http.ListenAndServe(":"+port, router))
}
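A quick way to exercise the per-endpoint limits configured above is a throwaway client loop, run as a separate program. It assumes the service is listening on localhost:8080 with Redis available, and that HTTP keep-alive reuses a single connection so the composite client key (which includes RemoteAddr) stays stable; the API key header is illustrative.
// payments_client_example.go - throwaway client to trip the payments limit
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	client := &http.Client{}
	for i := 1; i <= 12; i++ {
		req, err := http.NewRequest(http.MethodPost, "http://localhost:8080/api/v1/payments",
			bytes.NewBufferString(`{"amount": 10.0}`))
		if err != nil {
			log.Fatal(err)
		}
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("X-API-Key", "demo-key") // keeps part of the composite key stable
		resp, err := client.Do(req)
		if err != nil {
			log.Fatalf("request %d failed: %v", i, err)
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection is reused
		resp.Body.Close()
		fmt.Printf("request %2d -> %d\n", i, resp.StatusCode)
		if resp.StatusCode == http.StatusTooManyRequests {
			fmt.Println("hit the 10/minute limit configured for POST /api/v1/payments")
			break
		}
	}
}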
• Implemented a comprehensive API security monitoring dashboard:
// api_security_dashboard.tsx
import React, { useState, useEffect } from 'react';
import {
LineChart, Line, BarChart, Bar, PieChart, Pie,
XAxis, YAxis, CartesianGrid, Tooltip, Legend,
ResponsiveContainer, Cell
} from 'recharts';
import {
  Card, CardContent, Typography, Grid,
  Select, MenuItem, FormControl, InputLabel,
  Button, Tabs, Tab, Box, Table, TableBody,
  TableCell, TableContainer, TableHead, TableRow,
  Paper, Chip, TextField
} from '@material-ui/core';
import { DatePicker } from '@material-ui/pickers';
interface ApiMetrics {
timestamp: string;
requestsTotal: number;
requestsBlocked: number;
rateLimitExceeded: number;
averageLatency: number;
p95Latency: number;
p99Latency: number;
clientsBlocked: number;
suspiciousActivities: number;
}
interface EndpointMetrics {
path: string;
method: string;
requestsTotal: number;
requestsBlocked: number;
rateLimitExceeded: number;
averageLatency: number;
errorRate: number;
}
interface BlockedClient {
clientId: string;
blockedSince: string;
blockedUntil: string;
reason: string;
violationCount: number;
endpoints: string[];
}
interface SecurityEvent {
timestamp: string;
eventType: string;
clientId: string;
path: string;
method: string;
description: string;
severity: 'low' | 'medium' | 'high' | 'critical';
}
const COLORS = ['#0088FE', '#00C49F', '#FFBB28', '#FF8042', '#8884D8'];
const ApiSecurityDashboard: React.FC = () => {
const [timeRange, setTimeRange] = useState<string>('24h');
const [startDate, setStartDate] = useState<Date | null>(null);
const [endDate, setEndDate] = useState<Date | null>(null);
const [apiMetrics, setApiMetrics] = useState<ApiMetrics[]>([]);
const [endpointMetrics, setEndpointMetrics] = useState<EndpointMetrics[]>([]);
const [blockedClients, setBlockedClients] = useState<BlockedClient[]>([]);
const [securityEvents, setSecurityEvents] = useState<SecurityEvent[]>([]);
const [tabValue, setTabValue] = useState(0);
useEffect(() => {
// Fetch initial data
fetchMetrics();
}, []);
useEffect(() => {
// Fetch data when filters change
fetchMetrics();
}, [timeRange, startDate, endDate]);
const fetchMetrics = async () => {
// In a real implementation, this would call an API with the selected filters
// For this example, we'll use mock data
// Mock API metrics
const mockApiMetrics: ApiMetrics[] = Array.from({ length: 24 }, (_, i) => {
const date = new Date();
date.setHours(date.getHours() - (23 - i));
return {
timestamp: date.toISOString(),
requestsTotal: 1000 + Math.floor(Math.random() * 500),
requestsBlocked: Math.floor(Math.random() * 50),
rateLimitExceeded: Math.floor(Math.random() * 30),
averageLatency: 50 + Math.random() * 30,
p95Latency: 100 + Math.random() * 50,
p99Latency: 200 + Math.random() * 100,
clientsBlocked: Math.floor(Math.random() * 5),
suspiciousActivities: Math.floor(Math.random() * 10)
};
});
// Mock endpoint metrics
const mockEndpointMetrics: EndpointMetrics[] = [
{
path: "/api/v1/payments",
method: "POST",
requestsTotal: 5432,
requestsBlocked: 123,
rateLimitExceeded: 87,
averageLatency: 78.5,
errorRate: 2.3
},
{
path: "/api/v1/accounts/{id}",
method: "GET",
requestsTotal: 12543,
requestsBlocked: 234,
rateLimitExceeded: 156,
averageLatency: 45.2,
errorRate: 1.8
},
{
path: "/api/v1/products",
method: "GET",
requestsTotal: 28765,
requestsBlocked: 89,
rateLimitExceeded: 45,
averageLatency: 32.1,
errorRate: 0.7
},
{
path: "/api/v1/orders",
method: "POST",
requestsTotal: 3421,
requestsBlocked: 67,
rateLimitExceeded: 42,
averageLatency: 85.3,
errorRate: 1.5
},
{
path: "/api/v1/users/{id}",
method: "GET",
requestsTotal: 8765,
requestsBlocked: 45,
rateLimitExceeded: 23,
averageLatency: 38.7,
errorRate: 0.9
}
];
// Mock blocked clients
const mockBlockedClients: BlockedClient[] = [
{
clientId: "192.168.1.100:api_key_123",
blockedSince: "2023-05-01T10:23:45Z",
blockedUntil: "2023-05-01T12:23:45Z",
reason: "Rate limit exceeded",
violationCount: 12,
endpoints: ["/api/v1/payments", "/api/v1/accounts/{id}"]
},
{
clientId: "192.168.1.101:api_key_456",
blockedSince: "2023-05-01T11:15:22Z",
blockedUntil: "2023-05-01T13:15:22Z",
reason: "Suspicious activity",
violationCount: 8,
endpoints: ["/api/v1/payments"]
},
{
clientId: "192.168.1.102:api_key_789",
blockedSince: "2023-05-01T09:45:12Z",
blockedUntil: "2023-05-01T11:45:12Z",
reason: "Distributed attack",
violationCount: 15,
endpoints: ["/api/v1/accounts/{id}", "/api/v1/users/{id}"]
}
];
// Mock security events
const mockSecurityEvents: SecurityEvent[] = [
{
timestamp: "2023-05-01T10:23:45Z",
eventType: "RATE_LIMIT_EXCEEDED",
clientId: "192.168.1.100:api_key_123",
path: "/api/v1/payments",
method: "POST",
description: "Client exceeded rate limit (10 requests/minute)",
severity: "medium"
},
{
timestamp: "2023-05-01T11:15:22Z",
eventType: "SUSPICIOUS_ACTIVITY",
clientId: "192.168.1.101:api_key_456",
path: "/api/v1/payments",
method: "POST",
description: "Multiple failed payment attempts with invalid data",
severity: "high"
},
{
timestamp: "2023-05-01T09:45:12Z",
eventType: "DISTRIBUTED_ATTACK",
clientId: "192.168.1.102:api_key_789",
path: "/api/v1/accounts/{id}",
method: "GET",
description: "Distributed attack detected from multiple IPs with same request pattern",
severity: "critical"
},
{
timestamp: "2023-05-01T08:32:18Z",
eventType: "INVALID_API_KEY",
clientId: "192.168.1.103",
path: "/api/v1/orders",
method: "POST",
description: "Multiple requests with invalid API keys",
severity: "low"
},
{
timestamp: "2023-05-01T12:05:33Z",
eventType: "PAYLOAD_ATTACK",
clientId: "192.168.1.104:api_key_321",
path: "/api/v1/users/{id}",
method: "PUT",
description: "Potential SQL injection attempt in request payload",
severity: "critical"
}
];
setApiMetrics(mockApiMetrics);
setEndpointMetrics(mockEndpointMetrics);
setBlockedClients(mockBlockedClients);
setSecurityEvents(mockSecurityEvents);
};
const handleTimeRangeChange = (event: React.ChangeEvent<{ value: unknown }>) => {
setTimeRange(event.target.value as string);
};
const handleTabChange = (event: React.ChangeEvent<{}>, newValue: number) => {
setTabValue(newValue);
};
const renderOverviewTab = () => (
<Grid container spacing={3}>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">API Requests</Typography>
<ResponsiveContainer width="100%" height={300}>
<LineChart data={apiMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis
dataKey="timestamp"
tickFormatter={(timestamp) => new Date(timestamp).toLocaleTimeString()}
/>
<YAxis />
<Tooltip
labelFormatter={(timestamp) => new Date(timestamp).toLocaleString()}
/>
<Legend />
<Line type="monotone" dataKey="requestsTotal" name="Total Requests" stroke="#8884d8" />
<Line type="monotone" dataKey="requestsBlocked" name="Blocked Requests" stroke="#ff8042" />
</LineChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">API Latency</Typography>
<ResponsiveContainer width="100%" height={300}>
<LineChart data={apiMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis
dataKey="timestamp"
tickFormatter={(timestamp) => new Date(timestamp).toLocaleTimeString()}
/>
<YAxis />
<Tooltip
labelFormatter={(timestamp) => new Date(timestamp).toLocaleString()}
/>
<Legend />
<Line type="monotone" dataKey="averageLatency" name="Avg Latency (ms)" stroke="#8884d8" />
<Line type="monotone" dataKey="p95Latency" name="P95 Latency (ms)" stroke="#82ca9d" />
<Line type="monotone" dataKey="p99Latency" name="P99 Latency (ms)" stroke="#ff8042" />
</LineChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">Rate Limiting</Typography>
<ResponsiveContainer width="100%" height={300}>
<LineChart data={apiMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis
dataKey="timestamp"
tickFormatter={(timestamp) => new Date(timestamp).toLocaleTimeString()}
/>
<YAxis />
<Tooltip
labelFormatter={(timestamp) => new Date(timestamp).toLocaleString()}
/>
<Legend />
<Line type="monotone" dataKey="rateLimitExceeded" name="Rate Limit Exceeded" stroke="#8884d8" />
<Line type="monotone" dataKey="clientsBlocked" name="Clients Blocked" stroke="#ff8042" />
</LineChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">Top Endpoints by Traffic</Typography>
<ResponsiveContainer width="100%" height={300}>
<BarChart data={endpointMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="path" />
<YAxis />
<Tooltip />
<Legend />
<Bar dataKey="requestsTotal" name="Total Requests" fill="#8884d8" />
</BarChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
</Grid>
);
const renderEndpointsTab = () => (
<Grid container spacing={3}>
<Grid item xs={12}>
<TableContainer component={Paper}>
<Table>
<TableHead>
<TableRow>
<TableCell>Endpoint</TableCell>
<TableCell>Method</TableCell>
<TableCell align="right">Total Requests</TableCell>
<TableCell align="right">Blocked Requests</TableCell>
<TableCell align="right">Rate Limit Exceeded</TableCell>
<TableCell align="right">Avg Latency (ms)</TableCell>
<TableCell align="right">Error Rate (%)</TableCell>
</TableRow>
</TableHead>
<TableBody>
{endpointMetrics.map((endpoint) => (
<TableRow key={`${endpoint.method}-${endpoint.path}`}>
<TableCell>{endpoint.path}</TableCell>
<TableCell>
<Chip
label={endpoint.method}
color={
endpoint.method === "GET" ? "primary" :
endpoint.method === "POST" ? "secondary" :
"default"
}
size="small"
/>
</TableCell>
<TableCell align="right">{endpoint.requestsTotal.toLocaleString()}</TableCell>
<TableCell align="right">{endpoint.requestsBlocked.toLocaleString()}</TableCell>
<TableCell align="right">{endpoint.rateLimitExceeded.toLocaleString()}</TableCell>
<TableCell align="right">{endpoint.averageLatency.toFixed(1)}</TableCell>
<TableCell align="right">{endpoint.errorRate.toFixed(1)}%</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</TableContainer>
</Grid>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">Blocked Requests by Endpoint</Typography>
<ResponsiveContainer width="100%" height={300}>
<BarChart data={endpointMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="path" />
<YAxis />
<Tooltip />
<Legend />
<Bar dataKey="requestsBlocked" name="Blocked Requests" fill="#ff8042" />
</BarChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
<Grid item xs={12} md={6}>
<Card>
<CardContent>
<Typography variant="h6">Rate Limit Exceeded by Endpoint</Typography>
<ResponsiveContainer width="100%" height={300}>
<BarChart data={endpointMetrics}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="path" />
<YAxis />
<Tooltip />
<Legend />
<Bar dataKey="rateLimitExceeded" name="Rate Limit Exceeded" fill="#8884d8" />
</BarChart>
</ResponsiveContainer>
</CardContent>
</Card>
</Grid>
</Grid>
);
const renderSecurityTab = () => (
<Grid container spacing={3}>
<Grid item xs={12}>
<Typography variant="h6">Blocked Clients</Typography>
<TableContainer component={Paper}>
<Table>
<TableHead>
<TableRow>
<TableCell>Client ID</TableCell>
<TableCell>Blocked Since</TableCell>
<TableCell>Blocked Until</TableCell>
<TableCell>Reason</TableCell>
<TableCell align="right">Violation Count</TableCell>
<TableCell>Affected Endpoints</TableCell>
</TableRow>
</TableHead>
<TableBody>
{blockedClients.map((client) => (
<TableRow key={client.clientId}>
<TableCell>{client.clientId}</TableCell>
<TableCell>{new Date(client.blockedSince).toLocaleString()}</TableCell>
<TableCell>{new Date(client.blockedUntil).toLocaleString()}</TableCell>
<TableCell>{client.reason}</TableCell>
<TableCell align="right">{client.violationCount}</TableCell>
<TableCell>
{client.endpoints.map((endpoint) => (
<Chip
key={endpoint}
label={endpoint}
size="small"
style={{ margin: 2 }}
/>
))}
</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</TableContainer>
</Grid>
<Grid item xs={12} style={{ marginTop: 20 }}>
<Typography variant="h6">Security Events</Typography>
<TableContainer component={Paper}>
<Table>
<TableHead>
<TableRow>
<TableCell>Timestamp</TableCell>
<TableCell>Event Type</TableCell>
<TableCell>Client ID</TableCell>
<TableCell>Endpoint</TableCell>
<TableCell>Method</TableCell>
<TableCell>Description</TableCell>
<TableCell>Severity</TableCell>
</TableRow>
</TableHead>
<TableBody>
{securityEvents.map((event, index) => (
<TableRow key={index}>
<TableCell>{new Date(event.timestamp).toLocaleString()}</TableCell>
<TableCell>{event.eventType}</TableCell>
<TableCell>{event.clientId}</TableCell>
<TableCell>{event.path}</TableCell>
<TableCell>
<Chip
label={event.method}
color={
event.method === "GET" ? "primary" :
event.method === "POST" ? "secondary" :
"default"
}
size="small"
/>
</TableCell>
<TableCell>{event.description}</TableCell>
<TableCell>
<Chip
label={event.severity}
color={
event.severity === "low" ? "default" :
event.severity === "medium" ? "primary" :
event.severity === "high" ? "secondary" :
"error"
}
size="small"
/>
</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</TableContainer>
</Grid>
</Grid>
);
return (
<div>
<Typography variant="h4" gutterBottom>
API Security Dashboard
</Typography>
<Grid container spacing={3} style={{ marginBottom: 20 }}>
<Grid item xs={12} md={3}>
<FormControl fullWidth>
<InputLabel>Time Range</InputLabel>
<Select value={timeRange} onChange={handleTimeRangeChange}>
<MenuItem value="1h">Last Hour</MenuItem>
<MenuItem value="6h">Last 6 Hours</MenuItem>
<MenuItem value="24h">Last 24 Hours</MenuItem>
<MenuItem value="7d">Last 7 Days</MenuItem>
<MenuItem value="30d">Last 30 Days</MenuItem>
<MenuItem value="custom">Custom Range</MenuItem>
</Select>
</FormControl>
</Grid>
{timeRange === 'custom' && (
<>
<Grid item xs={12} md={3}>
<DatePicker
label="Start Date"
value={startDate}
onChange={setStartDate}
renderInput={(props) => <TextField {...props} fullWidth />}
/>
</Grid>
<Grid item xs={12} md={3}>
<DatePicker
label="End Date"
value={endDate}
onChange={setEndDate}
renderInput={(props) => <TextField {...props} fullWidth />}
/>
</Grid>
</>
)}
<Grid item xs={12} md={3}>
<Button variant="contained" color="primary" fullWidth onClick={fetchMetrics}>
Refresh Data
</Button>
</Grid>
</Grid>
<Tabs value={tabValue} onChange={handleTabChange} aria-label="api security tabs">
<Tab label="Overview" />
<Tab label="Endpoints" />
<Tab label="Security" />
</Tabs>
<Box mt={3}>
{tabValue === 0 && renderOverviewTab()}
{tabValue === 1 && renderEndpointsTab()}
{tabValue === 2 && renderSecurityTab()}
</Box>
</div>
);
};
export default ApiSecurityDashboard;
Lessons Learned:
Effective API security requires a multi-layered approach beyond simple rate limiting.
How to Avoid:
Implement rate limiting based on multiple client identifiers.
Use distributed rate limiting with a shared data store.
Configure different rate limits for different endpoints based on sensitivity.
Implement request validation to prevent malformed requests.
Monitor for distributed attack patterns across multiple clients.
No summary provided
What Happened:
During a marketing campaign, the company's backend services experienced severe performance degradation despite having rate limiting configured in the API gateway. The incident began when response times for all API endpoints increased dramatically, eventually leading to 503 errors for many users. The operations team initially suspected a DDoS attack but later discovered that legitimate traffic from specific clients was overwhelming the backend services.
Diagnosis Steps:
Analyzed API gateway logs to identify traffic patterns.
Examined rate limiting configurations across all routes.
Reviewed client request headers and authentication methods.
Monitored backend service resource utilization.
Tested rate limiting with different client configurations.
Root Cause:
The investigation revealed multiple issues:
1. Rate limiting was configured based only on client IP addresses
2. Many users were accessing the API through corporate proxies, appearing as a single IP
3. The API gateway was not configured to use API keys or tokens for rate limiting
4. Some internal services were whitelisted from rate limiting entirely
5. The rate limiting plugin configuration had inconsistencies across different routes
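The emergency fix below keys limits on the client's credential first and only falls back to the source IP, so users behind one corporate proxy no longer share a single bucket. A minimal sketch of that key derivation (the header name and key scheme are illustrative, not the gateway's configuration):
// limit_key.go - sketch of credential-first rate-limit key derivation
package main

import (
	"fmt"
	"net"
	"net/http"
)

func rateLimitKey(r *http.Request) string {
	if key := r.Header.Get("apikey"); key != "" {
		return "key:" + key // authenticated clients get their own bucket
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		host = r.RemoteAddr
	}
	return "ip:" + host // anonymous traffic is still limited per source IP
}

func main() {
	a, _ := http.NewRequest("GET", "/api/v1/users", nil)
	a.RemoteAddr = "198.51.100.10:55000" // corporate proxy address
	a.Header.Set("apikey", "mobile-app-key")

	b, _ := http.NewRequest("GET", "/api/v1/users", nil)
	b.RemoteAddr = "198.51.100.10:55001" // same proxy, different user, no key

	fmt.Println(rateLimitKey(a)) // key:mobile-app-key
	fmt.Println(rateLimitKey(b)) // ip:198.51.100.10
}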
Fix/Workaround:
• Short-term: Implemented emergency rate limiting based on both IP and authentication tokens
• Adjusted backend service scaling parameters to handle the increased load
• Created a comprehensive Kong API Gateway configuration with proper rate limiting:
# kong.yaml - Proper rate limiting configuration
_format_version: "2.1"
_transform: true
services:
  - name: user-service
    url: http://user-service.internal:8080
    plugins:
      - name: rate-limiting
        config:
          minute: 60
          hour: 1000
          day: 10000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
          identifier: consumer
          sync_rate: -1
          namespace: user-service
    routes:
      - name: user-api
        paths:
          - /api/v1/users
        strip_path: false
        preserve_host: true
        protocols:
          - http
          - https
  - name: product-service
    url: http://product-service.internal:8080
    plugins:
      - name: rate-limiting
        config:
          minute: 120
          hour: 2000
          day: 20000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
          identifier: consumer
          sync_rate: -1
          namespace: product-service
    routes:
      - name: product-api
        paths:
          - /api/v1/products
        strip_path: false
        preserve_host: true
        protocols:
          - http
          - https
consumers:
  - username: mobile-app
    custom_id: mobile-app-client
    plugins:
      - name: rate-limiting
        config:
          minute: 30
          hour: 500
          day: 5000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
  - username: web-app
    custom_id: web-app-client
    plugins:
      - name: rate-limiting
        config:
          minute: 60
          hour: 1000
          day: 10000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
  - username: partner-api
    custom_id: partner-api-client
    plugins:
      - name: rate-limiting
        config:
          minute: 120
          hour: 2000
          day: 20000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
  - username: internal-service
    custom_id: internal-service-client
    plugins:
      - name: rate-limiting
        config:
          minute: 600
          hour: 10000
          day: 100000
          policy: redis
          redis_host: redis.internal
          redis_port: 6379
          redis_timeout: 2000
          redis_database: 0
          hide_client_headers: false
plugins:
  - name: key-auth
    config:
      key_names:
        - apikey
      hide_credentials: true
  - name: cors
    config:
      origins:
        - "*"
      methods:
        - GET
        - POST
        - PUT
        - DELETE
        - OPTIONS
      headers:
        - Accept
        - Accept-Version
        - Content-Length
        - Content-MD5
        - Content-Type
        - Date
        - X-Auth-Token
      exposed_headers:
        - X-Auth-Token
      credentials: true
      max_age: 3600
      preflight_continue: false
  - name: prometheus
    config:
      status_code_metrics: true
      latency_metrics: true
      upstream_health_metrics: true
      bandwidth_metrics: true
  - name: request-transformer
    config:
      add:
        headers:
          - X-Request-ID:$(uuid)
• Implemented a custom rate limiting plugin in Lua for more advanced scenarios:
-- advanced-rate-limiting.lua
local redis = require "resty.redis"
local cjson = require "cjson"
local timestamp = require "kong.tools.timestamp"
local kong = kong
local AdvancedRateLimiting = {}
AdvancedRateLimiting.PRIORITY = 901
AdvancedRateLimiting.VERSION = "1.0.0"
local EMPTY = {}
local EXPIRATIONS = {
second = 1,
minute = 60,
hour = 3600,
day = 86400,
month = 2592000,
year = 31536000,
}
local function get_identifier(conf)
local identifier
-- Use consumer id if available
if conf.identifier == "consumer" then
identifier = (kong.client.get_consumer() or EMPTY).id
if not identifier and conf.fallback_to_ip then
identifier = kong.client.get_forwarded_ip()
end
-- Use credential id if available
elseif conf.identifier == "credential" then
local credential = kong.client.get_credential()
identifier = credential and credential.id
if not identifier and conf.fallback_to_ip then
identifier = kong.client.get_forwarded_ip()
end
-- Use custom header if specified
elseif conf.identifier == "header" then
identifier = kong.request.get_header(conf.header_name)
if not identifier and conf.fallback_to_ip then
identifier = kong.client.get_forwarded_ip()
end
-- Default to IP address
else
identifier = kong.client.get_forwarded_ip()
end
return identifier
end
local function get_usage(conf, identifier, current_timestamp, limits)
local usage = {}
local stop_on_error = conf.fault_tolerant ~= true
-- Connect to Redis
local red = redis:new()
red:set_timeout(conf.redis_timeout)
local ok, err = red:connect(conf.redis_host, conf.redis_port)
if not ok then
kong.log.err("failed to connect to Redis: ", err)
return nil, err
end
if conf.redis_password and conf.redis_password ~= "" then
local ok, err = red:auth(conf.redis_password)
if not ok then
kong.log.err("failed to authenticate with Redis: ", err)
return nil, err
end
end
if conf.redis_database ~= 0 then
local ok, err = red:select(conf.redis_database)
if not ok then
kong.log.err("failed to select Redis database: ", err)
return nil, err
end
end
-- Check each limit
for period, limit in pairs(limits) do
local cache_key = "ratelimit:" .. identifier .. ":" .. period .. ":" .. conf.namespace
local current_usage, err = red:get(cache_key)
if err then
kong.log.err("failed to get current usage: ", err)
if stop_on_error then
return nil, err
end
usage[period] = {limit = limit, remaining = 0}
end
-- If no usage found, initialize it
if not current_usage then
current_usage = 0
end
-- Calculate remaining
local remaining = math.max(0, limit - tonumber(current_usage))
-- Add to usage table
usage[period] = {
limit = limit,
remaining = remaining,
usage = tonumber(current_usage),
}
end
-- Put Redis connection back to pool
local ok, err = red:set_keepalive(10000, 100)
if not ok then
kong.log.err("failed to set Redis keepalive: ", err)
end
return usage
end
local function increment_usage(conf, identifier, current_timestamp, limits, delta)
local stop_on_error = conf.fault_tolerant ~= true
-- Connect to Redis
local red = redis:new()
red:set_timeout(conf.redis_timeout)
local ok, err = red:connect(conf.redis_host, conf.redis_port)
if not ok then
kong.log.err("failed to connect to Redis: ", err)
return nil, err
end
if conf.redis_password and conf.redis_password ~= "" then
local ok, err = red:auth(conf.redis_password)
if not ok then
kong.log.err("failed to authenticate with Redis: ", err)
return nil, err
end
end
if conf.redis_database ~= 0 then
local ok, err = red:select(conf.redis_database)
if not ok then
kong.log.err("failed to select Redis database: ", err)
return nil, err
end
end
-- Start Redis pipeline
red:init_pipeline()
-- Increment each limit
for period, limit in pairs(limits) do
local cache_key = "ratelimit:" .. identifier .. ":" .. period .. ":" .. conf.namespace
local expiration = EXPIRATIONS[period]
red:incrby(cache_key, delta)
red:expire(cache_key, expiration)
end
-- Execute pipeline
local _, err = red:commit_pipeline()
if err then
kong.log.err("failed to commit Redis pipeline: ", err)
if stop_on_error then
return nil, err
end
end
-- Put Redis connection back to pool
local ok, err = red:set_keepalive(10000, 100)
if not ok then
kong.log.err("failed to set Redis keepalive: ", err)
end
return true
end
function AdvanacedRateLimiting:access(conf)
-- Get current timestamp
local current_timestamp = timestamp.get_utc()
-- Get identifier based on configuration
local identifier = get_identifier(conf)
if not identifier then
kong.log.err("cannot identify the client, rate limiting skipped")
return
end
-- Get request path and method
local path = kong.request.get_path()
local method = kong.request.get_method()
-- Get request size
local request_size = tonumber(kong.request.get_header("content-length")) or 0
-- Calculate rate limiting weight based on request size
local weight = 1
if conf.weight_by_size and request_size > 0 then
weight = math.ceil(request_size / 1024) -- 1 unit per KB
end
-- Apply path-specific limits if configured
local limits = {}
local path_matched = false
if conf.path_limits then
for _, path_limit in ipairs(conf.path_limits) do
if path:match(path_limit.path) and (path_limit.method == "*" or path_limit.method == method) then
limits = path_limit.limits
path_matched = true
break
end
end
end
-- Fall back to default limits if no path match
if not path_matched then
limits = {
second = conf.second,
minute = conf.minute,
hour = conf.hour,
day = conf.day,
month = conf.month,
year = conf.year,
}
end
-- Remove empty limits
for k, v in pairs(limits) do
if not v or v == 0 then
limits[k] = nil
end
end
-- Check if any limits are defined
if not next(limits) then
return
end
-- Get current usage
local usage, err = get_usage(conf, identifier, current_timestamp, limits)
if err then
if conf.fault_tolerant then
kong.log.err("error getting usage: ", err)
return
else
return kong.response.error(500, "Internal Server Error")
end
end
-- Check if any limit is exceeded
local stop_now = false
for period, limit in pairs(limits) do
if usage[period].remaining <= 0 then
stop_now = true
break
end
end
-- If limit exceeded, return 429
if stop_now then
-- Add headers
if not conf.hide_client_headers then
for period, limit in pairs(limits) do
kong.response.set_header("X-RateLimit-Limit-" .. period, limit)
kong.response.set_header("X-RateLimit-Remaining-" .. period, usage[period].remaining)
end
if conf.retry_after_jitter_max > 0 then
local retry_after = math.random(1, conf.retry_after_jitter_max)
kong.response.set_header("Retry-After", retry_after)
end
end
return kong.response.error(429, "API rate limit exceeded")
end
-- Increment usage
local ok, err = increment_usage(conf, identifier, current_timestamp, limits, weight)
if not ok then
if conf.fault_tolerant then
kong.log.err("error incrementing usage: ", err)
else
return kong.response.error(500, "Internal Server Error")
end
end
-- Add headers
if not conf.hide_client_headers then
for period, limit in pairs(limits) do
kong.response.set_header("X-RateLimit-Limit-" .. period, limit)
kong.response.set_header("X-RateLimit-Remaining-" .. period, math.max(0, usage[period].remaining - weight))
end
end
end
return AdvanacedRateLimiting
• Long-term: Implemented a comprehensive API management strategy:
- Created a multi-layer rate limiting approach (global, service, route, consumer)
- Implemented token-based authentication for all API clients
- Deployed distributed rate limiting with Redis cluster
- Added circuit breakers to prevent cascading failures (see the sketch after this list)
- Implemented real-time monitoring and alerting for API traffic patterns
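The circuit-breaker item above can be illustrated with a small sketch. This is a minimal counting breaker wrapped around an HTTP client, not the gateway plugin that was actually deployed; the thresholds, the backend URL, and the 5xx-counts-as-failure rule are illustrative assumptions.
```go
// circuitbreaker.go - minimal sketch of a counting circuit breaker for calls
// to an upstream service. Thresholds and the backend URL are illustrative.
package main

import (
    "errors"
    "fmt"
    "net/http"
    "sync"
    "time"
)

var ErrCircuitOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
    mu          sync.Mutex
    failures    int           // consecutive failures seen so far
    maxFailures int           // failures needed to open the circuit
    openUntil   time.Time     // requests are rejected until this time
    cooldown    time.Duration // how long the circuit stays open
}

func NewCircuitBreaker(maxFailures int, cooldown time.Duration) *CircuitBreaker {
    return &CircuitBreaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Do forwards the request unless the circuit is open, counting transport
// errors and 5xx responses as failures.
func (cb *CircuitBreaker) Do(client *http.Client, req *http.Request) (*http.Response, error) {
    cb.mu.Lock()
    if time.Now().Before(cb.openUntil) {
        cb.mu.Unlock()
        return nil, ErrCircuitOpen
    }
    cb.mu.Unlock()

    resp, err := client.Do(req)

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil || resp.StatusCode >= 500 {
        cb.failures++
        if cb.failures >= cb.maxFailures {
            cb.openUntil = time.Now().Add(cb.cooldown)
            cb.failures = 0
        }
    } else {
        cb.failures = 0
    }
    return resp, err
}

func main() {
    cb := NewCircuitBreaker(5, 30*time.Second)
    req, _ := http.NewRequest(http.MethodGet, "http://backend.internal/health", nil) // placeholder URL
    resp, err := cb.Do(http.DefaultClient, req)
    if err != nil {
        fmt.Println("request failed or circuit open:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("upstream status:", resp.Status)
}
```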
Lessons Learned:
Effective API rate limiting requires a multi-dimensional approach beyond simple IP-based throttling.
How to Avoid:
Implement rate limiting based on multiple identifiers (IP, token, consumer).
Use distributed rate limiting with proper storage backends.
Test rate limiting with realistic traffic patterns including proxy scenarios.
Monitor and alert on unusual traffic patterns before they cause issues.
Implement circuit breakers to protect backend services.
No summary provided
What Happened:
A company launched a major marketing campaign that drove significant traffic to their APIs. Despite capacity planning for the increased load, several critical services became unresponsive. Users reported timeouts and error responses, while backend services showed minimal resource utilization. The issue persisted despite scaling up backend services, suggesting a bottleneck elsewhere in the system.
Diagnosis Steps:
Analyzed API gateway logs and metrics during the incident.
Reviewed recent configuration changes to the API gateway.
Examined rate limiting policies across different services.
Tested API calls with different authentication credentials.
Compared gateway configuration across environments.
Root Cause:
The investigation revealed multiple issues with the API gateway configuration: 1. Global rate limiting was configured too aggressively at 100 requests per minute per IP 2. Rate limiting was applied at the wrong level (IP-based instead of token-based) 3. The rate limiting plugin was configured with a "redis" policy but the Redis cluster was undersized 4. Marketing campaign traffic was not exempted from rate limiting 5. Rate limiting headers were not being returned to clients, preventing proper backoff
Fix/Workaround:
• Short-term: Implemented immediate configuration fixes in Kong:
# Before: Problematic Kong rate limiting configuration
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
name: global-rate-limiting
namespace: api-gateway
config:
minute: 100
limit_by: ip
policy: redis
redis_host: redis-master
redis_port: 6379
redis_timeout: 2000
redis_database: 0
hide_client_headers: true
plugin: rate-limiting
# After: Improved Kong rate limiting configuration
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
name: global-rate-limiting
namespace: api-gateway
config:
minute: 300
limit_by: credential
policy: redis
redis_host: redis-master
redis_port: 6379
redis_timeout: 2000
redis_database: 0
hide_client_headers: false
plugin: rate-limiting
---
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
name: marketing-rate-limiting
namespace: api-gateway
config:
minute: 1000
limit_by: credential
policy: redis
redis_host: redis-master
redis_port: 6379
redis_timeout: 2000
redis_database: 0
hide_client_headers: false
plugin: rate-limiting
• Implemented service-specific rate limiting with proper consumer segmentation:
# Service-specific rate limiting
apiVersion: configuration.konghq.com/v1
kind: KongIngress
metadata:
name: payment-service-config
config:
plugins:
- name: payment-rate-limiting
config:
minute: 200
limit_by: credential
policy: redis
---
apiVersion: configuration.konghq.com/v1
kind: KongConsumer
metadata:
name: marketing-api-consumer
annotations:
kubernetes.io/ingress.class: kong
username: marketing-api
credentials:
- marketing-api-key
---
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
name: consumer-specific-rate-limiting
config:
minute: 1000
limit_by: credential
policy: redis
redis_host: redis-master
redis_port: 6379
redis_timeout: 2000
redis_database: 0
hide_client_headers: false
plugin: rate-limiting
• Implemented a custom rate limiting plugin in Lua for advanced use cases:
-- custom-rate-limiting.lua
local redis = require "resty.redis"
local timestamp = require "kong.tools.timestamp"
local policy_cluster = require "kong.plugins.rate-limiting.policies.cluster"
local kong = kong
local ngx = ngx
local max = math.max
local floor = math.floor
local EMPTY = {}
local EXPIRATION = 60 * 60 -- 1 hour in seconds
local CustomRateLimiting = {}
CustomRateLimiting.PRIORITY = 901
CustomRateLimiting.VERSION = "1.0.0"
local function get_identifier(conf)
local identifier
if conf.limit_by == "credential" then
identifier = (kong.client.get_credential() or EMPTY).id
elseif conf.limit_by == "consumer" then
identifier = (kong.client.get_consumer() or EMPTY).id
elseif conf.limit_by == "ip" then
identifier = kong.client.get_forwarded_ip()
elseif conf.limit_by == "service" then
identifier = (kong.router.get_service() or EMPTY).id
elseif conf.limit_by == "header" then
identifier = kong.request.get_header(conf.header_name)
elseif conf.limit_by == "path" then
identifier = kong.request.get_path()
end
return identifier or kong.client.get_forwarded_ip()
end
local function get_usage(conf, identifier, current_timestamp, limits)
local usage = {}
local stop
-- Custom business logic for rate limiting
if conf.business_tier == "premium" then
-- Premium tier gets higher limits
for k, v in pairs(limits) do
limits[k] = v * 2
end
elseif conf.business_tier == "marketing" then
-- Marketing campaigns get even higher limits
for k, v in pairs(limits) do
limits[k] = v * 5
end
end
-- Use the policy defined in the configuration
if conf.policy == "redis" then
local red = redis:new()
red:set_timeout(conf.redis_timeout)
local ok, err = red:connect(conf.redis_host, conf.redis_port)
if not ok then
kong.log.err("failed to connect to Redis: ", err)
return nil, nil, err
end
if conf.redis_password and conf.redis_password ~= "" then
local ok, err = red:auth(conf.redis_password)
if not ok then
kong.log.err("failed to authenticate with Redis: ", err)
return nil, nil, err
end
end
if conf.redis_database ~= 0 then
local ok, err = red:select(conf.redis_database)
if not ok then
kong.log.err("failed to change Redis database: ", err)
return nil, nil, err
end
end
local keys = {}
for period, limit in pairs(limits) do
table.insert(keys, "ratelimit:" .. identifier .. ":" .. period .. ":" .. conf.service_id)
end
red:init_pipeline()
for _, key in ipairs(keys) do
red:get(key)
end
local counts, err = red:commit_pipeline()
if not counts then
kong.log.err("failed to get counts from Redis: ", err)
return nil, nil, err
end
local periods = {}
for period in pairs(limits) do
table.insert(periods, period)
end
for i, count in ipairs(counts) do
local period = periods[i]
if count == ngx.null then
count = 0
end
usage[period] = tonumber(count)
if usage[period] and limits[period] and usage[period] >= limits[period] then
stop = true
end
end
-- Add current request to counts
red:init_pipeline()
for period, limit in pairs(limits) do
local key = "ratelimit:" .. identifier .. ":" .. period .. ":" .. conf.service_id
red:incr(key)
red:expire(key, EXPIRATION)
end
local _, err = red:commit_pipeline()
if err then
kong.log.err("failed to increment counts in Redis: ", err)
return nil, nil, err
end
local ok, err = red:set_keepalive(10000, 100)
if not ok then
kong.log.err("failed to set Redis keepalive: ", err)
end
else
-- Fall back to local policy
return policy_cluster.usage(conf, identifier, current_timestamp, limits)
end
return usage, stop
end
function CustomRateLimiting:access(conf)
local current_timestamp = timestamp.get_utc()
-- Get the identification of the consumer
local identifier = get_identifier(conf)
if not identifier then
kong.log.err("cannot identify the consumer, rate limiting skipped")
return
end
-- Load and parse consumer metadata for custom limits
local consumer = kong.client.get_consumer()
local custom_limits = {}
if consumer then
-- Guard against a missing consumer row before reading its metadata
local row = kong.db.consumers:select({ id = consumer.id })
local metadata = row and row.meta
if metadata and metadata.custom_rate_limits then
for k, v in pairs(metadata.custom_rate_limits) do
custom_limits[k] = v
end
end
end
-- Build the limits table based on conf
local limits = {}
if conf.second and conf.second > 0 then
limits.second = custom_limits.second or conf.second
end
if conf.minute and conf.minute > 0 then
limits.minute = custom_limits.minute or conf.minute
end
if conf.hour and conf.hour > 0 then
limits.hour = custom_limits.hour or conf.hour
end
if conf.day and conf.day > 0 then
limits.day = custom_limits.day or conf.day
end
if conf.month and conf.month > 0 then
limits.month = custom_limits.month or conf.month
end
if conf.year and conf.year > 0 then
limits.year = custom_limits.year or conf.year
end
-- Check if any of the limits is set
if not next(limits) then
kong.log.err("no limit is specified, rate limiting skipped")
return
end
-- Get the usage of the consumer
local usage, stop, err = get_usage(conf, identifier, current_timestamp, limits)
if err then
kong.log.err("failed to get usage: ", err)
return
end
-- If the consumer exceeded any of the limits, reject the request
if stop then
return kong.response.exit(429, { message = "API rate limit exceeded" })
end
-- Append the X-RateLimit-* headers if not disabled
if not conf.hide_client_headers then
for k, v in pairs(usage) do
kong.response.set_header("X-RateLimit-" .. k .. "-Limit", limits[k])
kong.response.set_header("X-RateLimit-" .. k .. "-Remaining", math.max(0, limits[k] - usage[k]))
end
end
kong.ctx.plugin.rate_limit = {
limit = limits,
usage = usage,
}
end
return CustomRateLimiting
• Long-term: Implemented a comprehensive API gateway management strategy:
- Created a centralized rate limiting configuration management system
- Implemented dynamic rate limiting based on service health metrics (see the sketch after this list)
- Developed a rate limiting testing framework
- Established clear procedures for rate limiting policy changes
- Implemented monitoring and alerting for rate limiting events
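A minimal sketch of the dynamic rate limiting idea, assuming the upstream error rate is already available from gateway or Prometheus metrics; the thresholds and scaling factors are illustrative, not the values used in production.
```go
// dynamiclimit.go - sketch of scaling a configured rate limit by the observed
// upstream error rate. In practice the error rate would come from gateway or
// Prometheus metrics; the thresholds here are illustrative.
package main

import "fmt"

// effectiveLimit shrinks the per-minute limit as the upstream error rate
// climbs, so the gateway sheds load before the backend collapses.
func effectiveLimit(baseLimit int, errorRate float64) int {
    switch {
    case errorRate >= 0.50: // upstream clearly unhealthy: clamp hard
        return baseLimit / 10
    case errorRate >= 0.10: // degraded: halve the limit
        return baseLimit / 2
    default: // healthy: use the configured limit
        return baseLimit
    }
}

func main() {
    for _, rate := range []float64{0.01, 0.15, 0.60} {
        fmt.Printf("error rate %.0f%% -> effective limit %d req/min\n", rate*100, effectiveLimit(300, rate))
    }
}
```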
Lessons Learned:
API gateway rate limiting requires careful configuration to balance protection and availability.
How to Avoid:
Implement rate limiting based on authentication tokens, not IP addresses.
Configure appropriate limits based on service capacity and user tiers.
Test rate limiting policies under load before deployment.
Return rate limiting headers to clients for proper backoff implementation (see the client-side sketch below).
Monitor rate limiting metrics and adjust policies as needed.
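To illustrate the client-side half of this advice, here is a hedged sketch of a caller that honors the Retry-After and X-RateLimit-Remaining-minute headers returned by the gateway configuration above; the endpoint URL and retry budget are placeholders.
```go
// backoff_client.go - sketch of a client that backs off using the Retry-After
// and X-RateLimit-Remaining-minute headers returned by the gateway.
// The endpoint URL and retry budget are placeholders.
package main

import (
    "fmt"
    "net/http"
    "strconv"
    "time"
)

// doWithBackoff retries on 429 responses, sleeping for the server-suggested
// interval when one is provided and falling back to exponential backoff.
func doWithBackoff(client *http.Client, url string, maxRetries int) (*http.Response, error) {
    for attempt := 0; ; attempt++ {
        resp, err := client.Get(url)
        if err != nil {
            return nil, err
        }
        if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
            return resp, nil
        }
        wait := time.Duration(1<<uint(attempt)) * time.Second
        if s := resp.Header.Get("Retry-After"); s != "" {
            if secs, convErr := strconv.Atoi(s); convErr == nil {
                wait = time.Duration(secs) * time.Second
            }
        }
        fmt.Printf("rate limited (remaining=%s), retrying in %s\n",
            resp.Header.Get("X-RateLimit-Remaining-minute"), wait)
        resp.Body.Close()
        time.Sleep(wait)
    }
}

func main() {
    resp, err := doWithBackoff(http.DefaultClient, "https://api.example.com/v1/orders", 5) // placeholder endpoint
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("final status:", resp.Status)
}
```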
No summary provided
What Happened:
During a scheduled maintenance window, the operations team initiated an automated certificate rotation for the Istio service mesh. Shortly after the rotation began, services started experiencing connection failures. Within minutes, the failure cascaded across the entire mesh, resulting in a complete production outage. The incident affected all services using mutual TLS for communication, which included critical business applications.
Diagnosis Steps:
Analyzed Istio control plane logs to understand the certificate rotation process.
Examined Envoy proxy logs from affected workloads.
Reviewed certificate issuance and distribution metrics.
Checked Kubernetes events and pod status across the cluster.
Monitored network traffic patterns between services.
Root Cause:
The investigation revealed multiple issues with the certificate rotation process: 1. The certificate rotation was triggered while some nodes were undergoing maintenance 2. The Istio control plane had insufficient resources to handle the certificate generation load 3. A race condition in the certificate distribution process caused some workloads to receive incomplete certificate chains 4. The certificate validation in Envoy proxies was too strict, rejecting certificates with minor issues 5. There was no graceful fallback mechanism when certificate validation failed
Fix/Workaround:
• Short-term: Implemented immediate fixes to restore service:
# Rollback to previous certificates
kubectl rollout restart deployment istiod -n istio-system
# Force reload of proxies with previous certificates
kubectl get pods --all-namespaces --no-headers -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.containerStatuses[*].ready | grep -v "true" | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n
• Created a more robust certificate rotation script:
#!/bin/bash
# safe_cert_rotation.sh - Safely rotate Istio certificates with validation
set -e
# Configuration
NAMESPACE="istio-system"
ISTIOD_DEPLOYMENT="istiod"
CERT_VALIDITY_DAYS=30
MAX_UNAVAILABLE_PERCENT=10
ROTATION_TIMEOUT=1800 # 30 minutes
VALIDATION_INTERVAL=10
ROLLBACK_ON_FAILURE=true
# Check prerequisites
if ! command -v kubectl &> /dev/null; then
echo "kubectl not found. Please install kubectl."
exit 1
fi
if ! command -v jq &> /dev/null; then
echo "jq not found. Please install jq."
exit 1
fi
# Verify cluster connectivity
echo "Verifying cluster connectivity..."
kubectl get nodes &> /dev/null || { echo "Cannot connect to Kubernetes cluster"; exit 1; }
# Check Istio control plane health
echo "Checking Istio control plane health..."
ISTIOD_READY=$(kubectl get deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE -o jsonpath='{.status.readyReplicas}')
ISTIOD_TOTAL=$(kubectl get deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE -o jsonpath='{.status.replicas}')
if [ "$ISTIOD_READY" != "$ISTIOD_TOTAL" ]; then
echo "Warning: Istio control plane is not fully ready ($ISTIOD_READY/$ISTIOD_TOTAL replicas ready)"
read -p "Continue anyway? (y/n) " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
exit 1
fi
fi
# Check for ongoing node maintenance
NODES_NOT_READY=$(kubectl get nodes -o jsonpath='{.items[?(@.status.conditions[?(@.type=="Ready")].status!="True")].metadata.name}')
if [ ! -z "$NODES_NOT_READY" ]; then
echo "Warning: Some nodes are not ready: $NODES_NOT_READY"
read -p "Continue anyway? (y/n) " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
exit 1
fi
fi
# Backup current certificates
echo "Backing up current certificates..."
BACKUP_DIR="istio-certs-backup-$(date +%Y%m%d-%H%M%S)"
mkdir -p $BACKUP_DIR
kubectl get secret -n $NAMESPACE -l istio.io/cert-management=true -o json > "$BACKUP_DIR/cert-secrets.json"
kubectl get configmap -n $NAMESPACE -l istio.io/cert-management=true -o json > "$BACKUP_DIR/cert-configmaps.json"
echo "Certificates backed up to $BACKUP_DIR"
# Scale up Istio control plane for rotation
echo "Scaling up Istio control plane for certificate rotation..."
ORIGINAL_REPLICAS=$ISTIOD_READY
ROTATION_REPLICAS=$((ORIGINAL_REPLICAS + 2))
kubectl scale deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --replicas=$ROTATION_REPLICAS
echo "Waiting for control plane scale up..."
kubectl rollout status deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --timeout=300s
# Start certificate rotation
echo "Initiating certificate rotation..."
kubectl delete secret cacerts -n $NAMESPACE || true
# Generate new root certificate with longer validity
cat > ca.conf << EOF
[ req ]
default_bits = 4096
prompt = no
default_md = sha256
req_extensions = req_ext
distinguished_name = dn
[ dn ]
O = Example Organization
CN = Example Root CA
[ req_ext ]
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = istiod.istio-system.svc
[ v3_ca ]
basicConstraints = critical, CA:TRUE
keyUsage = critical, digitalSignature, keyEncipherment, keyCertSign
EOF
openssl genrsa -out root-key.pem 4096
openssl req -new -key root-key.pem -config ca.conf -out root-cert.csr
openssl x509 -req -days $CERT_VALIDITY_DAYS -in root-cert.csr -signkey root-key.pem -out root-cert.pem -extensions v3_ca -extfile ca.conf
# Generate intermediate certificates
openssl genrsa -out ca-key.pem 4096
openssl req -new -key ca-key.pem -out ca-cert.csr -config ca.conf
openssl x509 -req -days $CERT_VALIDITY_DAYS -in ca-cert.csr -CA root-cert.pem -CAkey root-key.pem -CAcreateserial -out ca-cert.pem -extensions v3_ca -extfile ca.conf
# Create chain certificate
cat ca-cert.pem root-cert.pem > cert-chain.pem
# Create Kubernetes secret
kubectl create secret generic cacerts -n $NAMESPACE \
--from-file=ca-cert.pem \
--from-file=ca-key.pem \
--from-file=root-cert.pem \
--from-file=cert-chain.pem
# Restart Istio control plane to pick up new certificates
echo "Restarting Istio control plane with new certificates..."
kubectl rollout restart deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE
kubectl rollout status deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --timeout=300s
# Monitor certificate distribution
echo "Monitoring certificate distribution..."
start_time=$(date +%s)
end_time=$((start_time + ROTATION_TIMEOUT))
success=false
while [ $(date +%s) -lt $end_time ]; do
# Check certificate distribution progress
total_pods=$(kubectl get pods --all-namespaces -l istio.io/rev -o json | jq '.items | length')
updated_pods=$(kubectl get pods --all-namespaces -l istio.io/rev -o json | jq '[.items[] | select(.metadata.annotations["istio.io/cert-update-status"] == "updated")] | length')
if [ "$total_pods" -eq 0 ]; then
echo "No Istio-injected pods found. Is Istio properly installed?"
break
fi
percent_complete=$((updated_pods * 100 / total_pods))
echo "Certificate rotation progress: $percent_complete% ($updated_pods/$total_pods pods updated)"
if [ "$percent_complete" -eq 100 ]; then
echo "Certificate rotation completed successfully!"
success=true
break
fi
# Check for failures
failed_pods=$(kubectl get pods --all-namespaces -o json | jq '[.items[] | select(.status.containerStatuses != null) | select(.status.containerStatuses[].ready == false)] | length')
failed_percent=$((failed_pods * 100 / total_pods))
if [ "$failed_percent" -gt "$MAX_UNAVAILABLE_PERCENT" ]; then
echo "Error: Too many pods are failing ($failed_percent% > $MAX_UNAVAILABLE_PERCENT%)"
if [ "$ROLLBACK_ON_FAILURE" = true ]; then
echo "Initiating rollback..."
break
fi
fi
sleep $VALIDATION_INTERVAL
done
# Validate service mesh health
echo "Validating service mesh health..."
if ! $success; then
echo "Certificate rotation did not complete within the timeout period or failed"
if [ "$ROLLBACK_ON_FAILURE" = true ]; then
echo "Rolling back to previous certificates..."
kubectl delete secret cacerts -n $NAMESPACE
kubectl create -f "$BACKUP_DIR/cert-secrets.json"
kubectl rollout restart deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE
kubectl rollout status deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --timeout=300s
echo "Rollback completed. Restoring original replica count..."
kubectl scale deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --replicas=$ORIGINAL_REPLICAS
exit 1
fi
fi
# Scale down Istio control plane to original size
echo "Scaling down Istio control plane to original size..."
kubectl scale deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --replicas=$ORIGINAL_REPLICAS
kubectl rollout status deployment $ISTIOD_DEPLOYMENT -n $NAMESPACE --timeout=300s
echo "Certificate rotation completed successfully"
exit 0
• Implemented a Go-based certificate validation tool:
// certvalidator/main.go
package main
import (
"context"
"crypto/x509"
"encoding/pem"
"flag"
"fmt"
"log"
"strings"
"sync"
"time"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/clientcmd"
)
type CertInfo struct {
Subject string
Issuer string
NotBefore time.Time
NotAfter time.Time
IsCA bool
DNSNames []string
KeyUsage x509.KeyUsage
ExtKeyUsage []x509.ExtKeyUsage
}
func main() {
var kubeconfig string
var namespace string
var allNamespaces bool
var verbose bool
var threshold int
flag.StringVar(&kubeconfig, "kubeconfig", "", "Path to kubeconfig file")
flag.StringVar(&namespace, "namespace", "istio-system", "Namespace to check")
flag.BoolVar(&allNamespaces, "all-namespaces", false, "Check all namespaces")
flag.BoolVar(&verbose, "verbose", false, "Verbose output")
flag.IntVar(&threshold, "threshold", 30, "Warning threshold for certificate expiration in days")
flag.Parse()
// Create Kubernetes client
var config *rest.Config
var err error
if kubeconfig == "" {
log.Println("Using in-cluster configuration")
config, err = rest.InClusterConfig()
} else {
log.Printf("Using configuration from %s", kubeconfig)
config, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
}
if err != nil {
log.Fatalf("Error building kubeconfig: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Error creating Kubernetes client: %v", err)
}
// Get namespaces to check
var namespaces []string
if allNamespaces {
nsList, err := clientset.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Fatalf("Error listing namespaces: %v", err)
}
for _, ns := range nsList.Items {
namespaces = append(namespaces, ns.Name)
}
} else {
namespaces = []string{namespace}
}
// Check certificates in each namespace
var wg sync.WaitGroup
for _, ns := range namespaces {
wg.Add(1)
go func(namespace string) {
defer wg.Done()
checkNamespace(clientset, namespace, threshold, verbose)
}(ns)
}
wg.Wait()
}
func checkNamespace(clientset *kubernetes.Clientset, namespace string, threshold int, verbose bool) {
log.Printf("Checking certificates in namespace %s", namespace)
// Check secrets
secrets, err := clientset.CoreV1().Secrets(namespace).List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Error listing secrets in namespace %s: %v", namespace, err)
return
}
for _, secret := range secrets.Items {
// Skip non-TLS secrets
if !strings.Contains(string(secret.Type), "tls") && !strings.Contains(string(secret.Type), "TLS") {
continue
}
log.Printf("Checking secret %s/%s", namespace, secret.Name)
// Check each certificate in the secret
for key, data := range secret.Data {
if !strings.Contains(key, "crt") && !strings.Contains(key, "cert") && !strings.Contains(key, "ca.pem") {
continue
}
certInfo, err := parseCertificate(data)
if err != nil {
log.Printf("Error parsing certificate in %s/%s[%s]: %v", namespace, secret.Name, key, err)
continue
}
// Check certificate validity
now := time.Now()
if now.Before(certInfo.NotBefore) {
log.Printf("WARNING: Certificate in %s/%s[%s] is not yet valid (valid from %s)",
namespace, secret.Name, key, certInfo.NotBefore)
}
if now.After(certInfo.NotAfter) {
log.Printf("ERROR: Certificate in %s/%s[%s] has expired (valid until %s)",
namespace, secret.Name, key, certInfo.NotAfter)
}
daysUntilExpiration := int(certInfo.NotAfter.Sub(now).Hours() / 24)
if daysUntilExpiration < threshold {
log.Printf("WARNING: Certificate in %s/%s[%s] will expire in %d days (on %s)",
namespace, secret.Name, key, daysUntilExpiration, certInfo.NotAfter)
}
if verbose {
log.Printf("Certificate details for %s/%s[%s]:", namespace, secret.Name, key)
log.Printf(" Subject: %s", certInfo.Subject)
log.Printf(" Issuer: %s", certInfo.Issuer)
log.Printf(" Valid from: %s to %s", certInfo.NotBefore, certInfo.NotAfter)
log.Printf(" Is CA: %t", certInfo.IsCA)
log.Printf(" DNS names: %v", certInfo.DNSNames)
}
}
}
// Check configmaps for certificates
configmaps, err := clientset.CoreV1().ConfigMaps(namespace).List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Error listing configmaps in namespace %s: %v", namespace, err)
return
}
for _, configmap := range configmaps.Items {
for key, data := range configmap.Data {
if !strings.Contains(key, "crt") && !strings.Contains(key, "cert") && !strings.Contains(key, "ca.pem") {
continue
}
certInfo, err := parseCertificate([]byte(data))
if err != nil {
log.Printf("Error parsing certificate in configmap %s/%s[%s]: %v",
namespace, configmap.Name, key, err)
continue
}
// Check certificate validity
now := time.Now()
if now.Before(certInfo.NotBefore) {
log.Printf("WARNING: Certificate in configmap %s/%s[%s] is not yet valid (valid from %s)",
namespace, configmap.Name, key, certInfo.NotBefore)
}
if now.After(certInfo.NotAfter) {
log.Printf("ERROR: Certificate in configmap %s/%s[%s] has expired (valid until %s)",
namespace, configmap.Name, key, certInfo.NotAfter)
}
daysUntilExpiration := int(certInfo.NotAfter.Sub(now).Hours() / 24)
if daysUntilExpiration < threshold {
log.Printf("WARNING: Certificate in configmap %s/%s[%s] will expire in %d days (on %s)",
namespace, configmap.Name, key, daysUntilExpiration, certInfo.NotAfter)
}
if verbose {
log.Printf("Certificate details for configmap %s/%s[%s]:", namespace, configmap.Name, key)
log.Printf(" Subject: %s", certInfo.Subject)
log.Printf(" Issuer: %s", certInfo.Issuer)
log.Printf(" Valid from: %s to %s", certInfo.NotBefore, certInfo.NotAfter)
log.Printf(" Is CA: %t", certInfo.IsCA)
log.Printf(" DNS names: %v", certInfo.DNSNames)
}
}
}
}
func parseCertificate(data []byte) (*CertInfo, error) {
block, _ := pem.Decode(data)
if block == nil {
return nil, fmt.Errorf("failed to decode PEM block")
}
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
return nil, err
}
return &CertInfo{
Subject: cert.Subject.String(),
Issuer: cert.Issuer.String(),
NotBefore: cert.NotBefore,
NotAfter: cert.NotAfter,
IsCA: cert.IsCA,
DNSNames: cert.DNSNames,
KeyUsage: cert.KeyUsage,
ExtKeyUsage: cert.ExtKeyUsage,
}, nil
}
• Updated Istio configuration for more resilient certificate handling:
# istio-operator.yaml - Updated configuration for certificate handling
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
namespace: istio-system
name: istio-control-plane
spec:
profile: default
components:
pilot:
k8s:
resources:
requests:
cpu: 500m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
hpaSpec:
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
targetAverageUtilization: 80
ingressGateways:
- name: istio-ingressgateway
enabled: true
meshConfig:
defaultConfig:
proxyMetadata:
ISTIO_META_DNS_CAPTURE: "true"
ISTIO_META_DNS_AUTO_ALLOCATE: "true"
enablePrometheusMerge: true
enableTracing: true
accessLogFile: "/dev/stdout"
accessLogFormat: |
[%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%" %RESPONSE_CODE% %RESPONSE_FLAGS% %RESPONSE_CODE_DETAILS% %CONNECTION_TERMINATION_DETAILS% "%UPSTREAM_TRANSPORT_FAILURE_REASON%" %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% "%REQ(X-FORWARDED-FOR)%" "%REQ(USER-AGENT)%" "%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%" %UPSTREAM_CLUSTER% %UPSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_REMOTE_ADDRESS% %REQUESTED_SERVER_NAME% %ROUTE_NAME%
rootNamespace: istio-system
trustDomain: cluster.local
caCertificatesPersistenceEnabled: true
certificateRotationPeriod: 720h # 30 days
certificateRotationGracePeriod: 168h # 7 days
defaultServiceExportTo:
- "*"
defaultVirtualServiceExportTo:
- "*"
defaultDestinationRuleExportTo:
- "*"
values:
global:
proxy:
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 2000m
memory: 1024Mi
holdApplicationUntilProxyStarts: true
tracer:
zipkin:
address: zipkin.istio-system:9411
proxy_init:
resources:
limits:
cpu: 2000m
memory: 1024Mi
requests:
cpu: 10m
memory: 10Mi
logging:
level: "default:info"
pilotCertProvider: istiod
jwtPolicy: third-party-jwt
caAddress: istiod.istio-system.svc:15012
mountMtlsCerts: true
pilot:
env:
PILOT_CERT_PROVIDER: istiod
PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_OUTBOUND: "true"
PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_INBOUND: "true"
PILOT_ENABLE_CERTIFICATE_ROTATION_GRACE_PERIOD: "true"
PILOT_CERTIFICATE_ROTATION_GRACE_PERIOD_PERCENT: "20"
PILOT_ENABLE_CERTIFICATE_ROTATION_FAILURE_RECOVERY: "true"
PILOT_CERTIFICATE_ROTATION_FAILURE_RETRY_DELAY: "1m"
PILOT_CERTIFICATE_ROTATION_MAX_RETRIES: "10"
• Long-term: Implemented a comprehensive certificate management strategy:
- Created a certificate rotation runbook with pre-flight checks
- Implemented automated certificate monitoring with alerting
- Developed a certificate rotation testing framework
- Established clear incident response procedures for certificate issues
- Implemented certificate rotation simulation in chaos testing
Lessons Learned:
Certificate rotation in service meshes requires careful planning and robust fallback mechanisms.
How to Avoid:
Implement proper resource allocation for certificate management components.
Create a gradual certificate rotation strategy with validation at each step.
Test certificate rotation procedures in non-production environments.
Implement monitoring for certificate-related metrics and alerts.
Establish clear rollback procedures for certificate rotation failures.
No summary provided
What Happened:
At 2:00 AM, monitoring systems detected a sudden spike in connection failures across multiple services in a production Kubernetes cluster using Istio service mesh. Users reported widespread "connection refused" errors, and internal services were unable to communicate with each other. The incident coincided with the scheduled expiration of TLS certificates used for mTLS communication within the service mesh. The automated certificate rotation process had failed silently several days earlier, but the issue only became apparent when the certificates actually expired.
Diagnosis Steps:
Analyzed Istio proxy logs to identify TLS handshake failures.
Checked certificate expiration dates using OpenSSL commands.
Reviewed certificate issuance and rotation automation logs.
Examined Istio control plane components for errors.
Verified certificate authority (CA) functionality.
Root Cause:
The investigation revealed multiple issues with the certificate management: 1. The certificate rotation job had failed due to an API permission change 2. No alerting was configured for failed certificate rotation attempts 3. Certificate expiration monitoring was missing 4. The rotation job failure was logged but not escalated 5. Certificate lifetimes were too short (7 days) with no buffer period
Fix/Workaround:
• Implemented immediate manual certificate rotation to restore service
• Created a comprehensive certificate management strategy
• Added monitoring and alerting for certificate expiration
• Extended certificate lifetimes with appropriate overlap
• Implemented automated testing of the rotation process
Lessons Learned:
Certificate management in service meshes requires robust automation, monitoring, and failure detection.
How to Avoid:
Implement certificate expiration monitoring with alerts at multiple thresholds.
Configure longer certificate lifetimes with appropriate overlap periods.
Test certificate rotation processes regularly in non-production environments.
Create alerting for failed rotation attempts, not just expiration events (see the sketch below).
Document and practice manual certificate rotation procedures.
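As a sketch of alerting on failed rotation attempts rather than only on expiration, the following program lists the Jobs spawned by a rotation CronJob and flags failures so a silent failure is escalated instead of merely logged. The istio-system namespace and the app=cert-rotation label are assumptions about how the rotation job is deployed.
```go
// rotation_job_monitor.go - sketch that flags failed certificate-rotation Jobs
// so a silent failure is escalated instead of just logged. The namespace and
// the app=cert-rotation label are assumptions about the rotation CronJob.
package main

import (
    "context"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Fatalf("failed to load in-cluster config: %v", err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatalf("failed to create Kubernetes client: %v", err)
    }

    // List the Jobs spawned by the rotation CronJob and count failures.
    jobs, err := clientset.BatchV1().Jobs("istio-system").List(context.TODO(), metav1.ListOptions{
        LabelSelector: "app=cert-rotation", // assumed label on the rotation CronJob
    })
    if err != nil {
        log.Fatalf("failed to list rotation jobs: %v", err)
    }

    failed := 0
    for _, job := range jobs.Items {
        if job.Status.Failed > 0 {
            failed++
            log.Printf("ALERT: rotation job %s has %d failed pods", job.Name, job.Status.Failed)
        }
    }
    if failed == 0 {
        log.Printf("all %d recent rotation jobs completed successfully", len(jobs.Items))
    }
    // In practice this result would be exported as a metric or pushed to an
    // alerting webhook rather than only logged.
}
```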
No summary provided
What Happened:
During a marketing campaign launch, a company's public API experienced a sudden traffic surge. Despite having rate limiting configured in the Kong API Gateway, the traffic overwhelmed backend services, causing cascading failures across the platform. The operations team had to implement emergency measures to restore service, including temporarily blocking certain client IPs and scaling up backend services. Post-incident analysis revealed that the rate limiting configuration was ineffective under the specific traffic patterns experienced.
Diagnosis Steps:
Analyzed API gateway logs for traffic patterns and rate limiting behavior.
Examined backend service metrics during the incident.
Reviewed rate limiting configuration in the API gateway.
Tested rate limiting effectiveness under various traffic patterns.
Compared global and route-specific rate limiting settings.
Root Cause:
The investigation revealed multiple issues with the rate limiting configuration: 1. Rate limits were configured per route but not globally across routes 2. The rate limiting window was too large (1 minute), allowing traffic bursts 3. Rate limiting was based on client IP, but traffic came through a load balancer with IP masking 4. No rate limiting was applied to authenticated vs. unauthenticated requests 5. The rate limiting plugin was configured with "continue on error" mode
Fix/Workaround:
• Implemented immediate fixes to protect backend services
• Reconfigured rate limiting with appropriate granularity
• Added global and route-specific limits with proper windows
• Implemented advanced identification beyond client IP (see the sketch below)
• Created tiered rate limiting based on client importance
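A minimal sketch of identification beyond the client IP, assuming the load balancer appends to X-Forwarded-For and that an X-API-Key header carries the credential; both header names are illustrative rather than the exact scheme used in production.
```go
// limit_key.go - sketch of deriving a rate-limit key when traffic arrives
// through a load balancer that masks the client IP. Header names and the
// trusted-proxy assumption are illustrative.
package main

import (
    "fmt"
    "net/http"
    "strings"
)

// rateLimitKey prefers the API credential, then the original client address
// from X-Forwarded-For, and only then the direct remote address.
// Note: the left-most X-Forwarded-For entry is client-supplied and spoofable;
// a production setup should only trust entries appended by known proxies.
func rateLimitKey(r *http.Request) string {
    if key := r.Header.Get("X-API-Key"); key != "" {
        return "credential:" + key
    }
    if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
        parts := strings.Split(xff, ",")
        return "ip:" + strings.TrimSpace(parts[0])
    }
    return "ip:" + r.RemoteAddr
}

func main() {
    req, _ := http.NewRequest(http.MethodGet, "https://api.example.com/v1/orders", nil) // placeholder endpoint
    req.Header.Set("X-Forwarded-For", "203.0.113.10, 10.0.0.5")
    req.RemoteAddr = "10.0.0.5:52114"
    fmt.Println(rateLimitKey(req)) // ip:203.0.113.10
}
```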
Lessons Learned:
API gateway rate limiting requires careful configuration and testing under realistic traffic patterns.
How to Avoid:
Implement multi-layered rate limiting (global, service, route).
Test rate limiting under various traffic patterns, including bursts.
Configure appropriate identification methods beyond client IP.
Create tiered rate limiting based on client authentication and importance.
Monitor rate limiting effectiveness and adjust based on traffic patterns.
No summary provided
What Happened:
A company implemented JWT-based authentication for their APIs using Kong API Gateway. After several weeks in production, the security team discovered that certain protected endpoints were accessible without valid authentication. Investigation revealed that the JWT validation configuration in the API gateway was incorrectly implemented, allowing requests with malformed or expired tokens to pass through to backend services. This created a significant security vulnerability that potentially exposed sensitive data.
Diagnosis Steps:
Analyzed API gateway logs for authentication patterns.
Tested endpoints with various token configurations.
Reviewed JWT validation plugin configuration.
Examined token issuance and validation flow.
Verified claims validation and signature verification settings.
Root Cause:
The investigation revealed multiple issues with the JWT validation: 1. The JWT signature verification was misconfigured with incorrect public keys 2. Token expiration validation was not properly enforced 3. Required claims validation was incomplete 4. The plugin configuration was inconsistently applied across routes 5. Error handling allowed certain invalid tokens to pass through
Fix/Workaround:
• Implemented immediate fixes to secure all endpoints
• Corrected JWT validation configuration with proper signature verification
• Enforced token expiration and claims validation
• Standardized plugin configuration across all routes
• Improved error handling for invalid tokens
Lessons Learned:
API gateway authentication requires careful configuration and comprehensive testing.
How to Avoid:
Implement comprehensive security testing for API gateway configurations.
Create automated validation tests for authentication mechanisms.
Standardize authentication plugin configuration across routes.
Regularly audit authentication logs for unusual patterns.
Establish clear ownership and review processes for security configurations.
```yaml
# Example of proper Kong JWT plugin configuration
plugins:
- name: jwt
config:
# Properly configured claims validation
claims_to_verify:
- exp
- nbf
# Multiple signature verification algorithms
algorithms:
- RS256
- ES256
# Proper key configuration
key_claim_name: kid
secret_is_base64: false
# Comprehensive validation settings
run_on_preflight: true
maximum_expiration: 86400
# Proper error handling
uri_param_names:
- jwt
cookie_names: []
header_names:
- Authorization
# Enforce token format
token_format:
bearer: true
base64: false
```
```go
// Example Go code for proper JWT token validation
package main
import (
"context"
"fmt"
"net/http"
"strings"
"github.com/golang-jwt/jwt/v4"
)
// Define custom claims with proper validation fields
type CustomClaims struct {
Permissions []string `json:"permissions"`
jwt.RegisteredClaims
}
// JWT validation middleware with comprehensive checks
func JWTMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Extract token from Authorization header
authHeader := r.Header.Get("Authorization")
if authHeader == "" {
http.Error(w, "Authorization header required", http.StatusUnauthorized)
return
}
// Validate Bearer format
bearerPrefix := "Bearer "
if !strings.HasPrefix(authHeader, bearerPrefix) {
http.Error(w, "Invalid authorization format", http.StatusUnauthorized)
return
}
// Extract token
tokenString := strings.TrimPrefix(authHeader, bearerPrefix)
// Parse and validate token with custom claims
token, err := jwt.ParseWithClaims(tokenString, &CustomClaims{}, func(token *jwt.Token) (interface{}, error) {
// Validate signing algorithm
if _, ok := token.Method.(*jwt.SigningMethodRSA); !ok {
return nil, fmt.Errorf("unexpected signing method: %v", token.Header["alg"])
}
// Get key ID from token header
kid, ok := token.Header["kid"].(string)
if !ok {
return nil, fmt.Errorf("key ID not found in token")
}
// Retrieve public key based on key ID (implementation depends on key management)
publicKey, err := getPublicKey(kid)
if err != nil {
return nil, err
}
return publicKey, nil
})
// Handle validation errors with specific error messages
if err != nil {
switch {
case strings.Contains(err.Error(), "token is expired"):
http.Error(w, "Token expired", http.StatusUnauthorized)
case strings.Contains(err.Error(), "signature is invalid"):
http.Error(w, "Invalid token signature", http.StatusUnauthorized)
default:
http.Error(w, "Invalid token: "+err.Error(), http.StatusUnauthorized)
}
return
}
// Validate token claims
if claims, ok := token.Claims.(*CustomClaims); ok && token.Valid {
// Additional custom validation
if !hasRequiredPermissions(claims.Permissions, r.URL.Path, r.Method) {
http.Error(w, "Insufficient permissions", http.StatusForbidden)
return
}
// Set claims in request context for downstream handlers
ctx := setClaimsContext(r.Context(), claims)
next.ServeHTTP(w, r.WithContext(ctx))
} else {
http.Error(w, "Invalid token claims", http.StatusUnauthorized)
}
})
}
// Helper functions (implementation details omitted)
func getPublicKey(kid string) (interface{}, error) {
// Implementation would retrieve the correct public key based on key ID
return nil, nil
}
func hasRequiredPermissions(permissions []string, path, method string) bool {
// Implementation would check if the token has the required permissions
return true
}
func setClaimsContext(ctx context.Context, claims *CustomClaims) context.Context {
// Implementation would attach the claims to the request context (e.g. context.WithValue)
return ctx
}
```
No summary provided
What Happened:
A financial services company using Istio service mesh for their microservices architecture experienced widespread authentication failures during peak business hours. Services began reporting TLS handshake errors, and inter-service communication broke down across multiple critical applications. The incident caused a partial outage affecting customer-facing services. Investigation revealed that the Istio-managed mTLS certificates had expired, and the automatic rotation mechanism had silently failed weeks earlier without triggering alerts.
Diagnosis Steps:
Analyzed service mesh proxy logs for error patterns.
Examined certificate expiration dates across the mesh.
Reviewed certificate issuance and rotation configurations.
Checked certificate authority (CA) status and health.
Investigated recent changes to the service mesh configuration.
Root Cause:
The investigation revealed multiple issues with certificate management: 1. The Istio certificate authority (istiod) had insufficient permissions to write to the certificate storage location 2. Certificate rotation logs were being discarded due to a misconfigured log level 3. No monitoring was in place for certificate expiration or rotation failures 4. A recent security hardening change had modified the certificate storage permissions 5. The certificate rotation failure predated the expiration by several weeks
Fix/Workaround:
• Implemented immediate fixes to restore service
• Manually rotated all expired certificates
• Corrected permissions for certificate storage locations
• Configured proper logging for certificate operations
• Implemented certificate expiration monitoring and alerting
# Istio Certificate Monitoring Configuration
# File: istio-cert-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: istio-cert-expiry-alerts
namespace: istio-system
spec:
groups:
- name: istio-cert-expiry
rules:
# Alert when workload certificates are nearing expiration
- alert: IstioWorkloadCertExpiringSoon
expr: |
(
max by(source_workload, source_namespace) (
envoy_server_ssl_socket_factory_context_ssl_context_days_until_first_cert_expires{
reporter="source"
}
) < 7
)
for: 1h
labels:
severity: warning
team: platform
annotations:
summary: "Istio workload certificate expiring soon"
description: "Workload {{ $labels.source_workload }} in namespace {{ $labels.source_namespace }} has a certificate that will expire in {{ $value }} days."
# Critical alert for imminent expiration
- alert: IstioWorkloadCertExpiringCritical
expr: |
(
max by(source_workload, source_namespace) (
envoy_server_ssl_socket_factory_context_ssl_context_days_until_first_cert_expires{
reporter="source"
}
) < 2
)
for: 10m
labels:
severity: critical
team: platform
annotations:
summary: "Istio workload certificate critically close to expiration"
description: "CRITICAL: Workload {{ $labels.source_workload }} in namespace {{ $labels.source_namespace }} has a certificate that will expire in {{ $value }} days."
# Alert when istiod certificates are nearing expiration
- alert: IstiodCertExpiringSoon
expr: |
(
max by(job) (
citadel_server_cert_expiry_seconds / 86400 < 30
)
)
for: 1h
labels:
severity: warning
team: platform
annotations:
summary: "Istiod certificate expiring soon"
description: "Istiod certificate will expire in {{ $value }} days."
# Alert for certificate rotation failures
- alert: IstioCertRotationFailure
expr: |
increase(citadel_server_csr_sign_error_count[1h]) > 0
for: 15m
labels:
severity: critical
team: platform
annotations:
summary: "Istio certificate rotation failures detected"
description: "Istio certificate rotation has been failing for the last 15 minutes."
---
apiVersion: v1
kind: ConfigMap
metadata:
name: istio-cert-checker
namespace: istio-system
data:
cert-checker.sh: |
#!/bin/bash
# Script to check Istio certificate health
# Check istiod certificate
echo "Checking istiod certificate..."
ISTIOD_CERT_EXPIRY=$(kubectl exec -n istio-system deployment/istiod -- sh -c "openssl x509 -in /etc/certs/cert-chain.pem -noout -dates | grep notAfter | cut -d= -f2")
ISTIOD_EXPIRY_SECONDS=$(date -d "$ISTIOD_CERT_EXPIRY" +%s)
NOW_SECONDS=$(date +%s)
DAYS_REMAINING=$(( ($ISTIOD_EXPIRY_SECONDS - $NOW_SECONDS) / 86400 ))
echo "Istiod certificate expires in $DAYS_REMAINING days"
if [ $DAYS_REMAINING -lt 30 ]; then
echo "WARNING: Istiod certificate expiring soon!"
fi
# Check workload certificates (sample of pods)
echo "Checking workload certificates..."
NAMESPACES=$(kubectl get namespace -l istio-injection=enabled -o jsonpath='{.items[*].metadata.name}')
for NS in $NAMESPACES; do
PODS=$(kubectl get pods -n $NS -o jsonpath='{.items[*].metadata.name}')
for POD in $PODS; do
if kubectl exec -n $NS $POD -c istio-proxy -- ls /etc/certs/cert-chain.pem > /dev/null 2>&1; then
CERT_EXPIRY=$(kubectl exec -n $NS $POD -c istio-proxy -- sh -c "openssl x509 -in /etc/certs/cert-chain.pem -noout -dates | grep notAfter | cut -d= -f2")
EXPIRY_SECONDS=$(date -d "$CERT_EXPIRY" +%s)
POD_DAYS_REMAINING=$(( ($EXPIRY_SECONDS - $NOW_SECONDS) / 86400 ))
echo "Pod $POD in namespace $NS: Certificate expires in $POD_DAYS_REMAINING days"
if [ $POD_DAYS_REMAINING -lt 7 ]; then
echo "WARNING: Certificate for $POD in $NS expiring soon!"
fi
fi
done
done
# Check certificate rotation logs
echo "Checking certificate rotation logs..."
ROTATION_ERRORS=$(kubectl logs -n istio-system -l app=istiod --tail=1000 | grep -c "Failed to sign CSR")
if [ $ROTATION_ERRORS -gt 0 ]; then
echo "ERROR: Detected $ROTATION_ERRORS certificate rotation failures in recent logs!"
else
echo "No recent certificate rotation errors detected"
fi
# Check CA permissions
echo "Checking CA permissions..."
kubectl exec -n istio-system deployment/istiod -- ls -la /var/run/secrets/istio-dns
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: istio-cert-checker
namespace: istio-system
spec:
schedule: "0 */6 * * *" # Run every 6 hours
jobTemplate:
spec:
template:
spec:
serviceAccountName: istio-cert-checker
containers:
- name: cert-checker
image: istio/kubectl:latest
command:
- /bin/bash
- /scripts/cert-checker.sh
volumeMounts:
- name: scripts
mountPath: /scripts
volumes:
- name: scripts
configMap:
name: istio-cert-checker
defaultMode: 0755
restartPolicy: OnFailure
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: istio-cert-checker
namespace: istio-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: istio-cert-checker
rules:
- apiGroups: [""]
resources: ["pods", "namespaces"]
verbs: ["get", "list"]
- apiGroups: [""]
resources: ["pods/exec"]
verbs: ["create"]
- apiGroups: [""]
resources: ["pods/log"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: istio-cert-checker
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: istio-cert-checker
subjects:
- kind: ServiceAccount
name: istio-cert-checker
namespace: istio-system
// Go implementation of certificate rotation verification
// File: cert-rotation-verifier.go
package main
import (
"context"
"crypto/x509"
"encoding/pem"
"fmt"
"log"
"os"
"strings"
"time"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
)
const (
// Minimum acceptable certificate lifetime
minCertLifetime = 7 * 24 * time.Hour
// Critical certificate lifetime threshold
criticalCertLifetime = 2 * 24 * time.Hour
)
func main() {
// Create Kubernetes client
config, err := rest.InClusterConfig()
if err != nil {
log.Fatalf("Failed to create in-cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Failed to create Kubernetes client: %v", err)
}
// Get all namespaces with Istio injection enabled
namespaces, err := clientset.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{
LabelSelector: "istio-injection=enabled",
})
if err != nil {
log.Fatalf("Failed to list namespaces: %v", err)
}
// Track certificate issues
var issues []string
// Check istiod certificates
istiodIssues := checkIstiodCertificates(clientset)
issues = append(issues, istiodIssues...)
// Check workload certificates
for _, ns := range namespaces.Items {
namespace := ns.Name
pods, err := clientset.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Failed to list pods in namespace %s: %v", namespace, err)
continue
}
for _, pod := range pods.Items {
podIssues := checkPodCertificates(clientset, pod.Name, namespace)
issues = append(issues, podIssues...)
}
}
// Check certificate rotation logs
rotationIssues := checkCertificateRotationLogs(clientset)
issues = append(issues, rotationIssues...)
// Report issues
if len(issues) > 0 {
log.Printf("Found %d certificate issues:", len(issues))
for i, issue := range issues {
log.Printf("%d. %s", i+1, issue)
}
os.Exit(1)
} else {
log.Println("No certificate issues found.")
}
}
func checkIstiodCertificates(clientset *kubernetes.Clientset) []string {
var issues []string
// Get istiod pods
pods, err := clientset.CoreV1().Pods("istio-system").List(context.TODO(), metav1.ListOptions{
LabelSelector: "app=istiod",
})
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to list istiod pods: %v", err))
return issues
}
for _, pod := range pods.Items {
// Get certificate from istiod pod
certBytes, err := execInPod(clientset, pod.Name, "istio-system", "cat /etc/certs/cert-chain.pem")
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to get certificate from istiod pod %s: %v", pod.Name, err))
continue
}
// Parse certificate
block, _ := pem.Decode(certBytes)
if block == nil {
issues = append(issues, fmt.Sprintf("Failed to decode PEM certificate from istiod pod %s", pod.Name))
continue
}
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to parse certificate from istiod pod %s: %v", pod.Name, err))
continue
}
// Check expiration
timeUntilExpiry := time.Until(cert.NotAfter)
if timeUntilExpiry < criticalCertLifetime {
issues = append(issues, fmt.Sprintf("CRITICAL: Istiod certificate in pod %s will expire in %.1f hours",
pod.Name, timeUntilExpiry.Hours()))
} else if timeUntilExpiry < minCertLifetime {
issues = append(issues, fmt.Sprintf("WARNING: Istiod certificate in pod %s will expire in %.1f days",
pod.Name, timeUntilExpiry.Hours()/24))
}
}
return issues
}
func checkPodCertificates(clientset *kubernetes.Clientset, podName, namespace string) []string {
var issues []string
// Check if pod has istio-proxy container
pod, err := clientset.CoreV1().Pods(namespace).Get(context.TODO(), podName, metav1.GetOptions{})
if err != nil {
return issues
}
hasIstioProxy := false
for _, container := range pod.Spec.Containers {
if container.Name == "istio-proxy" {
hasIstioProxy = true
break
}
}
if !hasIstioProxy {
return issues
}
// Get certificate from istio-proxy container
certBytes, err := execInPod(clientset, podName, namespace, "cat /etc/certs/cert-chain.pem")
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to get certificate from pod %s/%s: %v", namespace, podName, err))
return issues
}
// Parse certificate
block, _ := pem.Decode(certBytes)
if block == nil {
issues = append(issues, fmt.Sprintf("Failed to decode PEM certificate from pod %s/%s", namespace, podName))
return issues
}
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to parse certificate from pod %s/%s: %v", namespace, podName, err))
return issues
}
// Check expiration
timeUntilExpiry := time.Until(cert.NotAfter)
if timeUntilExpiry < criticalCertLifetime {
issues = append(issues, fmt.Sprintf("CRITICAL: Certificate in pod %s/%s will expire in %.1f hours",
namespace, podName, timeUntilExpiry.Hours()))
} else if timeUntilExpiry < minCertLifetime {
issues = append(issues, fmt.Sprintf("WARNING: Certificate in pod %s/%s will expire in %.1f days",
namespace, podName, timeUntilExpiry.Hours()/24))
}
return issues
}
func checkCertificateRotationLogs(clientset *kubernetes.Clientset) []string {
var issues []string
// Get istiod pods
pods, err := clientset.CoreV1().Pods("istio-system").List(context.TODO(), metav1.ListOptions{
LabelSelector: "app=istiod",
})
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to list istiod pods: %v", err))
return issues
}
for _, pod := range pods.Items {
// Get logs from istiod pod
logs, err := getPodLogs(clientset, pod.Name, "istio-system")
if err != nil {
issues = append(issues, fmt.Sprintf("Failed to get logs from istiod pod %s: %v", pod.Name, err))
continue
}
// Check for certificate rotation errors
if containsString(logs, "Failed to sign CSR") {
issues = append(issues, fmt.Sprintf("Certificate rotation failures detected in istiod pod %s", pod.Name))
}
if containsString(logs, "Error rotating certificate") {
issues = append(issues, fmt.Sprintf("Certificate rotation errors detected in istiod pod %s", pod.Name))
}
if containsString(logs, "permission denied") && containsString(logs, "certificate") {
issues = append(issues, fmt.Sprintf("Certificate permission issues detected in istiod pod %s", pod.Name))
}
}
return issues
}
// Helper functions
func execInPod(clientset *kubernetes.Clientset, podName, namespace, command string) ([]byte, error) {
// Implementation omitted for brevity
// This would use the Kubernetes API to execute a command in a pod
return []byte{}, nil
}
func getPodLogs(clientset *kubernetes.Clientset, podName, namespace string) (string, error) {
// Implementation omitted for brevity
// This would use the Kubernetes API to get logs from a pod
return "", nil
}
func containsString(s, substr string) bool {
return strings.Contains(s, substr)
}
Lessons Learned:
Certificate management in service meshes requires proactive monitoring and alerting to prevent outages due to expiration.
How to Avoid:
Implement certificate expiration monitoring and alerting.
Configure proper logging for certificate rotation operations.
Regularly audit certificate management permissions.
Create automated tests to verify certificate rotation functionality.
Establish clear incident response procedures for certificate-related issues.
No summary provided
What Happened:
A large financial services company used Istio as their service mesh for securing service-to-service communication in their Kubernetes environment. All internal communication was encrypted using mTLS with certificates managed by the mesh's certificate authority. During a weekend, multiple services began experiencing connection failures, and by Monday morning, the entire platform was effectively down. Investigation revealed that the root certificates used by the service mesh had expired, and the automatic rotation mechanism had silently failed weeks earlier. The incident caused a complete production outage requiring manual intervention to restore service.
Diagnosis Steps:
Analyzed connection errors in service logs.
Examined certificate expiration dates across the mesh.
Reviewed certificate rotation logs and configuration.
Checked the status of the certificate authority components.
Tested certificate issuance in a controlled environment.
Root Cause:
The investigation revealed multiple issues with the certificate management: 1. The certificate rotation job had been failing silently for weeks 2. Monitoring for certificate expiration was not implemented 3. The certificate authority's storage was corrupted due to a previous incident 4. The mesh was configured with a short certificate lifetime but no safety margin 5. There was no documented procedure for manual certificate rotation
Fix/Workaround:
• Implemented immediate fix to restore service
• Generated new root certificates with extended validity
• Forced rotation of all service certificates
• Implemented monitoring for certificate expiration
• Created runbooks for manual certificate rotation
Lessons Learned:
Certificate management in service meshes requires robust monitoring, alerting, and fallback procedures.
How to Avoid:
Implement monitoring for certificate expiration with adequate warning time.
Configure certificate rotation with appropriate safety margins.
Test certificate rotation procedures regularly (see the sketch below).
Create documented procedures for manual certificate rotation.
Implement alerting for certificate rotation failures.
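As one way to exercise rotation procedures regularly, here is a hedged sketch of a scheduled check that fails when the mesh root certificate has less than a safety margin of validity left. The cacerts secret and root-cert.pem key follow the Istio plug-in CA convention, and the 30-day margin is an assumption to be tuned per environment.
```go
// root_cert_check.go - sketch of a scheduled check that fails when the mesh
// root certificate in the cacerts secret is close to expiring. The secret and
// key names follow the Istio plug-in CA convention; the margin is an assumption.
package main

import (
    "context"
    "crypto/x509"
    "encoding/pem"
    "log"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

const safetyMargin = 30 * 24 * time.Hour // assumed 30-day margin

func main() {
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Fatalf("failed to load in-cluster config: %v", err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatalf("failed to create Kubernetes client: %v", err)
    }

    // Read the plug-in CA secret and parse the root certificate.
    secret, err := clientset.CoreV1().Secrets("istio-system").Get(context.TODO(), "cacerts", metav1.GetOptions{})
    if err != nil {
        log.Fatalf("failed to read cacerts secret: %v", err)
    }
    block, _ := pem.Decode(secret.Data["root-cert.pem"])
    if block == nil {
        log.Fatal("root-cert.pem is missing or not valid PEM")
    }
    cert, err := x509.ParseCertificate(block.Bytes)
    if err != nil {
        log.Fatalf("failed to parse root certificate: %v", err)
    }

    // Fail the check (non-zero exit) if the remaining validity is too short.
    remaining := time.Until(cert.NotAfter)
    if remaining < safetyMargin {
        log.Fatalf("root certificate expires in %.1f days, below the %.0f-day safety margin",
            remaining.Hours()/24, safetyMargin.Hours()/24)
    }
    log.Printf("root certificate healthy: %.1f days of validity remaining", remaining.Hours()/24)
}
```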