# Network Security Scenarios
No summary provided
What Happened:
After implementing Kubernetes Network Policies to enhance security, several microservices began experiencing communication failures. Some services could not reach their dependencies despite policy configurations that appeared to allow the traffic.
Diagnosis Steps:
Examined the Network Policy definitions with kubectl get networkpolicies -A -o yaml.
Tested connectivity between pods using debug containers.
Analyzed Calico logs for policy enforcement decisions.
Reviewed service communication patterns and required access paths.
Checked for namespace isolation and cross-namespace policies.
Root Cause:
The Network Policies selected allowed sources by pod label, but some services communicated through the Kubernetes Service abstraction. The policies did not account for the fact that such traffic can be SNATed and arrive from the cluster's internal IP range rather than directly from the originating pod's IP, so the label-based rules never matched it.
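Before modifying the policies, it helps to confirm what the cluster's service CIDR actually is and which source addresses the backend sees. A minimal sketch, assuming a kubeadm-style cluster where the apiserver flags appear in the cluster-info dump and that the API service logs client addresses (both assumptions; managed clusters expose the CIDR differently):
# Find the service CIDR that the policies need to account for
kubectl cluster-info dump | grep -m 1 -- --service-cluster-ip-range
# From the backend's perspective, check which source IPs recent requests came from
# (deploy/api-service and the remote_addr log field are hypothetical examples)
kubectl -n backend logs deploy/api-service --tail=50 | grep -i 'remote_addr'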
Fix/Workaround:
• Short-term: Modified the Network Policies to allow traffic from the cluster's internal CIDR:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-access
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Allow traffic from frontend pods
    - namespaceSelector:
        matchLabels:
          name: frontend
      podSelector:
        matchLabels:
          app: web-ui
    # Allow traffic from Kubernetes Services
    - ipBlock:
        cidr: 10.96.0.0/12 # Cluster service CIDR
    ports:
    - protocol: TCP
      port: 8080
• Long-term: Implemented a more comprehensive network security model:
# Base deny-all policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny
namespace: backend
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
---
# Service-specific ingress policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-service-policy
namespace: backend
spec:
podSelector:
matchLabels:
app: api-service
policyTypes:
- Ingress
ingress:
- from:
# Allow traffic from specific namespaces
- namespaceSelector:
matchLabels:
name: frontend
# Allow traffic from specific pods
- namespaceSelector:
matchLabels:
name: backend
podSelector:
matchLabels:
app: auth-service
# Allow traffic from Kubernetes Services
- ipBlock:
cidr: 10.96.0.0/12
ports:
- protocol: TCP
port: 8080
---
# Service-specific egress policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-service-egress
namespace: backend
spec:
podSelector:
matchLabels:
app: api-service
policyTypes:
- Egress
egress:
- to:
# Allow traffic to database
- namespaceSelector:
matchLabels:
name: database
podSelector:
matchLabels:
app: postgres
ports:
- protocol: TCP
port: 5432
- to:
# Allow DNS resolution
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
• Added network policy validation in CI/CD:
#!/bin/bash
# validate_network_policies.sh
set -euo pipefail
# Function to check if a policy allows required communication
check_communication() {
local src_namespace=$1
local src_app=$2
local dst_namespace=$3
local dst_app=$4
local dst_port=$5
echo "Checking if $src_app in $src_namespace can access $dst_app in $dst_namespace on port $dst_port"
# Create test pods
kubectl run src-test --namespace=$src_namespace --labels=app=$src_app --image=busybox --restart=Never -- sleep 3600
kubectl run dst-test --namespace=$dst_namespace --labels=app=$dst_app --image=nginx --restart=Never --expose --port=$dst_port
# Wait for pods to be ready
kubectl wait --for=condition=Ready pod/src-test --namespace=$src_namespace --timeout=60s
kubectl wait --for=condition=Ready pod/dst-test --namespace=$dst_namespace --timeout=60s
# Get service IP
dst_svc_ip=$(kubectl get service dst-test --namespace=$dst_namespace -o jsonpath='{.spec.clusterIP}')
# Test connectivity
result=$(kubectl exec src-test --namespace=$src_namespace -- wget -T 5 -O- http://$dst_svc_ip:$dst_port 2>/dev/null || echo "FAILED")
# Clean up
kubectl delete pod src-test --namespace=$src_namespace
kubectl delete pod,service dst-test --namespace=$dst_namespace
if [[ $result == *"FAILED"* ]]; then
echo "❌ Communication test failed"
return 1
else
echo "✅ Communication test passed"
return 0
fi
}
# Run tests for critical communication paths
check_communication "frontend" "web-ui" "backend" "api-service" 8080
check_communication "backend" "api-service" "database" "postgres" 5432
check_communication "monitoring" "prometheus" "backend" "api-service" 9090
Lessons Learned:
Kubernetes Network Policies require careful consideration of all traffic patterns, including service abstractions.
How to Avoid:
Document all required communication paths before implementing network policies.
Test policies in a non-production environment first.
Implement policies incrementally, starting with monitoring mode.
Consider using a service mesh for more granular traffic control.
Regularly validate network policies against communication requirements.
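As a starting point for the last recommendation, a small script can flag namespaces that have no NetworkPolicy at all; a minimal sketch assuming only kubectl access (system namespaces may need to be excluded):
#!/usr/bin/env bash
# Flag namespaces without any NetworkPolicy, where pods are unrestricted by default.
set -euo pipefail
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  count=$(kubectl get networkpolicies -n "$ns" --no-headers 2>/dev/null | wc -l)
  if [ "$count" -eq 0 ]; then
    echo "WARNING: namespace $ns has no NetworkPolicy (pods are unrestricted)"
  fi
done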
No summary provided
What Happened:
During a security audit, the team discovered several overly permissive security group rules that exposed internal services to the public internet. Additionally, troubleshooting network connectivity issues had become extremely difficult due to the large number of overlapping and redundant rules.
Diagnosis Steps:
Exported all security group configurations with AWS CLI.
Analyzed ingress and egress rules for overly permissive settings.
Mapped security group dependencies and usage.
Reviewed CloudTrail logs for security group modifications.
Identified unused and redundant rules.
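The export and analysis steps above can be scripted; a rough sketch assuming AWS CLI v2 and jq are available:
# Export all security group configurations for offline analysis
aws ec2 describe-security-groups --output json > security-groups.json
# List rules open to the whole internet, with group ID, protocol and port range
jq -r '.SecurityGroups[]
  | .GroupId as $id
  | .IpPermissions[]
  | select(any(.IpRanges[]?; .CidrIp == "0.0.0.0/0"))
  | "\($id) \(.IpProtocol) \(.FromPort // "all")-\(.ToPort // "all")"' security-groups.json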
Root Cause:
The security groups had been managed manually over time, with engineers adding new rules as needed but rarely removing old ones. There was no process for reviewing security group changes, and infrastructure as code was not consistently used for network security configurations.
Fix/Workaround:
• Short-term: Removed the most critical overly permissive rules:
# Identify and remove dangerous rules
aws ec2 revoke-security-group-ingress \
--group-id sg-0123456789abcdef0 \
--protocol all \
--cidr 0.0.0.0/0
# Replace with more specific rules
aws ec2 authorize-security-group-ingress \
--group-id sg-0123456789abcdef0 \
--protocol tcp \
--port 443 \
--cidr 10.0.0.0/8
• Long-term: Implemented security groups as code with Terraform:
# Define security groups with clear naming and documentation
resource "aws_security_group" "web_tier" {
name = "web-tier-sg"
description = "Security group for web tier instances"
vpc_id = aws_vpc.main.id
tags = {
Name = "web-tier-sg"
Environment = "production"
ManagedBy = "terraform"
}
}
# Define ingress rules with clear purpose comments
resource "aws_security_group_rule" "web_tier_http" {
security_group_id = aws_security_group.web_tier.id
type = "ingress"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "Allow HTTP from internet for web traffic"
}
resource "aws_security_group_rule" "web_tier_https" {
security_group_id = aws_security_group.web_tier.id
type = "ingress"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "Allow HTTPS from internet for web traffic"
}
# Define app tier with reference to web tier for access
resource "aws_security_group" "app_tier" {
name = "app-tier-sg"
description = "Security group for application tier instances"
vpc_id = aws_vpc.main.id
tags = {
Name = "app-tier-sg"
Environment = "production"
ManagedBy = "terraform"
}
}
resource "aws_security_group_rule" "app_from_web" {
security_group_id = aws_security_group.app_tier.id
type = "ingress"
from_port = 8080
to_port = 8080
protocol = "tcp"
source_security_group_id = aws_security_group.web_tier.id
description = "Allow traffic from web tier to app API"
}
• Implemented automated security group auditing:
#!/usr/bin/env python3
# security_group_audit.py
import boto3
import json
import csv
from datetime import datetime
def audit_security_groups():
ec2 = boto3.client('ec2')
response = ec2.describe_security_groups()
risky_rules = []
unused_groups = []
# Get all network interfaces to check usage
eni_response = ec2.describe_network_interfaces()
used_sg_ids = set()
for eni in eni_response['NetworkInterfaces']:
for sg in eni['Groups']:
used_sg_ids.add(sg['GroupId'])
# Audit each security group
for sg in response['SecurityGroups']:
sg_id = sg['GroupId']
sg_name = sg['GroupName']
# Check if unused
if sg_id not in used_sg_ids and sg_name != 'default':
unused_groups.append({
'SecurityGroupId': sg_id,
'SecurityGroupName': sg_name,
'VpcId': sg.get('VpcId', 'N/A')
})
# Check for risky rules
for rule in sg.get('IpPermissions', []):
# Check for overly permissive rules
for ip_range in rule.get('IpRanges', []):
cidr = ip_range.get('CidrIp', '')
if cidr == '0.0.0.0/0':
from_port = rule.get('FromPort', -1)
to_port = rule.get('ToPort', -1)
protocol = rule.get('IpProtocol', '-1')
# All traffic or sensitive ports
if protocol == '-1' or from_port in [22, 3389] or (from_port <= 1024 and to_port >= 1024):
risky_rules.append({
'SecurityGroupId': sg_id,
'SecurityGroupName': sg_name,
'VpcId': sg.get('VpcId', 'N/A'),
'Protocol': protocol,
'PortRange': f"{from_port}-{to_port}" if from_port != -1 else "All",
'CidrIp': cidr,
'Description': ip_range.get('Description', 'No description')
})
# Write results to CSV files
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
with open(f'risky_rules_{timestamp}.csv', 'w', newline='') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=['SecurityGroupId', 'SecurityGroupName', 'VpcId', 'Protocol', 'PortRange', 'CidrIp', 'Description'])
writer.writeheader()
writer.writerows(risky_rules)
with open(f'unused_groups_{timestamp}.csv', 'w', newline='') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=['SecurityGroupId', 'SecurityGroupName', 'VpcId'])
writer.writeheader()
writer.writerows(unused_groups)
print(f"Found {len(risky_rules)} risky rules and {len(unused_groups)} unused security groups")
print(f"Results written to risky_rules_{timestamp}.csv and unused_groups_{timestamp}.csv")
if __name__ == "__main__":
audit_security_groups()
Lessons Learned:
Security group management requires a structured approach with regular auditing.
How to Avoid:
Manage all security groups through infrastructure as code.
Implement a review process for security group changes.
Regularly audit security groups for unused or overly permissive rules.
Use security group references instead of CIDR blocks where possible.
Document the purpose of each security group and rule.
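For the recommendation above to prefer security group references over CIDR blocks, the change can be made rule by rule with the AWS CLI; a sketch in which the group IDs and port are placeholders:
# Remove the CIDR-based rule
aws ec2 revoke-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 8080 \
  --cidr 10.0.0.0/8
# Replace it with a reference to the calling tier's security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 8080 \
  --source-group sg-0fedcba9876543210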
No summary provided
What Happened:
Security monitoring detected unusual network traffic between containers that should have been isolated by Kubernetes NetworkPolicy resources. Further investigation revealed that an attacker had exploited a zero-day vulnerability in the container runtime to bypass network isolation and move laterally between containers.
Diagnosis Steps:
Analyzed network traffic logs to identify the unusual communication patterns.
Reviewed Kubernetes NetworkPolicy configurations to confirm they were correctly defined.
Examined container runtime logs for suspicious activities.
Performed forensic analysis on affected containers.
Tested network isolation in a controlled environment to reproduce the issue.
Root Cause:
A zero-day vulnerability (CVE-2022-XXXXX) in the containerd runtime allowed processes with specific capabilities to manipulate network namespaces, effectively bypassing the network isolation enforced by Kubernetes NetworkPolicies. The vulnerability was present in containerd versions prior to 1.5.9.
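A quick way to see whether any nodes are still running a vulnerable runtime is to read the runtime version Kubernetes already reports for each node:
# Show the container runtime and version on every node
kubectl get nodes -o custom-columns='NODE:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion'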
Fix/Workaround:
• Short-term: Implemented additional network security layers:
# Istio AuthorizationPolicy for additional network security
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: strict-service-isolation
namespace: production
spec:
selector:
matchLabels:
app: critical-service
action: ALLOW
rules:
- from:
- source:
principals: ["cluster.local/ns/frontend/sa/frontend-service"]
to:
- operation:
methods: ["GET"]
paths: ["/api/v1/public/*"]
- from:
- source:
namespaces: ["monitoring"]
to:
- operation:
ports: ["9090"]
• Implemented host-level firewall rules as an additional layer of protection:
#!/bin/bash
# Additional host-level firewall rules
# Get all pod CIDRs
POD_CIDRS=$(kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}')
# Set up default deny rules for pod networks
for CIDR in $POD_CIDRS; do
iptables -A FORWARD -d $CIDR -j DROP
iptables -A FORWARD -s $CIDR -j DROP
done
# Allow specific pod-to-pod communication based on service requirements
# Format: source_namespace/source_service -> destination_namespace/destination_service
ALLOWED_ROUTES=(
"frontend/web-app:backend/api-service:tcp:8080"
"backend/api-service:database/postgres:tcp:5432"
"monitoring/prometheus:*/*:tcp:9090"
)
for ROUTE in "${ALLOWED_ROUTES[@]}"; do
SRC=$(echo $ROUTE | cut -d':' -f1)
DST=$(echo $ROUTE | cut -d':' -f2)
PROTO=$(echo $ROUTE | cut -d':' -f3)
PORT=$(echo $ROUTE | cut -d':' -f4)
SRC_NS=$(echo $SRC | cut -d'/' -f1)
SRC_SVC=$(echo $SRC | cut -d'/' -f2)
DST_NS=$(echo $DST | cut -d'/' -f1)
DST_SVC=$(echo $DST | cut -d'/' -f2)
# Get pod IPs for source and destination
if [ "$SRC_SVC" == "*" ]; then
SRC_IPS=$(kubectl get pods -n $SRC_NS -o jsonpath='{.items[*].status.podIP}')
else
SRC_IPS=$(kubectl get pods -n $SRC_NS -l app=$SRC_SVC -o jsonpath='{.items[*].status.podIP}')
fi
if [ "$DST_SVC" == "*" ]; then
DST_IPS=$(kubectl get pods -n $DST_NS -o jsonpath='{.items[*].status.podIP}')
else
DST_IPS=$(kubectl get pods -n $DST_NS -l app=$DST_SVC -o jsonpath='{.items[*].status.podIP}')
fi
# Create iptables rules for each source-destination pair.
# Insert the ACCEPT rules at the head of the chain so they are evaluated before
# the default-deny DROP rules appended above; appending them would leave them
# unreachable behind the DROP rules.
for SRC_IP in $SRC_IPS; do
for DST_IP in $DST_IPS; do
iptables -I FORWARD 1 -s $SRC_IP -d $DST_IP -p $PROTO --dport $PORT -j ACCEPT
done
done
done
# Save iptables rules
iptables-save > /etc/iptables/rules.v4
• Long-term: Upgraded containerd to the patched version and implemented a comprehensive container security strategy:
# Updated DaemonSet for containerd upgrade
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: containerd-upgrade
namespace: kube-system
spec:
selector:
matchLabels:
name: containerd-upgrade
template:
metadata:
labels:
name: containerd-upgrade
spec:
hostPID: true
hostNetwork: true
containers:
- name: containerd-upgrade
image: company/containerd-upgrade:1.5.9
securityContext:
privileged: true
volumeMounts:
- name: host-root
mountPath: /host
command:
- /bin/sh
- -c
- |
set -ex
# Backup current containerd
cp /host/usr/bin/containerd /host/usr/bin/containerd.bak
# Install new containerd
cp /usr/local/bin/containerd /host/usr/bin/containerd
# Restart containerd service
chroot /host systemctl restart containerd
# Verify upgrade
chroot /host containerd --version
# Keep the pod running for verification
sleep 3600
volumes:
- name: host-root
hostPath:
path: /
• Implemented a network security monitoring solution using eBPF:
// network_monitor.go
package main
import (
"bytes"
"context"
"encoding/binary"
"encoding/json"
"fmt"
"log"
"net"
"os"
"os/signal"
"time"
"github.com/cilium/ebpf/link"
"github.com/cilium/ebpf/perf"
"golang.org/x/sys/unix"
networkingv1 "k8s.io/api/networking/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang NetworkMonitor ./bpf/network_monitor.c -- -I./bpf/headers
type ConnectionEvent struct {
SrcIP [16]byte
DstIP [16]byte
SrcPort uint16
DstPort uint16
Protocol uint8
PID uint32
UID uint32
Allowed uint8
}
type ConnectionInfo struct {
SourceIP string `json:"source_ip"`
DestinationIP string `json:"destination_ip"`
SourcePort uint16 `json:"source_port"`
DestinationPort uint16 `json:"destination_port"`
Protocol string `json:"protocol"`
PID uint32 `json:"pid"`
UID uint32 `json:"uid"`
Allowed bool `json:"allowed"`
Timestamp time.Time `json:"timestamp"`
SourcePod string `json:"source_pod,omitempty"`
SourceNamespace string `json:"source_namespace,omitempty"`
DestPod string `json:"destination_pod,omitempty"`
DestNamespace string `json:"destination_namespace,omitempty"`
ViolatesPolicy bool `json:"violates_policy,omitempty"`
}
func main() {
// Load pre-compiled BPF program
objs := NetworkMonitorObjects{}
if err := LoadNetworkMonitorObjects(&objs, nil); err != nil {
log.Fatalf("loading objects: %v", err)
}
defer objs.Close()
// Attach to tracepoints
tcpConnect, err := link.Tracepoint("sock", "inet_sock_set_state", objs.TraceTcpConnect)
if err != nil {
log.Fatalf("opening tracepoint: %v", err)
}
defer tcpConnect.Close()
udpSendmsg, err := link.Tracepoint("sock", "inet_sock_set_state", objs.TraceUdpSendmsg)
if err != nil {
log.Fatalf("opening tracepoint: %v", err)
}
defer udpSendmsg.Close()
// Set up perf buffer reader
rd, err := perf.NewReader(objs.Events, 4096)
if err != nil {
log.Fatalf("creating perf reader: %v", err)
}
defer rd.Close()
// Set up Kubernetes client
k8sClient, err := getKubernetesClient()
if err != nil {
log.Printf("Warning: Failed to create Kubernetes client: %v", err)
}
// Set up signal handler for clean exit
sig := make(chan os.Signal, 1)
signal.Notify(sig, os.Interrupt, unix.SIGTERM)
// Process events
go processEvents(rd, k8sClient)
<-sig
log.Println("Received signal, exiting...")
}
func processEvents(rd *perf.Reader, k8sClient *kubernetes.Clientset) {
for {
record, err := rd.Read()
if err != nil {
if err == perf.ErrClosed {
return
}
log.Printf("Error reading from perf buffer: %v", err)
continue
}
if record.LostSamples != 0 {
log.Printf("Lost %d samples", record.LostSamples)
continue
}
var event ConnectionEvent
if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &event); err != nil {
log.Printf("Error parsing event: %v", err)
continue
}
// Convert event to connection info
connInfo := convertEvent(event)
// Enrich with Kubernetes metadata if available
if k8sClient != nil {
enrichWithK8sMetadata(k8sClient, &connInfo)
checkPolicyViolation(k8sClient, &connInfo)
}
// Log the connection
logConnection(connInfo)
}
}
func convertEvent(event ConnectionEvent) ConnectionInfo {
srcIP := net.IP(event.SrcIP[:])
dstIP := net.IP(event.DstIP[:])
// Trim trailing zeros for IPv4 addresses
if srcIP.To4() != nil {
srcIP = srcIP.To4()
}
if dstIP.To4() != nil {
dstIP = dstIP.To4()
}
protocol := "unknown"
switch event.Protocol {
case 6:
protocol = "TCP"
case 17:
protocol = "UDP"
}
return ConnectionInfo{
SourceIP: srcIP.String(),
DestinationIP: dstIP.String(),
SourcePort: event.SrcPort,
DestinationPort: event.DstPort,
Protocol: protocol,
PID: event.PID,
UID: event.UID,
Allowed: event.Allowed != 0,
Timestamp: time.Now(),
}
}
func getKubernetesClient() (*kubernetes.Clientset, error) {
// Try in-cluster config first
config, err := rest.InClusterConfig()
if err != nil {
return nil, fmt.Errorf("failed to create in-cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
return nil, fmt.Errorf("failed to create Kubernetes client: %v", err)
}
return clientset, nil
}
func enrichWithK8sMetadata(client *kubernetes.Clientset, connInfo *ConnectionInfo) {
// Get all pods in all namespaces
pods, err := client.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{})
if err != nil {
log.Printf("Error listing pods: %v", err)
return
}
// Find source pod
for _, pod := range pods.Items {
if pod.Status.PodIP == connInfo.SourceIP {
connInfo.SourcePod = pod.Name
connInfo.SourceNamespace = pod.Namespace
break
}
}
// Find destination pod
for _, pod := range pods.Items {
if pod.Status.PodIP == connInfo.DestinationIP {
connInfo.DestPod = pod.Name
connInfo.DestNamespace = pod.Namespace
break
}
}
}
func checkPolicyViolation(client *kubernetes.Clientset, connInfo *ConnectionInfo) {
// Skip if we don't have pod information
if connInfo.SourcePod == "" || connInfo.DestPod == "" {
return
}
// Get network policies in the destination namespace
netpols, err := client.NetworkingV1().NetworkPolicies(connInfo.DestNamespace).List(context.Background(), metav1.ListOptions{})
if err != nil {
log.Printf("Error listing network policies: %v", err)
return
}
// Check if the connection is allowed by any network policy
allowed := false
for _, netpol := range netpols.Items {
// Check if the policy applies to the destination pod
if !podMatchesSelector(client, connInfo.DestNamespace, connInfo.DestPod, netpol.Spec.PodSelector) {
continue
}
// Check ingress rules
for _, ingressRule := range netpol.Spec.Ingress {
// Check if the source pod is allowed
if podIsAllowedByIngress(client, connInfo.SourceNamespace, connInfo.SourcePod, ingressRule) {
// Check if the port is allowed
if portIsAllowed(connInfo.DestinationPort, connInfo.Protocol, ingressRule) {
allowed = true
break
}
}
}
if allowed {
break
}
}
// If no network policies apply to the destination pod, traffic is allowed by default
if len(netpols.Items) == 0 {
allowed = true
}
// Mark as violation if not allowed but the connection was established
if !allowed && connInfo.Allowed {
connInfo.ViolatesPolicy = true
}
}
func podMatchesSelector(client *kubernetes.Clientset, namespace, podName string, selector metav1.LabelSelector) bool {
// Implementation of pod label matching logic
return true // Simplified for this example
}
func podIsAllowedByIngress(client *kubernetes.Clientset, namespace, podName string, ingressRule networkingv1.NetworkPolicyIngressRule) bool {
// Implementation of ingress rule matching logic
return true // Simplified for this example
}
func portIsAllowed(port uint16, protocol string, ingressRule networkingv1.NetworkPolicyIngressRule) bool {
// Implementation of port matching logic
return true // Simplified for this example
}
func logConnection(connInfo ConnectionInfo) {
// Convert to JSON for structured logging
jsonData, err := json.Marshal(connInfo)
if err != nil {
log.Printf("Error marshaling connection info: %v", err)
return
}
// Log the connection
fmt.Println(string(jsonData))
// Alert on policy violations
if connInfo.ViolatesPolicy {
log.Printf("ALERT: Network policy violation detected: %s:%d -> %s:%d (%s)",
connInfo.SourceIP, connInfo.SourcePort,
connInfo.DestinationIP, connInfo.DestinationPort,
connInfo.Protocol)
}
}
• Implemented a comprehensive container security policy:
# PodSecurityPolicy to restrict container capabilities
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: restricted
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'persistentVolumeClaim'
hostNetwork: false
hostIPC: false
hostPID: false
runAsUser:
rule: 'MustRunAsNonRoot'
seLinux:
rule: 'RunAsAny'
supplementalGroups:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
fsGroup:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
readOnlyRootFilesystem: true
Lessons Learned:
Container runtime vulnerabilities can bypass Kubernetes network security controls.
How to Avoid:
Keep container runtimes updated with security patches.
Implement defense-in-depth with multiple layers of network security.
Use runtime security monitoring to detect unusual network activity.
Apply the principle of least privilege to container workloads.
Regularly audit and test network isolation between containers.
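For the last point, isolation can be spot-checked from a throwaway pod; a minimal sketch that assumes the namespace and service names from this scenario and expects the connection to be blocked:
# Try to reach the database directly from the frontend namespace
# (postgres.database.svc.cluster.local is a hypothetical service name)
kubectl run isolation-probe -n frontend --image=busybox --restart=Never --rm -i -- \
  nc -z -w 5 postgres.database.svc.cluster.local 5432 \
  && echo "UNEXPECTED: frontend can reach the database directly" \
  || echo "OK: connection blocked as expected"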
No summary provided
What Happened:
During a routine security audit, it was discovered that a misconfigured security group in AWS allowed unrestricted access to a database containing sensitive customer information. The misconfiguration went unnoticed for several weeks, during which unauthorized access was detected.
Diagnosis Steps:
Reviewed AWS security group configurations and audit logs.
Analyzed VPC flow logs for unusual traffic patterns.
Conducted a security scan to identify open ports and services.
Examined application logs for unauthorized access attempts.
Reviewed recent changes to security group rules and IAM policies.
Root Cause:
The security group associated with the database was configured with overly permissive rules:
1. An inbound rule allowed traffic from any IP address on port 5432 (PostgreSQL).
2. A recent change to the security group was not reviewed or approved by the security team.
3. Lack of monitoring and alerting for changes to security group configurations.
4. Insufficient network segmentation and isolation of sensitive resources.
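When an unreviewed change like this is found, CloudTrail can show who made it and when; a sketch with a placeholder group ID:
# List recent API events that touched the security group
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-0123456789abcdef0 \
  --max-results 20 \
  --query 'Events[].{Time:EventTime,Name:EventName,User:Username}' \
  --output table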
Fix/Workaround:
• Short-term: Restrict security group rules and implement monitoring:
# Security group configuration
Resources:
MyDatabaseSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: "Security group for RDS database"
VpcId: !Ref MyVPC
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 5432
ToPort: 5432
CidrIp: 10.0.0.0/16 # Restrict to internal VPC range
SecurityGroupEgress:
- IpProtocol: -1
CidrIp: 0.0.0.0/0
• Implemented AWS Config rules for security group compliance:
// AWS Config rule for security group compliance
{
"ConfigRuleName": "restricted-security-groups",
"Description": "Ensure security groups do not allow unrestricted access to sensitive ports",
"Scope": {
"ComplianceResourceTypes": [
"AWS::EC2::SecurityGroup"
]
},
"Source": {
"Owner": "AWS",
"SourceIdentifier": "INCOMING_SSH_DISABLED"
},
"InputParameters": {
"cidrIp": "0.0.0.0/0",
"port": "5432"
}
}
• Long-term: Implemented a comprehensive network security strategy:
// network_security_monitor.go
package main
import (
"context"
"fmt"
"log"
"os"
"time"
awsconfig "github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/ec2"
"github.com/aws/aws-sdk-go-v2/service/guardduty"
"github.com/aws/aws-sdk-go-v2/service/securityhub"
"github.com/aws/aws-sdk-go-v2/service/sns"
"gopkg.in/yaml.v3"
)
type SecurityConfig struct {
AWS struct {
Region string `yaml:"region"`
SNSTopicARN string `yaml:"snsTopicArn"`
GuardDutyDetectorID string `yaml:"guardDutyDetectorId"`
} `yaml:"aws"`
}
func main() {
// Load configuration
configFile, err := os.ReadFile("security_config.yaml")
if err != nil {
log.Fatalf("Failed to read config file: %v", err)
}
var config SecurityConfig
if err := yaml.Unmarshal(configFile, &config); err != nil {
log.Fatalf("Failed to parse config: %v", err)
}
// Initialize AWS SDK
cfg, err := awsconfig.LoadDefaultConfig(context.TODO(), awsconfig.WithRegion(config.AWS.Region))
if err != nil {
log.Fatalf("Failed to load AWS config: %v", err)
}
// Create EC2 client
ec2Client := ec2.NewFromConfig(cfg)
// Create GuardDuty client
guardDutyClient := guardduty.NewFromConfig(cfg)
// Create SecurityHub client
securityHubClient := securityhub.NewFromConfig(cfg)
// Create SNS client
snsClient := sns.NewFromConfig(cfg)
// Monitor security groups
go monitorSecurityGroups(ec2Client, snsClient, config.AWS.SNSTopicARN)
// Monitor GuardDuty findings
go monitorGuardDutyFindings(guardDutyClient, snsClient, config.AWS.SNSTopicARN, config.AWS.GuardDutyDetectorID)
// Monitor SecurityHub findings
go monitorSecurityHubFindings(securityHubClient, snsClient, config.AWS.SNSTopicARN)
// Keep the main thread running
select {}
}
func monitorSecurityGroups(client *ec2.Client, snsClient *sns.Client, topicARN string) {
for {
// Describe security groups
output, err := client.DescribeSecurityGroups(context.TODO(), &ec2.DescribeSecurityGroupsInput{})
if err != nil {
log.Printf("Failed to describe security groups: %v", err)
time.Sleep(5 * time.Minute)
continue
}
// Check for overly permissive rules
for _, sg := range output.SecurityGroups {
for _, perm := range sg.IpPermissions {
for _, ipRange := range perm.IpRanges {
if *ipRange.CidrIp == "0.0.0.0/0" {
// Send alert
message := fmt.Sprintf("Security group %s (%s) has overly permissive rule: %s %d-%d %s",
*sg.GroupId, *sg.GroupName, *perm.IpProtocol, *perm.FromPort, *perm.ToPort, *ipRange.CidrIp)
sendAlert(snsClient, topicARN, message)
}
}
}
}
// Sleep before next check
time.Sleep(1 * time.Hour)
}
}
func monitorGuardDutyFindings(client *guardduty.Client, snsClient *sns.Client, topicARN, detectorID string) {
for {
// List GuardDuty findings
output, err := client.ListFindings(context.TODO(), &guardduty.ListFindingsInput{
DetectorId: &detectorID,
})
if err != nil {
log.Printf("Failed to list GuardDuty findings: %v", err)
time.Sleep(5 * time.Minute)
continue
}
// Describe findings
findings, err := client.GetFindings(context.TODO(), &guardduty.GetFindingsInput{
DetectorId: &detectorID,
FindingIds: output.FindingIds,
})
if err != nil {
log.Printf("Failed to get GuardDuty findings: %v", err)
time.Sleep(5 * time.Minute)
continue
}
// Send alerts for high severity findings
for _, finding := range findings.Findings {
if *finding.Severity >= 7.0 {
message := fmt.Sprintf("GuardDuty finding: %s - %s (Severity: %.1f)",
*finding.Title, *finding.Description, *finding.Severity)
sendAlert(snsClient, topicARN, message)
}
}
// Sleep before next check
time.Sleep(1 * time.Hour)
}
}
func monitorSecurityHubFindings(client *securityhub.Client, snsClient *sns.Client, topicARN string) {
for {
// List SecurityHub findings
output, err := client.GetFindings(context.TODO(), &securityhub.GetFindingsInput{})
if err != nil {
log.Printf("Failed to get SecurityHub findings: %v", err)
time.Sleep(5 * time.Minute)
continue
}
// Send alerts for critical findings
for _, finding := range output.Findings {
if *finding.Severity.Label == "CRITICAL" {
message := fmt.Sprintf("SecurityHub finding: %s - %s (Severity: %s)",
*finding.Title, *finding.Description, *finding.Severity.Label)
sendAlert(snsClient, topicARN, message)
}
}
// Sleep before next check
time.Sleep(1 * time.Hour)
}
}
func sendAlert(client *sns.Client, topicARN, message string) {
_, err := client.Publish(context.TODO(), &sns.PublishInput{
TopicArn: &topicARN,
Message: &message,
})
if err != nil {
log.Printf("Failed to send alert: %v", err)
}
}
• Created a network security checklist and runbook:
# Network Security Runbook: AWS Environment
## Security Group Management
### 1. Security Group Configuration
- [ ] Ensure security groups are configured with least privilege
- [ ] Restrict inbound traffic to known IP ranges
- [ ] Use VPC peering or VPN for secure internal communication
- [ ] Regularly review and update security group rules
- [ ] Implement AWS Config rules to monitor security group compliance
### 2. Monitoring and Alerting
- [ ] Enable VPC flow logs for all VPCs
- [ ] Set up CloudWatch alarms for unusual traffic patterns
- [ ] Use GuardDuty for threat detection and alerting
- [ ] Integrate SecurityHub for centralized security findings
- [ ] Configure SNS for alert notifications
### 3. Incident Response
- [ ] Define incident response procedures for security breaches
- [ ] Conduct regular security drills and simulations
- [ ] Maintain an up-to-date contact list for incident response team
- [ ] Document and review all security incidents
## Network Segmentation
### 1. VPC Design
- [ ] Design VPCs with appropriate subnets for public and private resources
- [ ] Use network ACLs to control traffic between subnets
- [ ] Implement VPC peering for secure cross-VPC communication
- [ ] Use Transit Gateway for centralized network management
### 2. Isolation of Sensitive Resources
- [ ] Isolate sensitive resources in dedicated VPCs or subnets
- [ ] Use security groups and network ACLs to restrict access
- [ ] Implement bastion hosts for secure access to private resources
- [ ] Use AWS PrivateLink for secure access to AWS services
## Compliance and Auditing
### 1. Compliance Monitoring
- [ ] Enable AWS Config for continuous compliance monitoring
- [ ] Use AWS Audit Manager for compliance assessments
- [ ] Regularly review compliance reports and address findings
- [ ] Implement automated remediation for common compliance issues
### 2. Auditing and Logging
- [ ] Enable CloudTrail for all AWS accounts
- [ ] Use AWS CloudWatch Logs for centralized log management
- [ ] Implement log retention policies according to compliance requirements
- [ ] Regularly review and analyze logs for security events
## Rollback Plan
### Triggers for Rollback
- Unauthorized access detected
- Critical security vulnerability identified
- Compliance violation discovered
### Rollback Procedure
1. Revoke all access to affected resources
2. Restore security group configurations from backup
3. Conduct a security review and implement additional controls
4. Notify all stakeholders of rollback and remediation actions
Lessons Learned:
Network security requires continuous monitoring and adherence to best practices to prevent unauthorized access.
How to Avoid:
Implement least privilege access for security groups.
Regularly review and update security configurations.
Use automated tools for monitoring and alerting.
Conduct regular security audits and drills.
Maintain a comprehensive network security runbook.
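A simple recurring check for the specific failure in this incident (a database port open to the internet) can use the documented describe-security-groups filters:
# Find security groups that expose PostgreSQL to 0.0.0.0/0
aws ec2 describe-security-groups \
  --filters Name=ip-permission.cidr,Values=0.0.0.0/0 Name=ip-permission.from-port,Values=5432 \
  --query 'SecurityGroups[].{Id:GroupId,Name:GroupName}' \
  --output table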
No summary provided
What Happened:
A security monitoring system triggered alerts about unexpected network traffic between containers in different namespaces that should have been isolated by network policies. The traffic was detected between a frontend application and a database that should only be accessible through an API service. This bypass of the intended security architecture raised concerns about potential data exfiltration or lateral movement by an attacker.
Diagnosis Steps:
Analyzed network flow logs to identify the specific pods involved.
Reviewed Kubernetes network policies applied to the affected namespaces.
Examined pod labels and selectors used in network policies.
Tested network connectivity between pods using debugging tools.
Reviewed recent changes to network policies and pod deployments.
Root Cause:
The investigation revealed multiple issues:
1. A recent deployment introduced pods with incorrect labels that didn't match network policy selectors.
2. Some network policies were using overly permissive selectors.
3. The Calico network policy controller had a configuration issue causing delayed policy enforcement.
4. A custom admission controller that should validate network policy compliance was bypassed during an emergency deployment.
5. The monitoring system detected the issue, but alerting thresholds were set too high, delaying notification.
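A quick way to surface the first two issues is to put the labels pods actually carry next to the selectors the policies expect; a sketch using the namespaces from this incident:
# Labels the frontend pods actually carry
kubectl get pods -n frontend --show-labels
# Selectors the database-side policies expect
kubectl get networkpolicies -n database \
  -o custom-columns='POLICY:.metadata.name,POD_SELECTOR:.spec.podSelector'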
Fix/Workaround:
• Short-term: Implemented immediate fixes to correct pod labels and network policies:
# Before: Problematic pod deployment with incorrect labels
apiVersion: apps/v1
kind: Deployment
metadata:
name: frontend-app
namespace: frontend
spec:
replicas: 3
selector:
matchLabels:
app: frontend
tier: web
template:
metadata:
labels:
app: frontend
# Missing tier label that network policies depend on
spec:
containers:
- name: frontend
image: frontend:v1.2.3
ports:
- containerPort: 80
# After: Corrected pod deployment with proper labels
apiVersion: apps/v1
kind: Deployment
metadata:
name: frontend-app
namespace: frontend
spec:
replicas: 3
selector:
matchLabels:
app: frontend
tier: web
template:
metadata:
labels:
app: frontend
tier: web # Added missing label
spec:
containers:
- name: frontend
image: frontend:v1.2.3
ports:
- containerPort: 80
• Fixed overly permissive network policies:
# Before: Overly permissive network policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-access-policy
  namespace: database
spec:
  podSelector:
    matchLabels:
      app: postgres
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          environment: production
      # Missing podSelector makes this allow all pods in production namespaces
    ports:
    - protocol: TCP
      port: 5432
# After: Properly restricted network policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-access-policy
  namespace: database
spec:
  podSelector:
    matchLabels:
      app: postgres
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          environment: production
      podSelector:
        matchLabels:
          tier: api
          app: backend
    ports:
    - protocol: TCP
      port: 5432
• Implemented a Calico GlobalNetworkPolicy for defense-in-depth:
# Additional Calico GlobalNetworkPolicy for defense-in-depth
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
name: default-deny-between-namespaces
spec:
tier: default
order: 100
selector: all()
types:
- Ingress
- Egress
ingress:
- action: Deny
source:
namespaces:
notIn: ["kube-system", "calico-system"]
notSelector: node-role.kubernetes.io/control-plane == 'true'
destination:
namespaces:
notIn: ["kube-system", "calico-system"]
notSelector: node-role.kubernetes.io/control-plane == 'true'
- action: Pass
egress:
- action: Pass
• Implemented a network policy validation webhook in Go:
// networkpolicy_validator.go
package main
import (
"context"
"encoding/json"
"fmt"
"io/ioutil"
"log"
"net/http"
"strings"
admissionv1 "k8s.io/api/admission/v1"
appsv1 "k8s.io/api/apps/v1"
networkingv1 "k8s.io/api/networking/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/labels"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/runtime/serializer"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
)
var (
runtimeScheme = runtime.NewScheme()
codecs = serializer.NewCodecFactory(runtimeScheme)
deserializer = codecs.UniversalDeserializer()
)
type WebhookServer struct {
clientset *kubernetes.Clientset
}
// Validate if a deployment has all required labels for network policies
func (whsvr *WebhookServer) validateDeployment(deployment *appsv1.Deployment) (bool, string) {
// Get the pod template labels
podLabels := deployment.Spec.Template.Labels
// Check if required labels exist
requiredLabels := []string{"app", "tier"}
missingLabels := []string{}
for _, label := range requiredLabels {
if _, exists := podLabels[label]; !exists {
missingLabels = append(missingLabels, label)
}
}
if len(missingLabels) > 0 {
return false, fmt.Sprintf("Deployment is missing required labels for network policies: %s", strings.Join(missingLabels, ", "))
}
// Check if the namespace has network policies that would apply to this deployment
netpols, err := whsvr.clientset.NetworkingV1().NetworkPolicies(deployment.Namespace).List(context.TODO(), metav1.ListOptions{})
if err != nil {
log.Printf("Error checking network policies: %v", err)
return true, "Warning: Could not verify network policy coverage"
}
// If there are no network policies in the namespace, warn about it
if len(netpols.Items) == 0 {
return true, "Warning: No network policies found in namespace. Pods will have unrestricted network access."
}
// Check if any network policy would select this deployment's pods
covered := false
for _, netpol := range netpols.Items {
selector, err := metav1.LabelSelectorAsSelector(&netpol.Spec.PodSelector)
if err != nil {
continue
}
// Create a set of labels from the pod template
podLabelSet := make(map[string]string)
for k, v := range podLabels {
podLabelSet[k] = v
}
// Check if the selector would match these labels
if selector.Empty() || selector.Matches(labels.Set(podLabelSet)) {
covered = true
break
}
}
if !covered {
return true, "Warning: Deployment pods are not covered by any network policy in the namespace"
}
return true, ""
}
// Validate if a network policy is properly restrictive
func (whsvr *WebhookServer) validateNetworkPolicy(netpol *networkingv1.NetworkPolicy) (bool, string) {
// Check for overly permissive ingress rules
for _, ingress := range netpol.Spec.Ingress {
if len(ingress.From) == 0 {
return false, "Network policy has an ingress rule that allows traffic from all sources"
}
for _, from := range ingress.From {
// Check for rules with namespaceSelector but no podSelector
if from.NamespaceSelector != nil && from.PodSelector == nil {
return false, "Network policy has an overly permissive ingress rule: namespaceSelector without podSelector allows all pods in the selected namespaces"
}
}
}
// Check for overly permissive egress rules
for _, egress := range netpol.Spec.Egress {
if len(egress.To) == 0 {
return false, "Network policy has an egress rule that allows traffic to all destinations"
}
}
return true, ""
}
// Main validation function
func (whsvr *WebhookServer) validate(ar *admissionv1.AdmissionReview) *admissionv1.AdmissionResponse {
req := ar.Request
// Determine the object type and validate accordingly
var (
valid bool
msg string
)
switch req.Kind.Kind {
case "Deployment":
var deployment appsv1.Deployment
if err := json.Unmarshal(req.Object.Raw, &deployment); err != nil {
return &admissionv1.AdmissionResponse{
Result: &metav1.Status{
Message: err.Error(),
},
Allowed: false,
}
}
valid, msg = whsvr.validateDeployment(&deployment)
case "NetworkPolicy":
var netpol networkingv1.NetworkPolicy
if err := json.Unmarshal(req.Object.Raw, &netpol); err != nil {
return &admissionv1.AdmissionResponse{
Result: &metav1.Status{
Message: err.Error(),
},
Allowed: false,
}
}
valid, msg = whsvr.validateNetworkPolicy(&netpol)
default:
// Skip validation for other types
return &admissionv1.AdmissionResponse{Allowed: true}
}
if valid {
// If there's a warning message but it's still valid
if msg != "" {
return &admissionv1.AdmissionResponse{
Allowed: true,
Warnings: []string{msg},
}
}
return &admissionv1.AdmissionResponse{Allowed: true}
}
return &admissionv1.AdmissionResponse{
Result: &metav1.Status{
Message: msg,
},
Allowed: false,
}
}
// Serve HTTP
func (whsvr *WebhookServer) serve(w http.ResponseWriter, r *http.Request) {
var body []byte
if r.Body != nil {
if data, err := ioutil.ReadAll(r.Body); err == nil {
body = data
}
}
// Verify the content type is accurate
contentType := r.Header.Get("Content-Type")
if contentType != "application/json" {
log.Printf("Content-Type=%s, expected application/json", contentType)
http.Error(w, "invalid Content-Type, expected application/json", http.StatusUnsupportedMediaType)
return
}
var admissionResponse *admissionv1.AdmissionResponse
ar := admissionv1.AdmissionReview{}
if _, _, err := deserializer.Decode(body, nil, &ar); err != nil {
log.Printf("Can't decode body: %v", err)
admissionResponse = &admissionv1.AdmissionResponse{
Result: &metav1.Status{
Message: err.Error(),
},
Allowed: false,
}
} else {
admissionResponse = whsvr.validate(&ar)
}
admissionReview := admissionv1.AdmissionReview{
TypeMeta: metav1.TypeMeta{
APIVersion: "admission.k8s.io/v1",
Kind: "AdmissionReview",
},
}
if admissionResponse != nil {
admissionReview.Response = admissionResponse
if ar.Request != nil {
admissionReview.Response.UID = ar.Request.UID
}
}
resp, err := json.Marshal(admissionReview)
if err != nil {
log.Printf("Can't encode response: %v", err)
http.Error(w, fmt.Sprintf("could not encode response: %v", err), http.StatusInternalServerError)
}
log.Printf("Ready to write response...")
if _, err := w.Write(resp); err != nil {
log.Printf("Can't write response: %v", err)
http.Error(w, fmt.Sprintf("could not write response: %v", err), http.StatusInternalServerError)
}
}
func main() {
// Create Kubernetes client
config, err := rest.InClusterConfig()
if err != nil {
log.Fatalf("Error getting cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Error creating Kubernetes client: %v", err)
}
whsvr := &WebhookServer{
clientset: clientset,
}
// Define HTTP server and routes
mux := http.NewServeMux()
mux.HandleFunc("/validate", whsvr.serve)
server := &http.Server{
Addr: ":8443",
Handler: mux,
}
log.Printf("Starting webhook server on :8443")
if err := server.ListenAndServeTLS("/etc/webhook/certs/tls.crt", "/etc/webhook/certs/tls.key"); err != nil {
log.Fatalf("Error starting server: %v", err)
}
}
• Implemented a network policy auditing tool in Rust:
// network_policy_auditor.rs
use anyhow::{anyhow, Context, Result};
use futures::StreamExt;
use k8s_openapi::api::networking::v1::{NetworkPolicy, NetworkPolicySpec};
use kube::{
api::{Api, ListParams, ResourceExt},
client::Client,
runtime::watcher,
};
use serde::Serialize;
use std::{collections::HashMap, sync::Arc, time::Duration};
use tokio::sync::Mutex;
#[derive(Debug, Serialize)]
struct NetworkPolicyAudit {
namespace: String,
name: String,
issues: Vec<String>,
severity: String,
}
#[derive(Debug, Serialize)]
struct NamespaceAudit {
namespace: String,
has_default_deny: bool,
has_any_policy: bool,
pod_coverage_percentage: f64,
}
#[derive(Debug, Default)]
struct AuditState {
namespace_audits: HashMap<String, NamespaceAudit>,
policy_audits: Vec<NetworkPolicyAudit>,
}
async fn audit_network_policies(client: Client) -> Result<AuditState> {
let mut state = AuditState::default();
let netpols: Api<NetworkPolicy> = Api::all(client.clone());
// Get all network policies
let policies = netpols.list(&ListParams::default()).await?;
// Track namespaces with policies
let mut namespaces_with_policies = HashMap::new();
// Audit each policy
for policy in policies {
let ns = policy.namespace().unwrap_or_else(|| "default".to_string());
let name = policy.name_any();
// Track namespaces with policies
namespaces_with_policies
.entry(ns.clone())
.or_insert_with(Vec::new)
.push(policy.clone());
// Audit the policy
let mut issues = Vec::new();
let mut severity = "low";
// Check for overly permissive ingress rules
if let Some(spec) = &policy.spec {
if let Some(ingress) = &spec.ingress {
for (i, rule) in ingress.iter().enumerate() {
if rule.from.is_none() || rule.from.as_ref().unwrap().is_empty() {
issues.push(format!("Ingress rule #{} allows traffic from all sources", i+1));
severity = "high";
} else {
for (j, from) in rule.from.as_ref().unwrap().iter().enumerate() {
if from.namespace_selector.is_some() && from.pod_selector.is_none() {
issues.push(format!(
"Ingress rule #{}, from #{} has namespaceSelector without podSelector",
i+1, j+1
));
severity = "medium";
}
}
}
}
}
// Check for overly permissive egress rules
if let Some(egress) = &spec.egress {
for (i, rule) in egress.iter().enumerate() {
if rule.to.is_none() || rule.to.as_ref().unwrap().is_empty() {
issues.push(format!("Egress rule #{} allows traffic to all destinations", i+1));
severity = "high";
}
}
}
// Check if this is a default deny policy
let is_default_deny = match (&spec.pod_selector, &spec.ingress, &spec.egress, &spec.policy_types) {
// Empty pod selector with no ingress/egress rules and deny policy types
(selector, None, None, Some(types)) if selector.match_labels.is_none() && selector.match_expressions.is_none() => {
types.contains(&"Ingress".to_string()) || types.contains(&"Egress".to_string())
}
// Empty pod selector with empty ingress/egress rules and deny policy types
(selector, Some(ingress), Some(egress), Some(types))
if selector.match_labels.is_none() && selector.match_expressions.is_none()
&& ingress.is_empty() && egress.is_empty() => {
types.contains(&"Ingress".to_string()) || types.contains(&"Egress".to_string())
}
_ => false,
};
if is_default_deny {
// Update namespace audit
state.namespace_audits.entry(ns.clone()).or_insert_with(|| NamespaceAudit {
namespace: ns.clone(),
has_default_deny: true,
has_any_policy: true,
pod_coverage_percentage: 0.0,
}).has_default_deny = true;
}
}
// Add to audit results if there are issues
if !issues.is_empty() {
state.policy_audits.push(NetworkPolicyAudit {
namespace: ns,
name,
issues,
severity: severity.to_string(),
});
}
}
// Get all namespaces (use a typed Api; the kube Client has no list_core_v1_namespace helper)
let namespaces = Api::<k8s_openapi::api::core::v1::Namespace>::all(client.clone())
.list(&ListParams::default())
.await?
.items;
// Audit each namespace
for ns in namespaces {
let ns_name = ns.metadata.name.unwrap_or_else(|| "default".to_string());
// Skip system namespaces
if ns_name.starts_with("kube-") || ns_name == "calico-system" {
continue;
}
let has_any_policy = namespaces_with_policies.contains_key(&ns_name);
let has_default_deny = state.namespace_audits
.get(&ns_name)
.map(|audit| audit.has_default_deny)
.unwrap_or(false);
// Calculate pod coverage
let pod_coverage = calculate_pod_coverage(&client, &ns_name, &namespaces_with_policies).await?;
// Update or create namespace audit
state.namespace_audits.insert(ns_name.clone(), NamespaceAudit {
namespace: ns_name,
has_default_deny,
has_any_policy,
pod_coverage_percentage: pod_coverage,
});
}
Ok(state)
}
async fn calculate_pod_coverage(
client: &Client,
namespace: &str,
policies_by_namespace: &HashMap<String, Vec<NetworkPolicy>>,
) -> Result<f64> {
// Get all pods in the namespace
let pods = Api::<k8s_openapi::api::core::v1::Pod>::namespaced(client.clone(), namespace)
.list(&ListParams::default())
.await?;
if pods.items.is_empty() {
return Ok(100.0); // No pods to cover
}
let policies = policies_by_namespace.get(namespace).cloned().unwrap_or_default();
let mut covered_pods = 0;
for pod in &pods.items {
// ObjectMeta labels are an Option<BTreeMap<_, _>>; convert to the HashMap used elsewhere
let pod_labels: HashMap<String, String> = pod
.metadata
.labels
.clone()
.unwrap_or_default()
.into_iter()
.collect();
// Check if any policy covers this pod
let is_covered = policies.iter().any(|policy| {
if let Some(spec) = &policy.spec {
// pod_selector is a required field on NetworkPolicySpec, not an Option
matches_labels(&pod_labels, &spec.pod_selector)
} else {
false
}
});
if is_covered {
covered_pods += 1;
}
}
Ok((covered_pods as f64 / pods.items.len() as f64) * 100.0)
}
fn matches_labels(
pod_labels: &HashMap<String, String>,
selector: &k8s_openapi::apimachinery::pkg::apis::meta::v1::LabelSelector,
) -> bool {
// Check matchLabels
if let Some(match_labels) = &selector.match_labels {
for (key, value) in match_labels {
if !pod_labels.get(key).map_or(false, |v| v == value) {
return false;
}
}
}
// Check matchExpressions (simplified)
if let Some(expressions) = &selector.match_expressions {
for expr in expressions {
let pod_value = pod_labels.get(&expr.key);
match expr.operator.as_str() {
"In" => {
if let Some(values) = &expr.values {
if pod_value.map_or(true, |v| !values.contains(v)) {
return false;
}
}
}
"NotIn" => {
if let Some(values) = &expr.values {
if pod_value.map_or(false, |v| values.contains(v)) {
return false;
}
}
}
"Exists" => {
if pod_value.is_none() {
return false;
}
}
"DoesNotExist" => {
if pod_value.is_some() {
return false;
}
}
_ => return false,
}
}
}
true
}
#[tokio::main]
async fn main() -> Result<()> {
// Initialize Kubernetes client
let client = Client::try_default().await?;
// Run initial audit
let state = audit_network_policies(client.clone()).await?;
// Print audit results
println!("Network Policy Audit Results:");
println!("============================");
// Print namespace audits
println!("\nNamespace Audits:");
for (_, audit) in &state.namespace_audits {
println!(
"Namespace: {}, Default Deny: {}, Any Policy: {}, Pod Coverage: {:.1}%",
audit.namespace, audit.has_default_deny, audit.has_any_policy, audit.pod_coverage_percentage
);
}
// Print policy audits
println!("\nPolicy Issues:");
for audit in &state.policy_audits {
println!(
"Policy: {}/{} (Severity: {})",
audit.namespace, audit.name, audit.severity
);
for issue in &audit.issues {
println!(" - {}", issue);
}
}
// Recommendations
println!("\nRecommendations:");
for (_, audit) in &state.namespace_audits {
if !audit.has_any_policy {
println!("- Namespace {} has no network policies. Consider adding a default deny policy.", audit.namespace);
} else if !audit.has_default_deny {
println!("- Namespace {} has policies but no default deny. Consider adding a default deny policy.", audit.namespace);
}
if audit.pod_coverage_percentage < 100.0 {
println!(
"- Namespace {} has only {:.1}% of pods covered by network policies.",
audit.namespace, audit.pod_coverage_percentage
);
}
}
Ok(())
}
• Long-term: Implemented a comprehensive network security strategy:
- Created a network policy validation and enforcement framework
- Implemented automated network policy testing and verification
- Added network flow monitoring and anomaly detection
- Documented best practices for Kubernetes network security
- Implemented regular network security audits
Lessons Learned:
Network policies require careful management and validation to ensure proper security boundaries.
How to Avoid:
Implement strict validation for pod labels and network policies.
Use default-deny policies in all namespaces.
Regularly audit network policy coverage and effectiveness.
Implement network flow monitoring and anomaly detection.
Test network policies with security scanning tools.
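For the default-deny recommendation above, the baseline policy can be stamped into every application namespace; a sketch whose namespace exclusion list is only an example and should be reviewed before use:
#!/usr/bin/env bash
# Apply a default-deny NetworkPolicy to every non-system namespace.
set -euo pipefail
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  case "$ns" in kube-system|kube-public|kube-node-lease) continue ;; esac
  kubectl apply -n "$ns" -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
done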
No summary provided
What Happened:
During a routine security audit, the security team discovered unexpected network traffic between containers in different namespaces that should have been isolated by network policies. The issue was particularly concerning because it allowed communication between production and development environments, potentially exposing sensitive data. The network policies appeared to be correctly configured, but they weren't being enforced as expected.
Diagnosis Steps:
Analyzed network traffic patterns using Calico flow logs.
Reviewed network policy configurations across all namespaces.
Examined pod labels and namespace selectors in network policies.
Tested network connectivity between pods using debugging tools.
Reviewed recent changes to the cluster configuration and CNI settings.
Root Cause:
The investigation revealed multiple issues:
1. Some pods were using host network mode, bypassing network policies entirely.
2. A sidecar container was using a different network namespace than its primary container.
3. Network policies were correctly configured, but the CNI plugin had a bug in selector matching.
4. Custom admission controllers were modifying pod labels after network policy evaluation.
5. Some pods were communicating through a shared volume rather than over the network.
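The first issue (host-network pods bypassing NetworkPolicies) is easy to enumerate; a sketch assuming jq is available:
# List pods that run on the host network and therefore bypass NetworkPolicies
kubectl get pods -A -o json \
  | jq -r '.items[] | select(.spec.hostNetwork == true) | "\(.metadata.namespace)/\(.metadata.name)"'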
Fix/Workaround:
• Short-term: Implemented immediate fixes to address the network policy bypass:
# Before: Pod using host network, bypassing network policies
apiVersion: v1
kind: Pod
metadata:
name: monitoring-agent
namespace: monitoring
labels:
app: monitoring
component: agent
spec:
hostNetwork: true
containers:
- name: agent
image: monitoring/agent:v1.2.3
ports:
- containerPort: 8080
hostPort: 8080
# After: Pod using pod network with proper network policies
apiVersion: v1
kind: Pod
metadata:
name: monitoring-agent
namespace: monitoring
labels:
app: monitoring
component: agent
spec:
hostNetwork: false
containers:
- name: agent
image: monitoring/agent:v1.2.3
ports:
- containerPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: monitoring-agent-policy
namespace: monitoring
spec:
podSelector:
matchLabels:
app: monitoring
component: agent
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: monitoring
- podSelector:
matchLabels:
app: monitoring
component: server
ports:
- protocol: TCP
port: 8080
egress:
- to:
- namespaceSelector:
matchLabels:
name: monitoring
- podSelector:
matchLabels:
app: monitoring
component: server
ports:
- protocol: TCP
port: 9090
• Implemented a network policy validation webhook in Go:
// network_policy_validator.go
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"net/http"
"strings"
admissionv1 "k8s.io/api/admission/v1"
corev1 "k8s.io/api/core/v1"
networkingv1 "k8s.io/api/networking/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/runtime/serializer"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
)
var (
runtimeScheme = runtime.NewScheme()
codecs = serializer.NewCodecFactory(runtimeScheme)
deserializer = codecs.UniversalDeserializer()
)
type ValidationWebhook struct {
client *kubernetes.Clientset
}
type patchOperation struct {
Op string `json:"op"`
Path string `json:"path"`
Value interface{} `json:"value,omitempty"`
}
func main() {
// Create Kubernetes client
config, err := rest.InClusterConfig()
if err != nil {
log.Fatalf("Failed to get in-cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("Failed to create Kubernetes client: %v", err)
}
webhook := &ValidationWebhook{
client: clientset,
}
// Set up HTTP server
http.HandleFunc("/validate-pod", webhook.validatePod)
http.HandleFunc("/validate-networkpolicy", webhook.validateNetworkPolicy)
http.HandleFunc("/mutate-pod", webhook.mutatePod)
log.Println("Starting webhook server on port 8443...")
log.Fatal(http.ListenAndServeTLS(":8443", "/certs/tls.crt", "/certs/tls.key", nil))
}
func (wh *ValidationWebhook) validatePod(w http.ResponseWriter, r *http.Request) {
review, err := parseAdmissionReview(r)
if err != nil {
http.Error(w, fmt.Sprintf("Failed to parse admission review: %v", err), http.StatusBadRequest)
return
}
pod := corev1.Pod{}
if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
http.Error(w, fmt.Sprintf("Failed to unmarshal pod: %v", err), http.StatusBadRequest)
return
}
// Validate pod network configuration
allowed := true
var result *metav1.Status
var warnings []string
// Check if pod uses host network
if pod.Spec.HostNetwork {
// Only allow host network in specific namespaces
if !isAllowedHostNetworkNamespace(pod.Namespace) {
allowed = false
result = &metav1.Status{
Message: fmt.Sprintf("Pod %s in namespace %s is not allowed to use host network",
pod.Name, pod.Namespace),
}
} else {
warnings = append(warnings, fmt.Sprintf("Pod %s in namespace %s uses host network, which bypasses network policies",
pod.Name, pod.Namespace))
}
}
// Check if pod has proper labels for network policies
if !hasRequiredNetworkLabels(pod) {
warnings = append(warnings, fmt.Sprintf("Pod %s in namespace %s is missing recommended network policy labels",
pod.Name, pod.Namespace))
}
// Send response
sendAdmissionResponse(w, review, allowed, result, warnings)
}
func (wh *ValidationWebhook) validateNetworkPolicy(w http.ResponseWriter, r *http.Request) {
review, err := parseAdmissionReview(r)
if err != nil {
http.Error(w, fmt.Sprintf("Failed to parse admission review: %v", err), http.StatusBadRequest)
return
}
netpol := networkingv1.NetworkPolicy{}
if err := json.Unmarshal(review.Request.Object.Raw, &netpol); err != nil {
http.Error(w, fmt.Sprintf("Failed to unmarshal network policy: %v", err), http.StatusBadRequest)
return
}
// Validate network policy
allowed := true
var result *metav1.Status
var warnings []string
// Check if network policy has both ingress and egress rules
if !hasCompleteRules(netpol) {
warnings = append(warnings, fmt.Sprintf("NetworkPolicy %s in namespace %s does not specify both ingress and egress rules",
netpol.Name, netpol.Namespace))
}
// Check if network policy uses namespace selectors properly
if !hasProperNamespaceSelectors(netpol) {
warnings = append(warnings, fmt.Sprintf("NetworkPolicy %s in namespace %s may have overly permissive namespace selectors",
netpol.Name, netpol.Namespace))
}
// Send response
sendAdmissionResponse(w, review, allowed, result, warnings)
}
func (wh *ValidationWebhook) mutatePod(w http.ResponseWriter, r *http.Request) {
review, err := parseAdmissionReview(r)
if err != nil {
http.Error(w, fmt.Sprintf("Failed to parse admission review: %v", err), http.StatusBadRequest)
return
}
pod := corev1.Pod{}
if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
http.Error(w, fmt.Sprintf("Failed to unmarshal pod: %v", err), http.StatusBadRequest)
return
}
// Prepare patches
var patches []patchOperation
// Ensure pod has network policy labels if missing
if pod.Labels == nil {
patches = append(patches, patchOperation{
Op: "add",
Path: "/metadata/labels",
Value: map[string]string{},
})
}
// Add network tier label if missing
if _, ok := pod.Labels["network-tier"]; !ok {
patches = append(patches, patchOperation{
Op: "add",
Path: "/metadata/labels/network-tier",
Value: getDefaultNetworkTier(pod.Namespace),
})
}
// Add network zone label if missing
if _, ok := pod.Labels["network-zone"]; !ok {
patches = append(patches, patchOperation{
Op: "add",
Path: "/metadata/labels/network-zone",
Value: getDefaultNetworkZone(pod.Namespace),
})
}
// Send response with patches
patchBytes, err := json.Marshal(patches)
if err != nil {
http.Error(w, fmt.Sprintf("Failed to marshal patches: %v", err), http.StatusInternalServerError)
return
}
admissionResponse := admissionv1.AdmissionResponse{
UID: review.Request.UID,
Allowed: true,
}
if len(patches) > 0 {
patchType := admissionv1.PatchTypeJSONPatch
admissionResponse.PatchType = &patchType
admissionResponse.Patch = patchBytes
}
admissionReview := admissionv1.AdmissionReview{
TypeMeta: metav1.TypeMeta{
Kind: "AdmissionReview",
APIVersion: "admission.k8s.io/v1",
},
Response: &admissionResponse,
}
resp, err := json.Marshal(admissionReview)
if err != nil {
http.Error(w, fmt.Sprintf("Failed to marshal admission review response: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
w.Write(resp)
}
// Helper functions
func parseAdmissionReview(r *http.Request) (*admissionv1.AdmissionReview, error) {
	if r.Body == nil {
		return nil, fmt.Errorf("empty body")
	}
	// Read the full request body (calling Read on a nil slice reads nothing)
	body, err := io.ReadAll(r.Body)
	if err != nil {
		return nil, fmt.Errorf("failed to read body: %v", err)
	}
	if len(body) == 0 {
		return nil, fmt.Errorf("empty body")
	}
	// Decode the admission review
	review := &admissionv1.AdmissionReview{}
	if _, _, err := deserializer.Decode(body, nil, review); err != nil {
		return nil, fmt.Errorf("failed to decode body: %v", err)
	}
	return review, nil
}
func sendAdmissionResponse(w http.ResponseWriter, review *admissionv1.AdmissionReview, allowed bool, result *metav1.Status, warnings []string) {
response := admissionv1.AdmissionResponse{
UID: review.Request.UID,
Allowed: allowed,
Result: result,
Warnings: warnings,
}
review.Response = &response
resp, err := json.Marshal(review)
if err != nil {
http.Error(w, fmt.Sprintf("Failed to marshal admission review response: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
w.Write(resp)
}
func isAllowedHostNetworkNamespace(namespace string) bool {
allowedNamespaces := []string{"kube-system", "monitoring", "logging"}
for _, ns := range allowedNamespaces {
if namespace == ns {
return true
}
}
return false
}
func hasRequiredNetworkLabels(pod corev1.Pod) bool {
_, hasTier := pod.Labels["network-tier"]
_, hasZone := pod.Labels["network-zone"]
return hasTier && hasZone
}
func hasCompleteRules(netpol networkingv1.NetworkPolicy) bool {
hasIngress := false
hasEgress := false
for _, policyType := range netpol.Spec.PolicyTypes {
if policyType == networkingv1.PolicyTypeIngress {
hasIngress = true
}
if policyType == networkingv1.PolicyTypeEgress {
hasEgress = true
}
}
return hasIngress && hasEgress
}
func hasProperNamespaceSelectors(netpol networkingv1.NetworkPolicy) bool {
// Check if network policy uses namespace selectors without specific labels
for _, ingress := range netpol.Spec.Ingress {
for _, from := range ingress.From {
if from.NamespaceSelector != nil && len(from.NamespaceSelector.MatchLabels) == 0 && len(from.NamespaceSelector.MatchExpressions) == 0 {
return false
}
}
}
for _, egress := range netpol.Spec.Egress {
for _, to := range egress.To {
if to.NamespaceSelector != nil && len(to.NamespaceSelector.MatchLabels) == 0 && len(to.NamespaceSelector.MatchExpressions) == 0 {
return false
}
}
}
return true
}
func getDefaultNetworkTier(namespace string) string {
// Map namespaces to network tiers
tierMap := map[string]string{
"production": "prod",
"staging": "staging",
"development": "dev",
"kube-system": "system",
"monitoring": "system",
"logging": "system",
}
if tier, ok := tierMap[namespace]; ok {
return tier
}
// Default to restricted tier
return "restricted"
}
func getDefaultNetworkZone(namespace string) string {
// Map namespaces to network zones
zoneMap := map[string]string{
"production": "trusted",
"staging": "semi-trusted",
"development": "untrusted",
"kube-system": "system",
"monitoring": "system",
"logging": "system",
}
if zone, ok := zoneMap[namespace]; ok {
return zone
}
// Default to untrusted zone
return "untrusted"
}
• Implemented a network policy auditing tool in Rust:
// network_policy_auditor.rs
use anyhow::{Context, Result};
use futures::StreamExt;
use k8s_openapi::api::core::v1::Pod;
use k8s_openapi::api::networking::v1::NetworkPolicy;
use kube::{
api::{Api, ListParams, ResourceExt},
Client,
};
use serde::Serialize;
use std::collections::{BTreeMap, HashMap, HashSet};
use std::fs::File;
use std::io::Write;
use std::path::Path;
use structopt::StructOpt;
#[derive(Debug, StructOpt)]
#[structopt(name = "network-policy-auditor", about = "Kubernetes Network Policy Auditor")]
struct Opt {
/// Output format (json, yaml, table)
#[structopt(short, long, default_value = "table")]
format: String,
/// Output file (if not specified, output to stdout)
#[structopt(short, long)]
output: Option<String>,
/// Kubernetes namespace to audit (if not specified, audit all namespaces)
#[structopt(short, long)]
namespace: Option<String>,
/// Include detailed pod information
#[structopt(long)]
detailed: bool,
/// Only show violations
#[structopt(long)]
violations_only: bool,
}
#[derive(Debug, Serialize)]
struct NamespaceReport {
namespace: String,
pod_count: usize,
network_policy_count: usize,
pods_without_policy: Vec<String>,
host_network_pods: Vec<String>,
violations: Vec<Violation>,
}
#[derive(Debug, Serialize)]
struct Violation {
severity: String,
message: String,
affected_resources: Vec<String>,
recommendation: String,
}
#[derive(Debug, Serialize)]
struct AuditReport {
timestamp: String,
cluster_name: String,
namespaces: Vec<NamespaceReport>,
summary: Summary,
}
#[derive(Debug, Serialize)]
struct Summary {
total_namespaces: usize,
total_pods: usize,
total_network_policies: usize,
total_violations: usize,
violation_by_severity: HashMap<String, usize>,
}
#[tokio::main]
async fn main() -> Result<()> {
let opt = Opt::from_args();
// Initialize Kubernetes client
let client = Client::try_default().await?;
// Get cluster info
let cluster_name = get_cluster_name(&client).await?;
// Get current timestamp
let timestamp = chrono::Utc::now().to_rfc3339();
// Initialize report
let mut report = AuditReport {
timestamp,
cluster_name,
namespaces: Vec::new(),
summary: Summary {
total_namespaces: 0,
total_pods: 0,
total_network_policies: 0,
total_violations: 0,
violation_by_severity: HashMap::new(),
},
};
// Get namespaces to audit
let namespaces = if let Some(ns) = &opt.namespace {
vec![ns.clone()]
} else {
get_all_namespaces(&client).await?
};
report.summary.total_namespaces = namespaces.len();
// Audit each namespace
for namespace in namespaces {
let namespace_report = audit_namespace(&client, &namespace, &opt).await?;
// Update summary
report.summary.total_pods += namespace_report.pod_count;
report.summary.total_network_policies += namespace_report.network_policy_count;
report.summary.total_violations += namespace_report.violations.len();
for violation in &namespace_report.violations {
*report.summary.violation_by_severity
.entry(violation.severity.clone())
.or_insert(0) += 1;
}
// Add namespace report if it has violations or we're not filtering
if !opt.violations_only || !namespace_report.violations.is_empty() {
report.namespaces.push(namespace_report);
}
}
// Output report
output_report(&report, &opt)?;
Ok(())
}
async fn get_cluster_name(client: &Client) -> Result<String> {
let nodes_api: Api<k8s_openapi::api::core::v1::Node> = Api::all(client.clone());
let nodes = nodes_api.list(&ListParams::default()).await?;
if let Some(node) = nodes.items.first() {
if let Some(provider_id) = &node.spec.as_ref().and_then(|s| s.provider_id.as_ref()) {
return Ok(provider_id.split('/').last().unwrap_or("unknown").to_string());
}
}
Ok("unknown".to_string())
}
async fn get_all_namespaces(client: &Client) -> Result<Vec<String>> {
let namespaces_api: Api<k8s_openapi::api::core::v1::Namespace> = Api::all(client.clone());
let namespaces = namespaces_api.list(&ListParams::default()).await?;
Ok(namespaces
.items
.into_iter()
.filter_map(|ns| ns.metadata.name)
.collect())
}
async fn audit_namespace(client: &Client, namespace: &str, opt: &Opt) -> Result<NamespaceReport> {
// Get pods in namespace
let pods_api: Api<Pod> = Api::namespaced(client.clone(), namespace);
let pods = pods_api.list(&ListParams::default()).await?;
// Get network policies in namespace
let netpol_api: Api<NetworkPolicy> = Api::namespaced(client.clone(), namespace);
let netpols = netpol_api.list(&ListParams::default()).await?;
let mut namespace_report = NamespaceReport {
namespace: namespace.to_string(),
pod_count: pods.items.len(),
network_policy_count: netpols.items.len(),
pods_without_policy: Vec::new(),
host_network_pods: Vec::new(),
violations: Vec::new(),
};
// Check for pods using host network
for pod in &pods.items {
let pod_name = pod.name_any();
if pod.spec.as_ref().and_then(|s| s.host_network).unwrap_or(false) {
namespace_report.host_network_pods.push(pod_name.clone());
// Add violation if not in allowed namespace
if !is_allowed_host_network_namespace(namespace) {
namespace_report.violations.push(Violation {
severity: "HIGH".to_string(),
message: format!("Pod {} uses host network in non-system namespace", pod_name),
affected_resources: vec![format!("Pod/{}", pod_name)],
recommendation: "Remove hostNetwork: true from pod spec or move pod to a system namespace".to_string(),
});
}
}
}
// Check for pods without network policies
let mut pods_covered_by_policy = HashSet::new();
for netpol in &netpols.items {
let selector = match &netpol.spec {
Some(spec) => &spec.pod_selector,
None => continue,
};
// Find pods matching this network policy
for pod in &pods.items {
let pod_name = pod.name_any();
let pod_labels = match &pod.metadata.labels {
Some(labels) => labels,
None => continue,
};
if selector_matches_labels(selector, pod_labels) {
pods_covered_by_policy.insert(pod_name);
}
}
// Check if network policy has both ingress and egress rules
if let Some(spec) = &netpol.spec {
let has_ingress = spec.policy_types.as_ref().map_or(false, |types| {
types.contains(&"Ingress".to_string())
});
let has_egress = spec.policy_types.as_ref().map_or(false, |types| {
types.contains(&"Egress".to_string())
});
if !has_ingress || !has_egress {
namespace_report.violations.push(Violation {
severity: "MEDIUM".to_string(),
message: format!(
"NetworkPolicy {} does not specify both ingress and egress rules",
netpol.name_any()
),
affected_resources: vec![format!("NetworkPolicy/{}", netpol.name_any())],
recommendation: "Add both Ingress and Egress to policyTypes".to_string(),
});
}
// Check for overly permissive namespace selectors
if has_overly_permissive_selectors(spec) {
namespace_report.violations.push(Violation {
severity: "HIGH".to_string(),
message: format!(
"NetworkPolicy {} has overly permissive namespace selectors",
netpol.name_any()
),
affected_resources: vec![format!("NetworkPolicy/{}", netpol.name_any())],
recommendation: "Restrict namespace selectors with specific labels".to_string(),
});
}
}
}
// Find pods not covered by any network policy
for pod in &pods.items {
let pod_name = pod.name_any();
if !pods_covered_by_policy.contains(&pod_name) {
namespace_report.pods_without_policy.push(pod_name.clone());
// Add violation if not in system namespace
if !is_system_namespace(namespace) {
namespace_report.violations.push(Violation {
severity: "MEDIUM".to_string(),
message: format!("Pod {} is not covered by any NetworkPolicy", pod_name),
affected_resources: vec![format!("Pod/{}", pod_name)],
recommendation: "Create a NetworkPolicy that selects this pod".to_string(),
});
}
}
}
// Check for missing network policy labels
for pod in &pods.items {
let pod_name = pod.name_any();
let pod_labels = match &pod.metadata.labels {
Some(labels) => labels,
None => {
// Add violation for missing labels
if !is_system_namespace(namespace) {
namespace_report.violations.push(Violation {
severity: "LOW".to_string(),
message: format!("Pod {} has no labels for NetworkPolicy selection", pod_name),
affected_resources: vec![format!("Pod/{}", pod_name)],
recommendation: "Add appropriate labels for NetworkPolicy selection".to_string(),
});
}
continue;
}
};
// Check for recommended network policy labels
if !pod_labels.contains_key("network-tier") || !pod_labels.contains_key("network-zone") {
if !is_system_namespace(namespace) {
namespace_report.violations.push(Violation {
severity: "LOW".to_string(),
message: format!(
"Pod {} is missing recommended network policy labels (network-tier, network-zone)",
pod_name
),
affected_resources: vec![format!("Pod/{}", pod_name)],
recommendation: "Add network-tier and network-zone labels".to_string(),
});
}
}
}
Ok(namespace_report)
}
fn is_allowed_host_network_namespace(namespace: &str) -> bool {
matches!(namespace, "kube-system" | "monitoring" | "logging")
}
fn is_system_namespace(namespace: &str) -> bool {
namespace.starts_with("kube-") || matches!(namespace, "monitoring" | "logging")
}
fn selector_matches_labels(
    selector: &k8s_openapi::apimachinery::pkg::apis::meta::v1::LabelSelector,
    labels: &BTreeMap<String, String>, // k8s-openapi stores labels as a BTreeMap, not a HashMap
) -> bool {
// If selector is empty, it selects all pods
if selector.match_labels.is_none() && selector.match_expressions.is_none() {
return true;
}
// Check match_labels
if let Some(match_labels) = &selector.match_labels {
for (key, value) in match_labels {
if !labels.get(key).map_or(false, |v| v == value) {
return false;
}
}
}
// Check match_expressions (simplified implementation)
if let Some(expressions) = &selector.match_expressions {
for expr in expressions {
let label_value = labels.get(&expr.key);
match expr.operator.as_str() {
"In" => {
if !expr.values.as_ref().map_or(false, |values| {
label_value.map_or(false, |v| values.contains(v))
}) {
return false;
}
}
"NotIn" => {
if expr.values.as_ref().map_or(false, |values| {
label_value.map_or(false, |v| values.contains(v))
}) {
return false;
}
}
"Exists" => {
if label_value.is_none() {
return false;
}
}
"DoesNotExist" => {
if label_value.is_some() {
return false;
}
}
_ => {}
}
}
}
true
}
fn has_overly_permissive_selectors(spec: &k8s_openapi::api::networking::v1::NetworkPolicySpec) -> bool {
// Check ingress rules
if let Some(ingress) = &spec.ingress {
for rule in ingress {
if let Some(from) = &rule.from {
for peer in from {
if let Some(ns_selector) = &peer.namespace_selector {
if ns_selector.match_labels.is_none() && ns_selector.match_expressions.is_none() {
return true;
}
}
}
}
}
}
// Check egress rules
if let Some(egress) = &spec.egress {
for rule in egress {
if let Some(to) = &rule.to {
for peer in to {
if let Some(ns_selector) = &peer.namespace_selector {
if ns_selector.match_labels.is_none() && ns_selector.match_expressions.is_none() {
return true;
}
}
}
}
}
}
false
}
fn output_report(report: &AuditReport, opt: &Opt) -> Result<()> {
let output = match opt.format.as_str() {
"json" => serde_json::to_string_pretty(report)?,
"yaml" => serde_yaml::to_string(report)?,
"table" => format_as_table(report, opt.detailed),
_ => return Err(anyhow::anyhow!("Unsupported output format: {}", opt.format)),
};
if let Some(output_file) = &opt.output {
let path = Path::new(output_file);
let mut file = File::create(path).context("Failed to create output file")?;
file.write_all(output.as_bytes()).context("Failed to write to output file")?;
println!("Report written to {}", output_file);
} else {
println!("{}", output);
}
Ok(())
}
fn format_as_table(report: &AuditReport, detailed: bool) -> String {
let mut output = String::new();
output.push_str(&format!("Network Policy Audit Report\n"));
output.push_str(&format!("Timestamp: {}\n", report.timestamp));
output.push_str(&format!("Cluster: {}\n\n", report.cluster_name));
output.push_str(&format!("Summary:\n"));
output.push_str(&format!(" Total Namespaces: {}\n", report.summary.total_namespaces));
output.push_str(&format!(" Total Pods: {}\n", report.summary.total_pods));
output.push_str(&format!(" Total Network Policies: {}\n", report.summary.total_network_policies));
output.push_str(&format!(" Total Violations: {}\n", report.summary.total_violations));
output.push_str(&format!("\nViolations by Severity:\n"));
for (severity, count) in &report.summary.violation_by_severity {
output.push_str(&format!(" {}: {}\n", severity, count));
}
output.push_str(&format!("\nNamespace Reports:\n"));
for ns_report in &report.namespaces {
output.push_str(&format!("\n{}\n", "=".repeat(80)));
output.push_str(&format!("Namespace: {}\n", ns_report.namespace));
output.push_str(&format!("Pods: {}\n", ns_report.pod_count));
output.push_str(&format!("Network Policies: {}\n", ns_report.network_policy_count));
if detailed {
if !ns_report.host_network_pods.is_empty() {
output.push_str(&format!("\nPods using host network:\n"));
for pod in &ns_report.host_network_pods {
output.push_str(&format!(" - {}\n", pod));
}
}
if !ns_report.pods_without_policy.is_empty() {
output.push_str(&format!("\nPods not covered by any NetworkPolicy:\n"));
for pod in &ns_report.pods_without_policy {
output.push_str(&format!(" - {}\n", pod));
}
}
}
if !ns_report.violations.is_empty() {
output.push_str(&format!("\nViolations:\n"));
for (i, violation) in ns_report.violations.iter().enumerate() {
output.push_str(&format!(" {}. [{}] {}\n", i + 1, violation.severity, violation.message));
output.push_str(&format!(" Affected Resources: {}\n", violation.affected_resources.join(", ")));
output.push_str(&format!(" Recommendation: {}\n", violation.recommendation));
}
}
}
output
}
• Long-term: Implemented a comprehensive network security strategy:
- Created a network policy management framework
- Implemented automated network policy testing
- Developed a network traffic visualization tool
- Established clear procedures for network policy changes
- Implemented monitoring and alerting for network policy violations
Lessons Learned:
Container network policies require careful configuration and monitoring to ensure proper isolation.
How to Avoid:
Avoid using host network mode for containers when possible.
Implement proper network policies with both ingress and egress rules.
Use consistent labeling for network policy selection.
Regularly audit network policies and test their effectiveness.
Implement automated validation for network policy changes.
No summary provided
What Happened:
During a routine security audit, penetration testers discovered that pods in a restricted namespace could communicate with pods in a PCI-compliant namespace, despite network policies that should have prevented this communication. This vulnerability potentially exposed sensitive financial data to less secure parts of the application. The issue was discovered before any actual breach occurred, but represented a significant security risk.
Diagnosis Steps:
Analyzed existing network policies across all namespaces.
Tested pod-to-pod communication paths using network debugging tools.
Reviewed CNI configuration and network plugin settings.
Examined pod labels and namespace configurations.
Audited recent changes to network policies and cluster configuration.
Root Cause:
The investigation revealed multiple issues with the network policy implementation:
1. Network policies were using incorrect pod selector labels that didn't match actual pods
2. Some pods were missing the expected labels entirely, causing them to be excluded from policy enforcement
3. The Calico CNI configuration was misconfigured in a way that prevented proper policy enforcement
4. Default allow rules in one namespace were overriding deny rules in connected namespaces
5. Network policy audit logging was disabled, preventing detection of policy violations
Fix/Workaround:
• Short-term: Implemented immediate fixes to secure the environment:
# Before: Problematic NetworkPolicy with incorrect selectors
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: pci-namespace-isolation
namespace: pci-compliant
spec:
podSelector: {} # Applies to all pods in namespace
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
environment: pci-approved # Incorrect label, should be 'compliance: pci-approved'
- podSelector:
matchLabels:
role: payment-processor # Some pods were missing this label
# After: Corrected NetworkPolicy with proper selectors
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: pci-namespace-isolation
namespace: pci-compliant
spec:
podSelector: {} # Applies to all pods in namespace
policyTypes:
- Ingress
- Egress # Added egress rules for complete isolation
ingress:
- from:
- namespaceSelector:
matchLabels:
compliance: pci-approved
podSelector:
matchLabels:
role: payment-processor
ports:
- protocol: TCP
port: 8443
egress:
- to:
- namespaceSelector:
matchLabels:
compliance: pci-approved
ports:
- protocol: TCP
port: 8443
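Because Egress is now enforced for every pod in the namespace (podSelector: {}), the policy (or a companion policy) also needs an explicit DNS rule, or name resolution inside pci-compliant will fail. A minimal additional egress entry, assuming cluster DNS runs in kube-system with the standard k8s-app: kube-dns label (adjust to the cluster's actual DNS deployment):
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53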
• Fixed Calico CNI configuration:
# Before: Problematic Calico configuration
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
name: default
spec:
logSeverityScreen: Info
reportingInterval: 0s
ipipEnabled: true
logFilePath: /var/log/calico/felix.log
prometheusMetricsEnabled: true
# After: Corrected Calico configuration with policy enforcement and logging
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
name: default
spec:
logSeverityScreen: Info
reportingInterval: 0s
ipipEnabled: true
logFilePath: /var/log/calico/felix.log
prometheusMetricsEnabled: true
policyLogSeverity: Info # Enable policy logging
failsafeInboundHostPorts: # Define failsafe ports
- protocol: tcp
port: 22
- protocol: udp
port: 68
failsafeOutboundHostPorts:
- protocol: tcp
port: 53
- protocol: udp
port: 53
- protocol: udp
port: 67
- protocol: tcp
port: 179 # BGP
- protocol: tcp
port: 443 # HTTPS for API server
- protocol: tcp
port: 6443 # Kubernetes API
• Implemented a network policy validation script:
#!/usr/bin/env python3
# network_policy_validator.py - Validate network policies against actual pod labels
import subprocess
import json
import sys
def run_command(command):
"""Run a command and return the output as JSON."""
result = subprocess.run(command, shell=True, capture_output=True, text=True)
if result.returncode != 0:
print(f"Error running command: {command}")
print(f"Error: {result.stderr}")
sys.exit(1)
return json.loads(result.stdout)
def get_all_pods():
"""Get all pods in the cluster with their labels and namespaces."""
pods = run_command("kubectl get pods --all-namespaces -o json")
pod_info = []
for pod in pods["items"]:
pod_info.append({
"name": pod["metadata"]["name"],
"namespace": pod["metadata"]["namespace"],
"labels": pod["metadata"].get("labels", {}),
})
return pod_info
def get_all_network_policies():
"""Get all network policies in the cluster."""
policies = run_command("kubectl get networkpolicies --all-namespaces -o json")
policy_info = []
for policy in policies["items"]:
policy_info.append({
"name": policy["metadata"]["name"],
"namespace": policy["metadata"]["namespace"],
"spec": policy["spec"],
})
return policy_info
def get_all_namespaces():
"""Get all namespaces with their labels."""
namespaces = run_command("kubectl get namespaces -o json")
namespace_info = []
for ns in namespaces["items"]:
namespace_info.append({
"name": ns["metadata"]["name"],
"labels": ns["metadata"].get("labels", {}),
})
return namespace_info
def validate_pod_selectors(policies, pods):
"""Validate that pod selectors in network policies match actual pods."""
issues = []
for policy in policies:
policy_ns = policy["namespace"]
policy_name = policy["name"]
pod_selector = policy["spec"].get("podSelector", {})
# Skip if podSelector is empty (applies to all pods)
if not pod_selector:
continue
match_labels = pod_selector.get("matchLabels", {})
match_expressions = pod_selector.get("matchExpressions", [])
# Get pods in the same namespace as the policy
ns_pods = [p for p in pods if p["namespace"] == policy_ns]
# Check if any pods match the selector
matching_pods = []
for pod in ns_pods:
pod_labels = pod["labels"]
# Check matchLabels
labels_match = all(
key in pod_labels and pod_labels[key] == value
for key, value in match_labels.items()
)
# Check matchExpressions (simplified)
expressions_match = True
for expr in match_expressions:
key = expr["key"]
operator = expr["operator"]
values = expr.get("values", [])
if key not in pod_labels:
expressions_match = False
break
if operator == "In" and pod_labels[key] not in values:
expressions_match = False
break
elif operator == "NotIn" and pod_labels[key] in values:
expressions_match = False
break
elif operator == "Exists" and key not in pod_labels:
expressions_match = False
break
elif operator == "DoesNotExist" and key in pod_labels:
expressions_match = False
break
if labels_match and expressions_match:
matching_pods.append(pod["name"])
if not matching_pods:
issues.append({
"policy": f"{policy_ns}/{policy_name}",
"issue": "No pods match the policy's podSelector",
"selector": match_labels,
"expressions": match_expressions,
"namespace_pods": [p["name"] for p in ns_pods],
})
return issues
def validate_namespace_selectors(policies, namespaces):
"""Validate that namespace selectors in network policies match actual namespaces."""
issues = []
for policy in policies:
policy_ns = policy["namespace"]
policy_name = policy["name"]
# Check ingress rules
ingress_rules = policy["spec"].get("ingress", [])
for i, rule in enumerate(ingress_rules):
from_rules = rule.get("from", [])
for j, from_rule in enumerate(from_rules):
ns_selector = from_rule.get("namespaceSelector", {})
if not ns_selector:
continue
match_labels = ns_selector.get("matchLabels", {})
match_expressions = ns_selector.get("matchExpressions", [])
# Check if any namespaces match the selector
matching_ns = []
for ns in namespaces:
ns_labels = ns["labels"]
# Check matchLabels
labels_match = all(
key in ns_labels and ns_labels[key] == value
for key, value in match_labels.items()
)
# Check matchExpressions (simplified)
expressions_match = True
for expr in match_expressions:
key = expr["key"]
operator = expr["operator"]
values = expr.get("values", [])
if key not in ns_labels:
expressions_match = False
break
if operator == "In" and ns_labels[key] not in values:
expressions_match = False
break
elif operator == "NotIn" and ns_labels[key] in values:
expressions_match = False
break
elif operator == "Exists" and key not in ns_labels:
expressions_match = False
break
elif operator == "DoesNotExist" and key in ns_labels:
expressions_match = False
break
if labels_match and expressions_match:
matching_ns.append(ns["name"])
if not matching_ns:
issues.append({
"policy": f"{policy_ns}/{policy_name}",
"rule": f"ingress[{i}].from[{j}]",
"issue": "No namespaces match the namespaceSelector",
"selector": match_labels,
"expressions": match_expressions,
})
# Check egress rules
egress_rules = policy["spec"].get("egress", [])
for i, rule in enumerate(egress_rules):
to_rules = rule.get("to", [])
for j, to_rule in enumerate(to_rules):
ns_selector = to_rule.get("namespaceSelector", {})
if not ns_selector:
continue
match_labels = ns_selector.get("matchLabels", {})
match_expressions = ns_selector.get("matchExpressions", [])
# Check if any namespaces match the selector
matching_ns = []
for ns in namespaces:
ns_labels = ns["labels"]
# Check matchLabels
labels_match = all(
key in ns_labels and ns_labels[key] == value
for key, value in match_labels.items()
)
# Check matchExpressions (simplified)
expressions_match = True
for expr in match_expressions:
key = expr["key"]
operator = expr["operator"]
values = expr.get("values", [])
if key not in ns_labels:
expressions_match = False
break
if operator == "In" and ns_labels[key] not in values:
expressions_match = False
break
elif operator == "NotIn" and ns_labels[key] in values:
expressions_match = False
break
elif operator == "Exists" and key not in ns_labels:
expressions_match = False
break
elif operator == "DoesNotExist" and key in ns_labels:
expressions_match = False
break
if labels_match and expressions_match:
matching_ns.append(ns["name"])
if not matching_ns:
issues.append({
"policy": f"{policy_ns}/{policy_name}",
"rule": f"egress[{i}].to[{j}]",
"issue": "No namespaces match the namespaceSelector",
"selector": match_labels,
"expressions": match_expressions,
})
return issues
def check_default_allow_policies(policies):
"""Check for overly permissive default allow policies."""
issues = []
for policy in policies:
policy_ns = policy["namespace"]
policy_name = policy["name"]
pod_selector = policy["spec"].get("podSelector", {})
ingress = policy["spec"].get("ingress", [])
egress = policy["spec"].get("egress", [])
# Check for empty podSelector with empty ingress/egress rules
if not pod_selector and ingress and not ingress[0].get("from"):
issues.append({
"policy": f"{policy_ns}/{policy_name}",
"issue": "Default allow ingress policy detected",
"details": "Policy applies to all pods in namespace and allows all ingress traffic",
})
if not pod_selector and egress and not egress[0].get("to"):
issues.append({
"policy": f"{policy_ns}/{policy_name}",
"issue": "Default allow egress policy detected",
"details": "Policy applies to all pods in namespace and allows all egress traffic",
})
return issues
def check_missing_egress_rules(policies):
"""Check for policies that have ingress rules but no egress rules."""
issues = []
for policy in policies:
policy_ns = policy["namespace"]
policy_name = policy["name"]
ingress = policy["spec"].get("ingress", [])
egress = policy["spec"].get("egress", [])
policy_types = policy["spec"].get("policyTypes", [])
if ingress and not egress and "Egress" not in policy_types:
issues.append({
"policy": f"{policy_ns}/{policy_name}",
"issue": "Policy has ingress rules but no egress rules",
"details": "This allows unrestricted outbound traffic which may be a security risk",
})
return issues
def main():
print("Validating Kubernetes Network Policies...")
pods = get_all_pods()
policies = get_all_network_policies()
namespaces = get_all_namespaces()
print(f"Found {len(pods)} pods, {len(policies)} network policies, and {len(namespaces)} namespaces")
# Run validations
pod_selector_issues = validate_pod_selectors(policies, pods)
namespace_selector_issues = validate_namespace_selectors(policies, namespaces)
default_allow_issues = check_default_allow_policies(policies)
missing_egress_issues = check_missing_egress_rules(policies)
# Print results
if pod_selector_issues:
print("\n=== Pod Selector Issues ===")
for issue in pod_selector_issues:
print(f"Policy: {issue['policy']}")
print(f"Issue: {issue['issue']}")
print(f"Selector: {issue['selector']}")
print("---")
if namespace_selector_issues:
print("\n=== Namespace Selector Issues ===")
for issue in namespace_selector_issues:
print(f"Policy: {issue['policy']}")
print(f"Rule: {issue['rule']}")
print(f"Issue: {issue['issue']}")
print(f"Selector: {issue['selector']}")
print("---")
if default_allow_issues:
print("\n=== Default Allow Issues ===")
for issue in default_allow_issues:
print(f"Policy: {issue['policy']}")
print(f"Issue: {issue['issue']}")
print(f"Details: {issue['details']}")
print("---")
if missing_egress_issues:
print("\n=== Missing Egress Rules ===")
for issue in missing_egress_issues:
print(f"Policy: {issue['policy']}")
print(f"Issue: {issue['issue']}")
print(f"Details: {issue['details']}")
print("---")
# Summary
total_issues = len(pod_selector_issues) + len(namespace_selector_issues) + len(default_allow_issues) + len(missing_egress_issues)
if total_issues == 0:
print("\n✅ No issues found in network policies!")
else:
print(f"\n❌ Found {total_issues} issues in network policies")
print("Please review and fix these issues to ensure proper network isolation")
if __name__ == "__main__":
main()
• Created a Bash script for testing network connectivity between pods:
#!/bin/bash
# network_policy_tester.sh - Test network connectivity between pods across namespaces
set -e
SOURCE_NS=${1:-"default"}
SOURCE_POD_LABEL=${2:-"app=network-tester"}
TARGET_NS=${3:-"all"}
TARGET_PORT=${4:-"80"}
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
NC='\033[0m' # No Color
echo -e "${YELLOW}Network Policy Connectivity Tester${NC}"
echo "Testing connectivity from pods in namespace: $SOURCE_NS with label: $SOURCE_POD_LABEL"
# Check if network test pod exists, create if not
# Note: kubectl returns 0 even when no pods match the label, so check the output instead
if [ -z "$(kubectl get pod -n "$SOURCE_NS" -l "$SOURCE_POD_LABEL" -o name 2>/dev/null)" ]; then
echo -e "${YELLOW}Creating network test pod in namespace $SOURCE_NS...${NC}"
kubectl run network-tester -n $SOURCE_NS --labels=$SOURCE_POD_LABEL --image=nicolaka/netshoot -- sleep 3600
echo "Waiting for pod to be ready..."
kubectl wait --for=condition=ready pod -n $SOURCE_NS -l $SOURCE_POD_LABEL --timeout=60s
fi
# Get the name of the test pod
SOURCE_POD=$(kubectl get pod -n $SOURCE_NS -l $SOURCE_POD_LABEL -o jsonpath='{.items[0].metadata.name}')
echo "Using source pod: $SOURCE_POD in namespace: $SOURCE_NS"
# Get target namespaces
if [ "$TARGET_NS" == "all" ]; then
TARGET_NAMESPACES=$(kubectl get ns -o jsonpath='{.items[*].metadata.name}')
else
TARGET_NAMESPACES=$TARGET_NS
fi
# Function to test connectivity to a pod
test_connectivity() {
local target_ns=$1
local target_pod=$2
local target_ip=$3
local target_port=$4
echo -e "${YELLOW}Testing connectivity to $target_pod ($target_ip:$target_port) in namespace $target_ns...${NC}"
# Try TCP connection with timeout
if kubectl exec -n $SOURCE_NS $SOURCE_POD -- timeout 3 bash -c "nc -zv -w 2 $target_ip $target_port 2>&1"; then
echo -e "${GREEN}✅ Connection SUCCESSFUL to $target_pod in $target_ns${NC}"
return 0
else
echo -e "${RED}❌ Connection FAILED to $target_pod in $target_ns${NC}"
return 1
fi
}
# Test connectivity to pods in target namespaces
for ns in $TARGET_NAMESPACES; do
echo -e "\n${YELLOW}Scanning namespace: $ns${NC}"
# Skip kube-system and other system namespaces if testing all
if [ "$TARGET_NS" == "all" ] && [[ "$ns" =~ ^(kube-system|kube-public|kube-node-lease)$ ]]; then
echo "Skipping system namespace: $ns"
continue
fi
# Get all pods in the namespace
PODS=$(kubectl get pods -n $ns -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{" "}{.spec.containers[0].ports[0].containerPort}{"\n"}{end}' 2>/dev/null)
if [ -z "$PODS" ]; then
echo "No pods found in namespace $ns or no IP/port information available"
continue
fi
# Test connectivity to each pod
while read -r pod_info; do
if [ -z "$pod_info" ]; then
continue
fi
pod_name=$(echo $pod_info | awk '{print $1}')
pod_ip=$(echo $pod_info | awk '{print $2}')
pod_port=$(echo $pod_info | awk '{print $3}')
# Use default port if not specified
if [ -z "$pod_port" ]; then
pod_port=$TARGET_PORT
fi
if [ -n "$pod_ip" ]; then
test_connectivity "$ns" "$pod_name" "$pod_ip" "$pod_port" || true  # keep scanning even if a connection fails (set -e)
fi
done <<< "$PODS"
done
echo -e "\n${YELLOW}Network connectivity testing complete${NC}"
• Long-term: Implemented a comprehensive network security strategy:
- Developed automated network policy validation in CI/CD pipelines
- Implemented network policy visualization and documentation
- Created a centralized network policy management system
- Established regular network security testing and auditing
- Implemented network traffic monitoring and anomaly detection
Lessons Learned:
Network policies require careful configuration and regular validation to be effective.
How to Avoid:
Implement automated validation of network policies against actual pod and namespace labels.
Use a consistent labeling strategy across all resources.
Test network policies in a staging environment before production.
Implement network policy logging and monitoring.
Regularly audit and test network isolation between namespaces.
No summary provided
What Happened:
Security researchers published details of a critical remote code execution vulnerability in a popular open-source library used by multiple services in the production environment. The vulnerability had no patch available yet, but proof-of-concept exploit code was already circulating. The security team needed to implement immediate mitigation measures while waiting for an official patch, all without disrupting critical business services.
Diagnosis Steps:
Assessed the vulnerability details and potential impact.
Identified all affected services and their criticality.
Evaluated possible mitigation strategies.
Tested mitigation measures in staging environment.
Developed a phased implementation plan with rollback options.
Root Cause:
The vulnerability existed in a widely used authentication library that allowed attackers to bypass authentication and execute arbitrary code through a specially crafted request header.
Fix/Workaround:
• Implemented immediate network-level protections
• Deployed WAF rules to block exploit patterns
• Created custom network policies to restrict traffic to the affected services (see the sketch after this list)
• Implemented additional monitoring for exploitation attempts
• Developed a phased patching strategy
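The exact rules were environment-specific, but the temporary network restriction applied while waiting for a patch followed roughly the sketch below: only the WAF-fronted gateway may reach the service that embeds the vulnerable library. All names and labels here are illustrative, not taken from the actual environment.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vulnerable-service-lockdown
  namespace: auth              # illustrative namespace of the affected service
spec:
  podSelector:
    matchLabels:
      app: auth-api            # illustrative label of the affected service
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress   # namespace of the WAF-fronted gateway
      podSelector:
        matchLabels:
          app: ingress-gateway
    ports:
    - protocol: TCP
      port: 8443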
Lessons Learned:
Zero-day vulnerabilities require rapid response with multiple layers of defense.
How to Avoid:
Maintain up-to-date dependency inventory for all applications.
Implement defense-in-depth strategies with multiple security layers.
Establish clear security incident response procedures.
Develop and test emergency deployment procedures.
Subscribe to security advisories for critical dependencies.
No summary provided
What Happened:
During a routine security audit, the security team discovered unexpected network traffic between pods in different namespaces that should have been isolated according to the defined network policies. Further investigation revealed that certain pods were able to communicate with services in restricted namespaces, bypassing the intended security controls. This raised concerns about potential lateral movement opportunities for attackers and compliance violations.
Diagnosis Steps:
Analyzed network traffic logs between namespaces.
Reviewed all network policies across the cluster.
Tested network connectivity using debugging pods.
Examined pod labels and namespace configurations.
Verified CNI plugin configuration and version.
Root Cause:
The investigation revealed multiple issues with the network policy implementation:
1. Some pods had incorrect labels that didn't match the network policy selectors
2. The Calico CNI plugin was misconfigured with conflicting global and namespace policies
3. A recent Kubernetes upgrade had changed the behavior of certain network policy features
4. Default allow rules were taking precedence over deny rules in some cases
5. Egress policies were missing for some workloads, allowing outbound connections
Fix/Workaround:
• Implemented immediate fixes to restore proper network isolation
• Corrected pod labels to match network policy selectors
• Resolved CNI configuration issues and updated to latest version
• Implemented proper policy precedence and default deny rules
• Created comprehensive egress policies for all workloads (a baseline sketch follows below)
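A baseline egress policy of the kind rolled out per application namespace might look like the following sketch (namespace and labels are illustrative): anything not listed is denied, while in-namespace traffic and cluster DNS remain allowed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: baseline-egress
  namespace: orders            # illustrative application namespace
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  # Allow traffic to other pods in the same namespace
  - to:
    - podSelector: {}
  # Allow DNS lookups against cluster DNS
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53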
Lessons Learned:
Network policies require regular validation and testing to ensure effectiveness.
How to Avoid:
Implement regular network policy validation testing.
Use network policy visualization tools to understand policy effects.
Create automated tests for network isolation between namespaces.
Establish clear ownership and review processes for network policies.
Monitor and alert on unexpected cross-namespace traffic.
No summary provided
What Happened:
A company's security monitoring system detected unusual network activity originating from a production Kubernetes cluster. Investigation revealed that an attacker had exploited a container escape vulnerability in a running container to gain access to the host node. From there, they attempted lateral movement within the network by scanning for other vulnerable systems. The incident was detected before significant damage occurred, but it highlighted serious security gaps in the container runtime configuration and network security controls.
Diagnosis Steps:
Analyzed security alerts and network traffic logs.
Examined the compromised container and host system.
Reviewed container runtime configuration and privileges.
Checked network policies and segmentation.
Investigated the initial attack vector and exploitation method.
Root Cause:
The investigation revealed multiple security issues:
1. The container was running with excessive privileges (--privileged flag)
2. The container runtime had an unpatched vulnerability
3. Host system security controls were insufficient
4. Network segmentation between pods and nodes was inadequate
5. Container image scanning had missed a vulnerable component
Fix/Workaround:
• Implemented immediate containment and remediation
• Patched the container runtime vulnerability
• Removed privileged access from all containers (see the hardening sketch after this list)
• Implemented proper network segmentation
• Enhanced security monitoring and alerting
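As an illustration of the hardened configuration (namespace, pod, and image names are hypothetical), the remediation combined a restricted Pod Security Admission label on application namespaces with an explicit non-privileged security context on every container:
apiVersion: v1
kind: Namespace
metadata:
  name: payments                          # hypothetical application namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
---
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: payments
spec:
  containers:
  - name: app
    image: example/app:1.0                # placeholder image
    securityContext:
      privileged: false                   # never run application containers privileged
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      capabilities:
        drop:
        - ALL
      seccompProfile:
        type: RuntimeDefault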
Lessons Learned:
Container security requires defense in depth, including proper configuration, patching, and network controls.
How to Avoid:
Never run containers with the --privileged flag unless absolutely necessary.
Implement strict pod security policies to enforce least privilege.
Keep container runtimes and host systems patched.
Implement proper network segmentation and policies.
Use runtime security monitoring for containers and hosts.
No summary provided
What Happened:
A large financial services company used Envoy as an API gateway and service mesh proxy throughout their Kubernetes environment. Security monitoring detected unusual access patterns to internal services that should have been protected by authentication. Investigation revealed that attackers were exploiting a previously unknown vulnerability in the proxy's request handling logic to bypass authentication checks. The vulnerability affected all production clusters and potentially exposed sensitive customer data. The incident triggered an emergency response to mitigate the vulnerability before it could be widely exploited.
Diagnosis Steps:
Analyzed network traffic logs for unusual access patterns.
Examined proxy configuration and authentication rules.
Reviewed recent changes to the proxy deployment.
Tested authentication bypass scenarios in a controlled environment.
Collaborated with the proxy vendor to understand the vulnerability.
Root Cause:
The investigation revealed a critical vulnerability in the proxy's request handling:
1. The proxy incorrectly handled certain malformed HTTP headers
2. This allowed attackers to inject specially crafted headers that bypassed authentication checks
3. The vulnerability existed in multiple versions of the proxy software
4. The issue was in the core request processing pipeline, affecting all authentication methods
5. The vulnerability had not been publicly disclosed or patched
Fix/Workaround:
• Implemented immediate mitigations to block the attack
• Deployed a custom Lua filter to validate and sanitize incoming requests (see the sketch after this list)
• Applied network policies to restrict access to sensitive services
• Worked with the vendor to develop and test a proper fix
• Deployed the vendor-provided emergency patch across all environments
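The precise validation logic depended on the undisclosed exploit pattern, so the configuration below is only a sketch of the mechanism used: an Envoy HTTP Lua filter inserted ahead of the router filter that rejects requests whose Authorization header carries obviously malformed content (the header name and the rejection rule are illustrative).
http_filters:
- name: envoy.filters.http.lua
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
    inline_code: |
      function envoy_on_request(request_handle)
        local auth = request_handle:headers():get("authorization")
        -- Reject values containing control characters, a common ingredient
        -- in header-smuggling style authentication bypasses (illustrative check).
        if auth ~= nil and string.match(auth, "%c") then
          request_handle:respond({[":status"] = "400"}, "malformed request")
        end
      end
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router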
Lessons Learned:
Zero-day vulnerabilities in network components require rapid response capabilities and defense-in-depth strategies.
How to Avoid:
Implement defense-in-depth with multiple security layers.
Deploy network monitoring to detect unusual access patterns.
Regularly update and patch network components.
Use network policies to restrict access between services.
Implement custom security filters for critical components.
No summary provided
What Happened:
A large financial services company implemented Kubernetes Network Policies to enhance their security posture by enforcing the principle of least privilege for pod-to-pod communication. After deploying a new set of policies, several critical microservices began experiencing connection timeouts and failures. The issue was particularly severe for a payment processing service that suddenly couldn't communicate with its dependent services. The incident occurred during business hours and affected customer transactions, triggering a high-severity incident response.
Diagnosis Steps:
Analyzed connection failures in service logs.
Reviewed recently deployed Network Policies.
Tested connectivity between affected services using debug pods.
Examined Calico logs for policy enforcement decisions.
Traced network paths between services using network tools.
Root Cause:
The investigation revealed multiple issues with the Network Policy implementation:
1. The new policies used overly restrictive pod selectors that didn't account for all service instances
2. The policies incorrectly specified namespaceSelector criteria, blocking cross-namespace communication
3. The ingress rules didn't properly account for all required ports and protocols
4. Some policies had conflicting rules that resulted in unexpected deny decisions
5. The policy testing process didn't validate all communication paths before deployment
Fix/Workaround:
• Implemented immediate fix to restore service
• Temporarily disabled the problematic Network Policies
• Corrected the pod and namespace selectors in the policies (see the selector sketch after this list)
• Added comprehensive ingress and egress rules for all required communication
• Implemented proper testing procedures for Network Policy changes
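One selector mistake worth calling out explicitly, because it can produce either unintended blocks or unintended allows: listing namespaceSelector and podSelector as separate entries under from ORs them together, while putting them in a single entry ANDs them. A sketch with illustrative labels:
# Two separate peers: allows ANY pod in namespaces labeled team=payments,
# OR pods labeled app=payment-api in the policy's own namespace
ingress:
- from:
  - namespaceSelector:
      matchLabels:
        team: payments
  - podSelector:
      matchLabels:
        app: payment-api
# One combined peer: allows only pods labeled app=payment-api that are ALSO
# running in namespaces labeled team=payments
ingress:
- from:
  - namespaceSelector:
      matchLabels:
        team: payments
    podSelector:
      matchLabels:
        app: payment-api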
Lessons Learned:
Kubernetes Network Policies require careful planning, testing, and validation to avoid unintended consequences.
How to Avoid:
Implement Network Policies incrementally, starting with monitoring mode.
Create comprehensive test procedures for policy changes.
Document all required service communication paths.
Use visualization tools to understand policy effects before deployment.
Implement canary deployments for policy changes.