# Docker Scenarios

No summary provided
What Happened:
After a system update, containers on the same bridge network could no longer communicate. DNS resolution between containers failed, and ping requests timed out.
Diagnosis Steps:
Verified containers were on the same network using `docker network inspect bridge`.
Checked iptables rules with `iptables -L -n` and found DROP rules for inter-container traffic.
Examined Docker daemon logs with `journalctl -u docker.service`.
Discovered that a security tool had modified iptables rules.
Root Cause:
A security hardening script had added iptables rules that blocked Docker's bridge network traffic.
Fix/Workaround:
• Temporarily restored connectivity by running:
iptables -I FORWARD -i docker0 -o docker0 -j ACCEPT
• Modified the security hardening script to whitelist Docker networks.
• Restarted the Docker daemon to apply changes.
Lessons Learned:
Docker networking relies on iptables rules that can be affected by system-level changes.
How to Avoid:
Document all security tools and their effects on networking.
Test container connectivity after system updates.
Use Docker's user-defined networks instead of the default bridge for better isolation and DNS resolution.
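A minimal sketch of that last point, assuming two throwaway containers (the network name, container names, and alpine image are only illustrative):
```bash
# Create a user-defined bridge network; containers attached to it get
# automatic DNS resolution by container name, unlike the default bridge.
docker network create app-net

# Run two containers on the network (alpine is used here purely as an example).
docker run -d --name web --network app-net alpine sleep 3600
docker run -d --name worker --network app-net alpine sleep 3600

# Re-run these checks after any system update or firewall change.
docker exec worker ping -c 3 web
docker network inspect app-net
```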
No summary provided
What Happened:
A microservice deployment started failing in production with "ImagePullBackOff" errors. The Docker image had grown to over 4GB, causing timeouts during the pull process.
Diagnosis Steps:
Examined the Dockerfile and found no multi-stage builds.
Used `docker history <image>` to analyze layer sizes.
Discovered large temporary files and build artifacts in the final image.
Found that the .dockerignore file was missing, causing unnecessary files to be included.
Root Cause:
Poor Dockerfile practices led to image bloat, including: 1. No multi-stage builds 2. Missing .dockerignore file 3. Temporary files not cleaned up in the same layer they were created
Fix/Workaround:
• Implemented multi-stage builds to separate build and runtime environments.
• Added a comprehensive .dockerignore file.
• Combined RUN commands to reduce layer count and clean up in the same layer.
• Final image size reduced from 4GB to 200MB.
Lessons Learned:
Docker image size directly impacts deployment reliability and speed.
How to Avoid:
Use multi-stage builds for compiled applications.
Implement and maintain a proper .dockerignore file.
Audit image sizes as part of CI/CD pipeline.
Use tools like DockerSlim or dive to analyze and optimize images.
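A rough sketch of how an image-size audit could be wired into CI, assuming a hypothetical image name and an example 500 MB budget:
```bash
#!/usr/bin/env bash
set -euo pipefail

IMAGE="${1:-myorg/service:latest}"   # hypothetical image name
MAX_MB=500                           # example size budget

# List layer sizes to spot bloated steps.
docker history --no-trunc --format 'table {{.Size}}\t{{.CreatedBy}}' "$IMAGE"

# Fail the pipeline if the image exceeds the agreed budget.
SIZE_MB=$(( $(docker image inspect --format '{{.Size}}' "$IMAGE") / 1024 / 1024 ))
if [ "$SIZE_MB" -gt "$MAX_MB" ]; then
  echo "Image is ${SIZE_MB}MB, above the ${MAX_MB}MB budget" >&2
  exit 1
fi
```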
No summary provided
What Happened:
A Java microservice was deployed with 2GB memory limit, but kept getting OOMKilled. The JVM was not respecting the container memory limits.
Diagnosis Steps:
Reviewed container logs showing OOMKilled errors.
Checked memory usage with `docker stats` during high load.
Analyzed JVM arguments and found no explicit memory settings.
Used `jcmd <pid> VM.native_memory` inside the container to check memory usage.
Root Cause:
The Java application was running on a pre-Java 10 runtime that does not recognize container memory limits. The JVM was allocating memory based on host resources, not container limits.
Fix/Workaround:
• Added explicit JVM memory settings:
JAVA_OPTS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"
• Updated to Java 11 which has better container awareness.
• Adjusted container memory request and limit to realistic values based on application profiling.
Lessons Learned:
Not all applications are container-aware, especially older Java applications.
How to Avoid:
Use container-aware JVM versions (Java 10+).
Always set explicit memory limits for JVM applications.
Test applications under memory pressure before production deployment.
Monitor container memory usage patterns to set appropriate limits.
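A quick check that the JVM really derives its heap from the container limit rather than host memory (the eclipse-temurin:17 image is just an example base):
```bash
# Start a throwaway JVM with a 512 MB container limit and print the heap it computes.
# With -XX:MaxRAMPercentage=75.0, MaxHeapSize should come out near 384 MB, which
# proves the JVM is reading the cgroup limit rather than the host's memory.
docker run --rm -m 512m eclipse-temurin:17 \
  java -XX:MaxRAMPercentage=75.0 -XX:+PrintFlagsFinal -version | grep MaxHeapSize
```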
No summary provided
What Happened:
During a routine deployment, containers were recreated as expected, but the team discovered that important application data had disappeared. Users reported missing configurations and content that had been present before the deployment.
Diagnosis Steps:
Checked container logs for error messages related to data access.
Inspected volume configurations with `docker volume ls` and `docker volume inspect`.
Reviewed the docker-compose.yml and Dockerfile for volume definitions.
Examined deployment scripts and found containers were being removed with the `-v` flag.
Root Cause:
The deployment script was using `docker rm -v`, which removes associated anonymous volumes. The application was using anonymous volumes instead of named volumes, causing data to be deleted when containers were removed.
Fix/Workaround:
• Restored data from the most recent backup.
• Modified docker-compose.yml to use named volumes instead of anonymous volumes:
volumes:
app_data:
name: app_data
• Updated deployment scripts to preserve volumes during container recreation.
• Added volume backup steps to the deployment process.
Lessons Learned:
Docker volume persistence requires explicit configuration and careful handling during container lifecycle operations.
How to Avoid:
Always use named volumes for persistent data.
Never use the `-v` flag with `docker rm` unless data loss is acceptable.
Include volume backup in deployment procedures.
Document volume architecture and persistence requirements.
Test data persistence during container recreation in staging environments.
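A small sketch of why named volumes survive `docker rm -v` while anonymous ones do not (volume and container names are illustrative):
```bash
# Named volume: created explicitly and referenced by name.
docker volume create app_data
docker run -d --name db -v app_data:/var/lib/data alpine sleep 3600

# Removing the container, even with -v, leaves the *named* volume intact;
# -v only deletes anonymous volumes that were created for the container.
docker rm -f -v db
docker volume inspect app_data   # still present

# Back the volume up before a deployment, e.g. into a tarball on the host.
docker run --rm -v app_data:/data -v "$PWD":/backup alpine \
  tar czf /backup/app_data.tgz -C /data .
```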
No summary provided
What Happened:
A security audit revealed that containers belonging to different customers could communicate with each other, despite being on separate Docker networks. This violated the multi-tenancy isolation guarantees.
Diagnosis Steps:
Verified network configurations using `docker network inspect`.
Tested connectivity between containers with `docker exec <container> ping <other-container>`.
Analyzed iptables rules and Docker network settings.
Discovered a custom network plugin that was incorrectly configured.
Root Cause:
A custom Docker network plugin was misconfigured, allowing traffic between networks that should have been isolated. The plugin was not properly validating network boundaries.
Fix/Workaround:
• Temporarily reverted to the default bridge driver with manual isolation.
• Reconfigured the network plugin with proper isolation rules.
• Implemented additional network policy enforcement with iptables.
• Added network security monitoring to detect cross-network traffic.
Lessons Learned:
Docker network isolation requires validation beyond default configurations, especially with custom plugins.
How to Avoid:
Regularly test network isolation with security scans.
Implement defense in depth with multiple isolation mechanisms.
Validate custom network plugins thoroughly before deployment.
Use overlay networks with encryption for sensitive multi-tenant environments.
Consider using Kubernetes NetworkPolicy or service mesh for more robust isolation.
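A basic isolation smoke test along these lines can be scripted into a recurring security scan; the networks, names, and image below are placeholders:
```bash
# Two tenants on two separate user-defined networks.
docker network create tenant-a
docker network create tenant-b
docker run -d --name svc-a --network tenant-a alpine sleep 3600
docker run -d --name svc-b --network tenant-b alpine sleep 3600

# Cross-network traffic must fail; a successful ping indicates broken isolation.
SVC_B_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' svc-b)
if docker exec svc-a ping -c 2 -W 2 "$SVC_B_IP" >/dev/null 2>&1; then
  echo "ISOLATION FAILURE: tenant-a can reach tenant-b" >&2
  exit 1
fi
echo "Isolation OK"
```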
No summary provided
What Happened:
CI/CD pipelines across the organization suddenly started failing with "unauthorized: authentication required" errors when trying to pull images from the private Docker registry.
Diagnosis Steps:
Verified registry status and connectivity.
Tested manual login with `docker login <registry>`.
Checked registry logs for authentication failures.
Reviewed recent changes to authentication configuration.
Found that the registry authentication certificate had expired.
Root Cause:
The TLS certificate used for the Docker registry had expired, causing all authentication attempts to fail. The certificate expiration monitoring had not been properly configured.
Fix/Workaround:
• Generated and installed a new TLS certificate for the registry.
• Temporarily allowed insecure registry access for critical pipelines:
"insecure-registries": ["registry.example.com:5000"]
• Updated Docker daemon configurations across all build agents.
• Restarted the registry service and verified authentication.
Lessons Learned:
Registry authentication depends on valid TLS certificates that require lifecycle management.
How to Avoid:
Implement certificate expiration monitoring and alerting.
Use automated certificate renewal with tools like cert-manager or Let's Encrypt.
Document certificate renewal procedures.
Maintain a backup registry or mirror for critical images.
Test registry authentication regularly as part of infrastructure validation.
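Expiry monitoring can be as simple as a scheduled openssl check like the following sketch (the registry address is the placeholder from this scenario):
```bash
#!/usr/bin/env bash
# Warn if the registry's TLS certificate expires within the next 30 days.
REGISTRY="registry.example.com:5000"   # placeholder from this scenario

# openssl x509 -checkend exits non-zero if the cert expires within N seconds.
if ! echo | openssl s_client -connect "$REGISTRY" -servername "${REGISTRY%%:*}" 2>/dev/null \
    | openssl x509 -noout -checkend $((30 * 24 * 3600)); then
  echo "Certificate for $REGISTRY expires within 30 days" >&2
  exit 1
fi
```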
No summary provided
What Happened:
Build agents started experiencing Docker commands hanging indefinitely. New containers couldn't be created, and existing ones couldn't be managed. The entire CI/CD pipeline ground to a halt.
Diagnosis Steps:
Checked Docker daemon status with `systemctl status docker`.
Examined Docker daemon logs with `journalctl -u docker`.
Monitored system resources with `top`, `free`, and `df`.
Used `docker system df` to check Docker's resource usage.
Found an excessive number of unused images and containers.
Root Cause:
The Docker daemon had exhausted disk space due to accumulated images, containers, and volumes. The cleanup jobs had failed silently for weeks, allowing resources to accumulate.
Fix/Workaround:
• Manually freed space by removing unused resources:
docker system prune -af --volumes
• Restarted the Docker daemon.
• Implemented proper resource cleanup in CI/CD jobs.
• Added monitoring for Docker resource usage.
Lessons Learned:
Docker resources accumulate over time and require active management to prevent system-wide failures.
How to Avoid:
Implement regular automated cleanup of unused Docker resources.
Add monitoring for Docker disk usage with alerting thresholds.
Configure build jobs to clean up after themselves.
Use separate volumes for Docker data to isolate from system partitions.
Consider using container image garbage collection policies.
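A conservative cleanup job of the kind described above, suitable for a cron entry on build agents (the 7-day retention window and 80% threshold are examples):
```bash
#!/usr/bin/env bash
set -euo pipefail

# Record what Docker is consuming before cleanup (useful in job logs).
docker system df

# Remove stopped containers, unused images/networks and build cache older than
# 7 days. Volumes are deliberately left alone to avoid data loss.
docker system prune -af --filter "until=168h"
docker builder prune -af --filter "until=168h"

# Warn if the Docker data directory is still above 80% usage.
USAGE=$(df --output=pcent /var/lib/docker | tail -n 1 | tr -dc '0-9')
if [ "$USAGE" -ge 80 ]; then
  echo "WARNING: /var/lib/docker is at ${USAGE}% usage" >&2
fi
```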
No summary provided
What Happened:
Developers reported random build failures where Docker couldn't find files that were definitely in the build context. The same Dockerfile would build successfully on some machines but fail on others.
Diagnosis Steps:
Verified file presence in the build context.
Tried building with the `--no-cache` flag and observed consistent success.
Pruned the Docker build cache with `docker builder prune -f` and observed improved reliability.
Analyzed the layer cache in /var/lib/docker/overlay2/.
Root Cause:
Docker's build cache had become corrupted, causing inconsistent behavior during layer creation. The corruption was likely due to abrupt daemon shutdowns during active builds.
Fix/Workaround:
• Cleared the Docker build cache:
docker builder prune -a -f
• Implemented more graceful shutdown procedures for the Docker daemon.
• Added build cache validation steps to CI/CD pipelines.
• Used BuildKit for more reliable caching behavior.
Lessons Learned:
Docker's build cache can become corrupted and cause mysterious, intermittent failures.
How to Avoid:
Periodically clear build cache in CI/CD environments.
Use BuildKit for improved cache handling.
Implement proper daemon shutdown procedures.
Add cache validation steps before critical builds.
Consider using remote build cache for consistent behavior across environments.
No summary provided
What Happened:
A security incident was detected where an attacker exploited a vulnerability to escape from a container and execute commands on the host system. The breach was discovered during a security audit.
Diagnosis Steps:
Reviewed security logs and found unusual process executions on the host.
Analyzed container configurations and discovered privileged mode was enabled.
Checked Docker version and found it had known CVEs related to container escapes.
Examined container runtime options and found dangerous capabilities granted.
Root Cause:
Multiple security misconfigurations combined: 1. Container was running in privileged mode 2. Docker daemon was running an outdated version with known vulnerabilities 3. Unnecessary Linux capabilities were granted to the container 4. Host filesystem was mounted inside the container
Fix/Workaround:
• Immediately stopped and removed the compromised containers.
• Updated Docker to the latest version with security patches.
• Removed privileged mode and unnecessary capabilities from all containers.
• Implemented proper volume mounting with read-only access where possible.
• Added AppArmor/SELinux profiles for additional container isolation.
Lessons Learned:
Container isolation is not absolute and requires careful security configuration.
How to Avoid:
Never run containers in privileged mode unless absolutely necessary.
Keep Docker and container runtimes updated with security patches.
Use security scanning tools to detect container vulnerabilities.
Implement defense in depth with multiple security layers.
Follow the principle of least privilege for container capabilities.
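A hedged example of what a locked-down `docker run` following these points might look like (image name, port, and the added capability are placeholders for whatever the service actually needs):
```bash
# Start from zero capabilities and add back only what the service needs;
# avoid --privileged entirely and keep the root filesystem read-only.
docker run -d --name api \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --security-opt no-new-privileges \
  --read-only \
  --tmpfs /tmp \
  -p 8080:8080 \
  myorg/api:1.0.0
```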
No summary provided
What Happened:
A monitoring system detected unusual network traffic from production containers. Investigation revealed that a cryptocurrency miner had been installed through a compromised dependency in a public Docker image.
Diagnosis Steps:
Identified suspicious network connections using `netstat -tulpn`.
Used `ps aux` to find unusual processes consuming CPU resources.
Analyzed the Docker image layers with `docker history`.
Traced the malicious code to a dependency in a public base image.
Root Cause:
The team was using a public Docker image from Docker Hub without verification. The image maintainer's account had been compromised, and a recent update included malicious code.
Fix/Workaround:
• Immediately removed and replaced the compromised containers.
• Built custom images from verified base images.
• Implemented image scanning in the CI/CD pipeline.
• Created an internal registry with vetted images only.
Lessons Learned:
Public container images can introduce supply chain vulnerabilities if not properly verified.
How to Avoid:
Use official images or build your own from scratch.
Implement image scanning for vulnerabilities and malware.
Pin image versions with SHA digests, not just tags.
Maintain an internal registry of verified images.
Regularly audit and update base images.
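Digest pinning with the standard CLI looks roughly like this (image names are examples; the digest placeholder is whatever the inspect command returns):
```bash
# Resolve the digest of the tag that was actually tested and approved.
docker pull nginx:1.25
docker inspect --format '{{index .RepoDigests 0}}' nginx:1.25
# -> nginx@sha256:<digest>

# Deploy by digest, not by tag, so a later push to the same tag cannot
# silently change what runs in production (substitute the digest from above).
docker run -d nginx@sha256:<digest>
```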
No summary provided
What Happened:
During an incident investigation, the team discovered that container logs were missing for the past week. The logging system showed no data despite containers running normally.
Diagnosis Steps:
Checked the Docker logging configuration with `docker info | grep Logging`.
Verified logging driver settings in daemon.json.
Tested log generation with a test container.
Examined disk space and found the log partition was full.
Root Cause:
The logging driver was configured to use the json-file driver with no log rotation or size limits. The log files grew until they filled the disk, causing the Docker daemon to silently fail writing new logs.
Fix/Workaround:
• Freed up disk space by removing old log files.
• Configured log rotation and size limits:
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
}
• Restarted the Docker daemon to apply changes.
• Implemented monitoring for log disk usage.
Lessons Learned:
Docker logging requires explicit configuration to prevent resource exhaustion.
How to Avoid:
Always configure log rotation and size limits.
Monitor log storage usage.
Consider using centralized logging solutions like fluentd or logstash.
Test logging configuration under load.
Implement alerts for logging failures.
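If the daemon-wide default cannot be rolled out immediately, the same limits can be applied per container; the image name here is a placeholder:
```bash
# Per-container log rotation, overriding the daemon default.
docker run -d \
  --log-driver json-file \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  myorg/app:latest
```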
No summary provided
What Happened:
Users reported slow response times in a containerized application. Monitoring showed average CPU usage at only 30%, but application performance metrics indicated CPU-related slowdowns.
Diagnosis Steps:
Analyzed container CPU metrics using `docker stats`.
Checked for CPU throttling events with `docker inspect`.
Used `docker run --cpus=X` to test different CPU limit configurations.
Monitored CPU usage patterns with fine-grained metrics.
Root Cause:
The container had a CPU limit set that was too low for handling burst workloads. The application experienced CPU throttling during peak processing, causing latency spikes despite low average utilization.
Fix/Workaround:
• Increased CPU limits to accommodate burst workloads.
• Implemented CPU shares instead of hard limits for better resource sharing.
• Optimized the application to handle CPU throttling more gracefully.
• Added detailed monitoring for CPU throttling events.
Lessons Learned:
Container CPU limits can cause performance issues that aren't obvious from average utilization metrics.
How to Avoid:
Size container CPU limits based on peak usage, not averages.
Use CPU shares for flexible resource allocation in multi-tenant environments.
Monitor for CPU throttling events, not just utilization.
Test application performance under CPU constraints.
Consider using horizontal scaling instead of vertical for handling load variations.
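Throttling can be read directly from the cgroup statistics rather than inferred from averages; the path below assumes a cgroup v2 host (on cgroup v1 the counters live under the cpu controller's cpu.stat):
```bash
# Read throttling counters for a running container (cgroup v2 layout assumed).
# nr_throttled / throttled_usec climbing while average CPU stays low is the
# signature of an undersized CPU limit.
docker exec <container> cat /sys/fs/cgroup/cpu.stat
```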
No summary provided
What Happened:
Services running in a Docker Swarm cluster spanning multiple data centers experienced random network timeouts when communicating. The issue only occurred for larger data transfers between specific data centers.
Diagnosis Steps:
Tested connectivity with various packet sizes using ping with different payload sizes.
Analyzed network captures with tcpdump to observe packet fragmentation.
Compared MTU settings across different network paths.
Discovered MTU mismatch between overlay network and underlying physical network.
Root Cause:
The Docker overlay network was configured with the default MTU of 1500, but the VPN connection between data centers had a lower MTU of 1400. This caused packet fragmentation and, in some cases, dropped packets due to "DF" (Don't Fragment) flags.
Fix/Workaround:
• Adjusted the overlay network MTU to match the lowest MTU in the path:
docker network create --driver overlay --opt com.docker.network.driver.mtu=1400 my-network
• Updated existing networks by recreating them with the correct MTU.
• Implemented path MTU discovery monitoring.
Lessons Learned:
Network MTU mismatches can cause subtle, intermittent connectivity issues that are difficult to diagnose.
How to Avoid:
Always check MTU settings when spanning networks across different environments.
Test network connectivity with various packet sizes.
Document network MTU requirements for multi-datacenter deployments.
Consider using TCP MSS clamping as an alternative solution.
Implement monitoring for packet fragmentation and MTU-related issues.
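A quick way to probe the usable path MTU, as in the diagnosis above (the host name and payload sizes are examples):
```bash
# Send pings with the Don't Fragment bit set; the largest payload that gets
# through, plus 28 bytes of IP/ICMP headers, is the usable path MTU.
ping -M do -s 1472 -c 3 remote-host   # succeeds only if the path MTU is >= 1500
ping -M do -s 1372 -c 3 remote-host   # succeeds if the path MTU is >= 1400 (the VPN case here)
```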
No summary provided
What Happened:
A new deployment to production failed because the container was missing the application binary. The build process completed successfully, but the resulting container was essentially empty.
Diagnosis Steps:
Examined the Dockerfile and found a multi-stage build configuration.
Checked the final stage and discovered it wasn't correctly copying artifacts from the build stage.
Verified the build stage was producing the expected artifacts.
Tested the build locally with `docker build --no-cache` to reproduce the issue.
Root Cause:
The multi-stage Dockerfile had an incorrect COPY instruction in the final stage. It was using a path that didn't match where the build stage was outputting the compiled binary.
Fix/Workaround:
• Corrected the COPY instruction in the Dockerfile:
# Before
COPY --from=builder /go/bin/app /app
# After
COPY --from=builder /go/src/app/bin/app /app
• Added a verification step in the build process to check for the presence of critical files.
• Implemented a simple healthcheck in the container to validate the application's presence.
Lessons Learned:
Multi-stage builds require careful coordination between stages, especially regarding file paths.
How to Avoid:
Test Docker builds with `--no-cache` to ensure reproducibility.
Add validation steps to verify the presence of critical artifacts.
Use consistent path conventions across build stages.
Implement container healthchecks to catch missing executables early.
Consider using Docker BuildKit which provides better error messages for multi-stage builds.
No summary provided
What Happened:
CI/CD pipelines started failing when building Docker images from a monorepo. The error message indicated that the build context was too large, causing timeouts during the initial context upload.
Diagnosis Steps:
Measured the build context size with `du -sh .`.
Examined the repository structure and found large binary assets and test data.
Checked for the presence of a .dockerignore file.
Tested build with various context sizes to identify the threshold.
Root Cause:
The repository had grown to include large binary assets, test data, and historical artifacts. Without a proper .dockerignore file, all of these were being sent to the Docker daemon as part of the build context.
Fix/Workaround:
• Created a comprehensive .dockerignore file:
**/.git
**/node_modules
**/*.log
**/test-data
**/*.mp4
**/*.zip
**/dist
• Moved large binary assets to external storage.
• Restructured the repository to separate application code from data.
• Increased the Docker daemon timeout for larger builds.
Lessons Learned:
Docker build context size can significantly impact build performance and reliability.
How to Avoid:
Always use a .dockerignore file, especially in large repositories.
Regularly audit repository size and content.
Store large binary assets outside the main repository.
Consider using multi-repo architecture for very large projects.
Use Docker BuildKit which has better handling of large contexts.
No summary provided
What Happened:
Docker builds that previously took 2-3 minutes started taking 15-20 minutes, despite no significant changes to the application. The issue was particularly noticeable in CI/CD pipelines.
Diagnosis Steps:
Analyzed build times for each layer using `docker build --progress=plain`.
Compared the Dockerfile with previous versions to identify changes.
Examined the dependency installation step which was taking the most time.
Tested builds with and without the build cache.
Root Cause:
A recent change to the Dockerfile had moved the `COPY package.json` instruction before the `WORKDIR` instruction. This caused the dependency installation layer to be invalidated on every build, preventing effective caching.
Fix/Workaround:
• Reordered Dockerfile instructions to optimize caching:
# Before
COPY package.json .
WORKDIR /app
RUN npm install
# After
WORKDIR /app
COPY package.json .
RUN npm install
• Split the dependency installation into multiple layers for better granularity.
• Implemented a dependency cache volume in CI/CD pipelines.
Lessons Learned:
Docker layer caching is highly sensitive to the order of instructions and file changes.
How to Avoid:
Order Dockerfile instructions from least to most frequently changing.
Copy only necessary files for each step.
Use multi-stage builds to separate build dependencies from runtime.
Monitor build times and investigate sudden increases.
Consider using BuildKit's improved caching mechanisms.
No summary provided
What Happened:
A minor issue in one service caused its health check to fail. This triggered a restart loop that affected dependent services, eventually bringing down the entire application stack.
Diagnosis Steps:
Reviewed Docker Compose logs to identify the initial failing service.
Examined health check configurations in docker-compose.yml.
Analyzed the dependencies between services.
Tested the health check logic in isolation.
Root Cause:
A health check was configured with too strict parameters (interval: 5s, timeout: 3s, retries: 1) and was checking an endpoint that occasionally experienced normal latency spikes. When the health check failed, it triggered a restart, which caused dependent services to also restart due to the depends_on configuration.
Fix/Workaround:
• Adjusted health check parameters to be more tolerant:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
• Implemented circuit breaker patterns in service dependencies.
• Added graceful degradation for non-critical service failures.
Lessons Learned:
Overly sensitive health checks can cause more harm than good in a microservices environment.
How to Avoid:
Configure health checks with appropriate thresholds for the service's characteristics.
Implement start_period to allow services time to initialize.
Design services to degrade gracefully when dependencies are unavailable.
Use circuit breakers to prevent cascading failures.
Test failure scenarios as part of regular system validation.
No summary provided
What Happened:
A security audit discovered that production database credentials were visible in the Docker image history. Further investigation revealed that an attacker had already exploited these credentials to access sensitive data.
Diagnosis Steps:
Used `docker history --no-trunc <image>` to examine layer commands.
Found API keys and database credentials passed as build arguments.
Checked the Dockerfile and found credentials directly in ENV and ARG instructions.
Reviewed access logs for the exposed services to identify unauthorized access.
Root Cause:
Secrets were being passed as build arguments and environment variables directly in the Dockerfile, which preserved them in the image metadata and layer history.
Fix/Workaround:
• Immediately rotated all exposed credentials.
• Implemented Docker secrets for runtime secret management.
• Used multi-stage builds to prevent secrets from appearing in the final image.
• Added secret scanning to the CI/CD pipeline.
Lessons Learned:
Docker image layers permanently store the commands used to create them, including any secrets passed in those commands.
How to Avoid:
Never use build arguments (ARG) for secrets.
Use Docker BuildKit's secret mounting:
```dockerfile
RUN --mount=type=secret,id=api_key cat /run/secrets/api_key
```
For older Docker versions, use multi-stage builds with temporary files.
Implement secret scanning in CI/CD pipelines.
Use runtime secret injection rather than build-time secrets.
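For completeness, the build-side counterpart of the secret mount shown above looks roughly like this; the secret id matches the Dockerfile snippet and the source path is a placeholder:
```bash
# The secret is exposed only to the RUN step that mounts it and is never
# written into an image layer or the build history.
DOCKER_BUILDKIT=1 docker build \
  --secret id=api_key,src="$HOME/.secrets/api_key" \
  -t myorg/app:latest .
```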
No summary provided
What Happened:
A security incident was detected where an attacker gained root access to a host by exploiting a container with the Docker socket mounted. The attacker was able to create privileged containers and escape to the host.
Diagnosis Steps:
Reviewed container configurations and found /var/run/docker.sock mounted in a Jenkins agent container.
Checked access logs and found suspicious container creations.
Analyzed audit logs to trace the attack path.
Verified which users and services had access to containers with the socket mounted.
Root Cause:
A Jenkins agent container had the Docker socket mounted to allow for Docker-in-Docker operations. This effectively gave the container—and anyone who could execute commands in it—full control over the Docker daemon and, by extension, the host.
Fix/Workaround:
• Immediately removed the Docker socket mount from all containers.
• Implemented a more secure Docker-in-Docker solution using Docker-outside-of-Docker (DooD) pattern.
• Added socket access controls using a proxy like docker-proxy.
• Implemented least privilege principles for CI/CD containers.
Lessons Learned:
Mounting the Docker socket in a container effectively gives that container root access to the host.
How to Avoid:
Never mount the Docker socket in containers unless absolutely necessary.
If required, use a proxy with access controls like docker-proxy or socks.
Implement container security scanning to detect socket mounts.
Use rootless Docker or Podman for safer container operations.
Consider alternatives like Kaniko for building containers without Docker socket access.
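A small detection loop that flags any running container with the Docker socket mounted can back up the scanning point above; this is a sketch, not a complete audit:
```bash
# Flag running containers that bind-mount the Docker socket from the host.
for c in $(docker ps -q); do
  if docker inspect -f '{{range .Mounts}}{{.Source}} {{end}}' "$c" | grep -q 'docker.sock'; then
    echo "WARNING: container $(docker inspect -f '{{.Name}}' "$c") mounts the Docker socket"
  fi
done
```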
No summary provided
What Happened:
A critical production deployment failed because it pulled an incompatible version of a service. Investigation revealed that two different teams had pushed different images with the same tag to the shared registry.
Diagnosis Steps:
Compared the deployed image with the expected version using `docker inspect`.
Checked image push history in the registry logs.
Reviewed CI/CD pipeline configurations across teams.
Found that two teams were using the same image name and tag convention.
Root Cause:
Two teams were using the same image repository and tag format in a shared registry. Team A pushed an image with tag `v1.2.0`, which was later overwritten by Team B pushing a different application with the same tag.
Fix/Workaround:
• Implemented unique namespacing for each team's images:
registry.example.com/team-a/service-name:v1.2.0
registry.example.com/team-b/service-name:v1.2.0
• Added registry configuration to prevent tag overwrites.
• Implemented SHA digest pinning for critical deployments.
• Created an incident response to restore the correct image version.
Lessons Learned:
Docker tags are mutable and can lead to unexpected behavior if not properly managed.
How to Avoid:
Use namespaces to separate images by team or application.
Pin deployments to image digests (SHA256) instead of tags.
Configure registries to prevent tag overwrites.
Implement image signing and verification for critical applications.
Use semantic versioning consistently across all teams.
No summary provided
What Happened:
A long-running container started experiencing performance issues after several weeks of operation. CPU usage remained normal, but the application became increasingly unresponsive.
Diagnosis Steps:
Used `docker top <container>` to examine processes.
Found numerous zombie processes (marked with 'Z' status).
Examined the application code and found child processes being spawned but not properly reaped.
Checked the container's init system configuration.
Root Cause:
The application was spawning child processes that were not being properly waited for. In a container environment without a proper init system, these became zombie processes that accumulated over time, consuming resources.
Fix/Workaround:
• Modified the application to properly handle child process termination.
• Implemented a proper init system in the container:
FROM node:16
# Add tini
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["node", "app.js"]
• Added regular container restarts as a temporary measure.
• Implemented monitoring for zombie process accumulation.
Lessons Learned:
Containers lack traditional init systems that handle zombie process reaping.
How to Avoid:
Use a lightweight init system like tini or dumb-init in containers.
Properly handle child processes in application code.
Implement regular health checks that detect zombie processes.
Consider using Docker's `--init` flag when running containers (see the example below).
Design applications to avoid spawning unnecessary child processes.
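The example mentioned in the list above: Docker's built-in init avoids changing the image at all (the image name is a placeholder for the Node.js service in this scenario):
```bash
# --init runs a tiny init (tini) as PID 1 in the container, so exited children
# are reaped even if the application itself never calls wait().
docker run -d --init --name app myorg/node-app:latest
```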
No summary provided
What Happened:
A high-traffic Java application suddenly crashed during peak load. The error logs showed "cannot fork" and "resource temporarily unavailable" errors, despite the container having plenty of CPU and memory resources.
Diagnosis Steps:
Examined container logs for error messages.
Checked system resource usage with `docker stats`.
Used `docker exec <container> cat /proc/sys/kernel/pid_max` to check PID limits.
Monitored process count with `docker exec <container> ps -ef | wc -l`.
Root Cause:
The container hit the default PID limit (typically 4096 processes/threads). The Java application was creating numerous threads under load, and the container's PID namespace restricted the total number of processes.
Fix/Workaround:
• Increased the PID limit for the container:
docker run --pids-limit=10000 <image>
• Optimized the application to use fewer threads.
• Implemented thread pooling with fixed maximum sizes.
• Added monitoring for process/thread count.
Lessons Learned:
Containers have resource limits beyond CPU and memory that can affect application performance.
How to Avoid:
Set appropriate PID limits based on application requirements.
Monitor thread/process creation patterns in containerized applications.
Use thread pools with reasonable maximum sizes.
Consider container orchestration platforms that allow PID limit configuration.
Test applications under load to identify potential resource constraints.
No summary provided
What Happened:
A production system was compromised through a known vulnerability in a base image. The vulnerability had been publicly disclosed months earlier, but the affected images hadn't been updated.
Diagnosis Steps:
Conducted emergency vulnerability scanning of all production images.
Identified multiple critical CVEs in base images.
Reviewed the image update and scanning processes.
Checked when the vulnerable images were last rebuilt.
Root Cause:
The team had implemented vulnerability scanning in CI/CD, but had no process for regularly rebuilding images to incorporate security patches. Images were only rebuilt when application code changed, leaving base image vulnerabilities unaddressed.
Fix/Workaround:
• Immediately rebuilt all images with the latest base images.
• Implemented automated weekly rebuilds of all images regardless of code changes.
• Added vulnerability scanning with blocking thresholds in CI/CD.
• Deployed runtime vulnerability scanning and container security monitoring.
Lessons Learned:
Image security requires both point-in-time scanning and ongoing maintenance.
How to Avoid:
Implement automated regular rebuilds of all images.
Use vulnerability scanning in CI/CD with appropriate blocking thresholds.
Subscribe to security advisories for base images.
Implement runtime container security monitoring.
Consider using minimal or distroless base images to reduce the attack surface.
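One common way to wire a blocking threshold into CI is a scanner such as Trivy; the tool choice and image name here are assumptions, and any scanner with severity gating would do:
```bash
# Fail the pipeline if the image contains unfixed HIGH or CRITICAL CVEs.
trivy image --exit-code 1 --severity HIGH,CRITICAL --ignore-unfixed myorg/app:latest

# Scheduled weekly rebuild: pull the latest patched base image and ignore the cache
# so security fixes land even when the application code has not changed.
docker build --pull --no-cache -t myorg/app:latest .
```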
No summary provided
What Happened:
As part of a security hardening initiative, the team migrated to rootless Docker. After the migration, several critical applications failed to start with various permission errors and port binding failures.
Diagnosis Steps:
Examined container logs for specific error messages.
Tested the same containers in rootful mode to confirm the issue was rootless-specific.
Reviewed the application's resource requirements and permissions.
Checked Docker's rootless mode documentation for known limitations.
Root Cause:
Multiple rootless mode limitations affected the application: 1. Inability to bind to privileged ports (<1024) 2. Limited network capabilities 3. Inability to mount certain device files 4. Restrictions on cgroup management
Fix/Workaround:
• Reconfigured applications to use non-privileged ports (>1024).
• Implemented a reverse proxy for services that needed standard ports.
• Modified container networking to use user-defined networks exclusively.
• Updated application code to handle rootless constraints.
Lessons Learned:
Rootless Docker improves security but introduces significant operational constraints.
How to Avoid:
Test applications thoroughly in rootless environments before migration.
Design applications to work without privileged capabilities.
Document rootless mode limitations and workarounds.
Consider using Podman which has better rootless support for some use cases.
Implement proper port mapping and reverse proxies for services requiring privileged ports.
No summary provided
What Happened:
A database container started experiencing high latency and timeout errors. The application logs showed disk I/O operations taking 10-100x longer than expected, despite the host having fast SSD storage.
Diagnosis Steps:
Monitored I/O performance using `iostat` and `docker stats`.
Checked the storage driver configuration with `docker info`.
Tested I/O performance directly on the host vs. inside containers.
Compared performance across different storage drivers.
Root Cause:
The default overlay2 storage driver was causing excessive I/O overhead for the database workload. The copy-on-write nature of the storage driver was particularly inefficient for the database's write-heavy workload.
Fix/Workaround:
• Switched to using volume mounts for database data:
volumes:
- /var/lib/postgresql/data:/var/lib/postgresql/data
• Considered alternative storage drivers like devicemapper in direct-lvm mode.
• Optimized the database configuration for containerized environments.
• Implemented I/O monitoring with alerts for performance degradation.
Lessons Learned:
Docker storage drivers have significant performance implications for I/O-intensive workloads.
How to Avoid:
Use volume mounts for I/O-intensive data instead of storing in container layers.
Benchmark different storage drivers for your specific workload.
Consider using host volumes for databases and other high I/O applications.
Monitor I/O performance regularly to detect degradation.
Document storage driver choices and their performance characteristics.
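A named-volume variant of the fix above, which keeps the data off the copy-on-write layer while avoiding host-path coupling (volume name, tag, and password are examples):
```bash
# Data lives in a named volume on the host filesystem, bypassing the overlay2
# copy-on-write path for the write-heavy database files.
docker volume create pgdata
docker run -d --name db \
  -v pgdata:/var/lib/postgresql/data \
  -e POSTGRES_PASSWORD=example \
  postgres:15
```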
No summary provided
What Happened:
A high-priority application experienced intermittent performance issues despite having dedicated CPU and memory limits. The issues occurred unpredictably and weren't correlated with the application's own load.
Diagnosis Steps:
Monitored system-wide resource usage with `top`, `iostat`, and `vmstat`.
Analyzed container resource usage with `docker stats`.
Correlated performance issues with activity from other containers.
Tested the application in isolation on a dedicated host.
Root Cause:
While CPU and memory were properly limited, other containers on the same host were causing resource contention for shared resources: 1. Disk I/O bandwidth 2. Network bandwidth 3. CPU cache 4. Kernel resources
Fix/Workaround:
• Implemented I/O limits using the blkio cgroup:
docker run --device-write-bps /dev/sda:10mb <image>
• Separated critical workloads to dedicated hosts.
• Used the `--cpu-shares` flag to prioritize important containers.
• Implemented network traffic shaping for better isolation.
Lessons Learned:
Container resource limits don't cover all shared resources, leading to noisy neighbor problems.
How to Avoid:
Implement comprehensive resource limits including I/O and network.
Use dedicated nodes for critical workloads.
Consider using Kubernetes with resource quotas and quality of service classes.
Monitor for resource contention across all subsystems.
Design applications to be resilient to resource variability.
No summary provided
What Happened:
Applications running in containers started experiencing intermittent DNS resolution failures for external domains. The issues occurred randomly and affected multiple containers across different hosts.
Diagnosis Steps:
Tested DNS resolution from inside containers using `nslookup` and `dig`.
Examined Docker daemon logs for DNS-related errors.
Checked /etc/resolv.conf inside containers and on the host.
Monitored DNS query patterns and timing.
Root Cause:
Docker's embedded DNS server was becoming overwhelmed during periods of high container creation/deletion. Additionally, the default timeout for DNS queries was too short for some external domains with slow authoritative nameservers.
Fix/Workaround:
• Configured explicit DNS servers for the Docker daemon:
{
"dns": ["8.8.8.8", "8.8.4.4"]
}
• Increased the DNS timeout settings.
• Implemented DNS caching at the host level.
• Added retry logic for DNS resolution in application code.
Lessons Learned:
Docker's embedded DNS can become a bottleneck in high-scale environments.
How to Avoid:
Configure explicit, reliable DNS servers for Docker.
Implement proper DNS caching.
Monitor DNS resolution performance and failures.
Consider using CoreDNS or other advanced DNS solutions for large deployments.
Add resilience to DNS failures in application code.
No summary provided
What Happened:
A legacy application with a long history of incremental updates suddenly failed to build. The error message indicated that the maximum number of image layers had been exceeded.
Diagnosis Steps:
Analyzed the Dockerfile and found numerous RUN, COPY, and ADD instructions.
Used `docker history <image>` to count the number of layers.
Compared with Docker's documented layer limits.
Reviewed the git history of the Dockerfile to see how it evolved.
Root Cause:
The Dockerfile had accumulated over 127 layers through years of incremental changes. Each developer had added new RUN commands without consolidating existing ones, eventually hitting Docker's layer limit.
Fix/Workaround:
• Consolidated multiple RUN commands using && and \ for line continuation:
# Before
RUN apt-get update
RUN apt-get install -y package1
RUN apt-get install -y package2
# After
RUN apt-get update && \
apt-get install -y package1 package2
• Combined multiple COPY operations.
• Implemented multi-stage builds to reduce final image layers.
• Added Dockerfile linting to the CI/CD pipeline.
Lessons Learned:
Docker images have a hard limit on the number of layers, requiring careful Dockerfile design.
How to Avoid:
Consolidate commands in Dockerfiles to minimize layer count.
Use multi-stage builds to reset layer count in the final image.
Implement Dockerfile linting and best practices in CI/CD.
Regularly audit and refactor Dockerfiles for long-lived projects.
Document layer usage and limits in development guidelines.
No summary provided
What Happened:
After a network partition event, containers could no longer discover each other using the built-in DNS resolution. This caused cascading failures as dependent services became unreachable.
Diagnosis Steps:
Verified network connectivity between hosts using ping and traceroute.
Checked Docker Swarm status with `docker node ls` and found some nodes in "down" state.
Examined Docker daemon logs with `journalctl -u docker.service`.
Used `docker service ls` to check service status and found multiple services in "pending" state.
Tested DNS resolution from inside containers with `nslookup service_name`.
Root Cause:
A network partition had caused the Swarm cluster to split, with managers unable to reach each other. This led to a split-brain scenario where service discovery was inconsistent across the cluster.
Fix/Workaround:
• Restored network connectivity between manager nodes.
• Forced a single manager to be the leader by stopping Docker on other managers:
# On non-leader managers
systemctl stop docker
• Restarted Docker on the remaining manager to establish quorum:
# On the designated leader
systemctl restart docker
• Gradually brought other managers back online and verified cluster health.
• Implemented proper Raft consensus monitoring.
Lessons Learned:
Docker Swarm's service discovery depends on consistent cluster state and requires proper quorum for reliable operation.
How to Avoid:
Deploy manager nodes across different failure domains.
Implement network redundancy between critical infrastructure components.
Configure proper manager/worker ratio (3, 5, or 7 managers maximum).
Monitor Swarm cluster health with alerts for quorum issues.
Consider using Kubernetes for more robust orchestration in complex environments.
Implement application-level service discovery as a fallback mechanism.
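If connectivity cannot be restored and quorum is lost for good, Docker also documents a last-resort recovery from a single surviving manager; this is only a sketch, and cluster state should be verified carefully before re-joining the other managers:
```bash
# On one healthy manager only: rebuild a single-manager cluster that keeps the
# existing services and state, then re-join the remaining managers one by one.
docker swarm init --force-new-cluster
docker node ls   # confirm the cluster is healthy before adding managers back
```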
No summary provided
What Happened:
A critical deployment to a high-security environment failed with cryptic errors. The logs showed that the container runtime was rejecting images due to signature verification failures.
Diagnosis Steps:
Examined container runtime logs for detailed error messages.
Verified image signatures using `docker trust inspect <image>`.
Checked Notary server connectivity and certificate validity.
Reviewed recent changes to the CI/CD pipeline that builds and signs images.
Tested signature verification in a development environment.
Root Cause:
The organization had recently rotated the signing keys used in the CI/CD pipeline, but the new public keys hadn't been distributed to all production environments. Additionally, the old keys had been removed from the Notary server, breaking the chain of trust.
Fix/Workaround:
• Temporarily disabled signature verification for critical deployments:
// /etc/docker/daemon.json
{
"content-trust": false
}
• Distributed the new public keys to all environments:
# Export public key
docker trust key export <key-id> --public > pubkey.pem
# Import on target systems
docker trust key load pubkey.pem --name prod-signer
• Implemented proper key rotation procedures with overlapping validity periods.
• Added monitoring for signature verification failures.
Lessons Learned:
Cryptographic key rotation requires careful planning and coordination across all environments.
How to Avoid:
Maintain an inventory of all environments requiring signature verification.
Implement overlapping validity periods during key rotation.
Automate key distribution as part of infrastructure management.
Test signature verification in staging environments before production.
Use a centralized key management system with proper access controls.
Document key rotation procedures and test them regularly.
No summary provided
What Happened:
Developers reported that identical source code was producing different container images when built with BuildKit. This inconsistency was causing deployment issues and making it difficult to reproduce bugs.
Diagnosis Steps:
Compared image digests from different builds of the same code.
Pruned BuildKit cache entries older than 24 hours with `docker builder prune --filter until=24h`.
Analyzed build logs with `BUILDKIT_PROGRESS=plain` for detailed output.
Tested builds with and without the BuildKit cache.
Reviewed recent changes to the Dockerfile and build context.
Root Cause:
A combination of issues was causing the cache corruption:
1. A Dockerfile used the `ADD` instruction with a URL that returned different content despite having the same ETag
2. BuildKit's content-addressable cache was correctly detecting the content change but not invalidating dependent layers
3. The CI/CD system was reusing the same builder instance across multiple projects
Fix/Workaround:
• Replaced `ADD` with a more deterministic approach:
# Before
ADD https://example.com/resource.tar.gz /tmp/
# After
RUN curl -fsSL https://example.com/resource.tar.gz -o /tmp/resource.tar.gz && \
echo "expected-sha256sum /tmp/resource.tar.gz" | sha256sum -c
• Implemented dedicated builder instances for each project:
# In CI/CD configuration (GitHub Actions example)
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
with:
driver: docker-container
driver-opts: |
network=host
image=moby/buildkit:latest
• Added cache invalidation between builds:
docker builder prune --filter until=0s --force
Lessons Learned:
BuildKit's advanced caching can be affected by non-deterministic inputs that aren't obvious.
How to Avoid:
Use `RUN` with checksums instead of `ADD` for remote resources.
Pin dependencies with specific versions and checksums.
Implement isolated builder instances for critical projects.
Add explicit cache invalidation for sensitive builds.
Use BuildKit's `--mount=type=cache` for more controlled caching.
Consider using Reproducible Builds practices for critical components.
No summary provided
What Happened:
Developers reported that a microservices application was behaving erratically in the development environment. Some services could communicate while others failed with connection timeouts or "host not found" errors.
Diagnosis Steps:
Examined Docker Compose logs for network-related errors.
Inspected container networks with `docker network inspect`.
Tested DNS resolution between containers using `docker exec <container> nslookup <service>`.
Reviewed the docker-compose.yml file for network configuration.
Found duplicate network aliases across different services.
Root Cause:
Multiple services in the docker-compose.yml file were using the same network alias, causing DNS resolution conflicts. When containers tried to connect to a service by name, they were randomly directed to one of the containers sharing that alias.
Fix/Workaround:
• Updated the docker-compose.yml file to use unique network aliases:
# Before
services:
api-v1:
networks:
app_net:
aliases:
- api
api-v2:
networks:
app_net:
aliases:
- api # Conflict!
# After
services:
api-v1:
networks:
app_net:
aliases:
- api-v1
api-v2:
networks:
app_net:
aliases:
- api-v2
# Add environment variable for backward compatibility
environment:
- SERVICE_NAME=api-v2
• Implemented a service discovery pattern using environment variables.
• Added a reverse proxy to route traffic based on path or headers.
Lessons Learned:
Docker Compose network aliases must be unique within a network to ensure reliable service discovery.
How to Avoid:
Use unique network aliases for each service.
Implement a naming convention that prevents collisions.
Consider using a service mesh or dedicated service discovery tool for complex applications.
Add automated validation of docker-compose.yml files in CI/CD.
Document network architecture and service discovery patterns.
No summary provided
What Happened:
A microservices application worked correctly in development but failed in production with configuration-related errors. The logs showed that environment variables were not being set correctly despite being defined in multiple places.
Diagnosis Steps:
Compared environment variables inside containers using `docker exec <container> env`.
Reviewed all sources of environment variables (docker-compose.yml, .env files, shell environment).
Tested variable precedence with simplified examples.
Traced the application's configuration loading process.
Root Cause:
The team misunderstood Docker Compose's environment variable precedence rules. Variables defined in the shell environment were overriding those in the .env file, which in turn overrode those in docker-compose.yml. This caused different behavior between environments where shell variables were set differently.
Fix/Workaround:
• Documented the correct precedence order:
1. Compose file
2. Shell environment variables
3. Environment file
4. Dockerfile
• Updated the docker-compose.yml to use explicit environment variables for critical settings:
services:
api:
environment:
# High-priority settings that shouldn't be overridden
- "DATABASE_URL=postgres://user:pass@db:5432/api"
# Variables that can be overridden
- "LOG_LEVEL=${LOG_LEVEL:-info}"
• Implemented environment-specific compose files:
# Development
docker compose -f docker-compose.yml -f docker-compose.dev.yml up
# Production
docker compose -f docker-compose.yml -f docker-compose.prod.yml up
Lessons Learned:
Docker Compose environment variable precedence is complex and can lead to subtle configuration issues.
How to Avoid:
Document environment variable sources and precedence.
Use explicit values for critical configuration in compose files.
Implement environment-specific override files.
Add configuration validation at startup.
Consider using a dedicated configuration management tool for complex applications.
Test with clean environments to catch precedence issues.
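A cheap guard against precedence surprises is to render the effective configuration for each environment and review (or diff) it in CI:
```bash
# Print the fully resolved configuration (variables interpolated, overrides merged)
# for the exact file combination used in each environment, and review it in CI.
docker compose -f docker-compose.yml -f docker-compose.prod.yml config
```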
No summary provided
What Happened:
Developers reported that a containerized application was failing with permission errors when trying to write to a directory mounted from the host. The same application worked correctly when run directly on the host.
Diagnosis Steps:
Examined container logs for specific permission denied errors.
Checked file permissions on the host directory with `ls -la`.
Compared user IDs inside and outside the container with `id`.
Verified SELinux/AppArmor status with `getenforce` and `aa-status`.
Tested with different mount options.
Root Cause:
The container was running as a non-root user (UID 1000) inside the container, but the mounted directory on the host was owned by root with permissions 755 (rwxr-xr-x). Additionally, SELinux was in enforcing mode, adding another layer of access control.
Fix/Workaround:
• Short-term: Changed ownership of the host directory to match the container's user:
# Find the user ID inside the container
docker exec <container> id
# Change ownership on the host
sudo chown -R 1000:1000 /path/to/mounted/directory
• Long-term: Updated the Dockerfile to use a consistent user ID and added proper volume configuration:
# Dockerfile
FROM node:18-alpine
# Create app user with explicit UID/GID
RUN addgroup -g 1099 appgroup && \
adduser -u 1099 -G appgroup -h /app -D appuser
USER appuser
WORKDIR /app
# docker-compose.yml
services:
app:
build: .
volumes:
- type: bind
source: ./data
target: /app/data
consistency: delegated
• For SELinux environments, added the proper context:
sudo chcon -Rt container_file_t /path/to/mounted/directory
Lessons Learned:
Container file permissions involve both traditional Unix permissions and additional security contexts.
How to Avoid:
Use consistent UIDs/GIDs between containers and host.
Document required permissions for mounted volumes.
Consider using named volumes instead of bind mounts for better isolation.
For SELinux environments, use the :Z or :z suffix for bind mounts.
Implement proper permission checks in application startup scripts.
Use Docker Compose's user directive to match host user when in development.
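For the development case in the last point, the same idea works from the command line; the paths and image name are placeholders:
```bash
# Run as the invoking host user so files created under ./data stay owned by you
# and the container can write to the bind mount without chown on the host.
docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$PWD/data:/app/data" \
  myorg/app:latest
# On SELinux hosts, append :Z to the mount target (.../app/data:Z) to relabel it.
```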
No summary provided
What Happened:
A production service was repeatedly crashing and restarting, causing downstream services to experience connection errors. The rapid restart cycle prevented the application from properly initializing, exacerbating the problem.
Diagnosis Steps:
Examined container status with `docker ps -a` and found a container with a high restart count.
Checked container logs with `docker logs <container>` to identify crash reasons.
Reviewed the restart policy configuration in docker-compose.yml and deployment scripts.
Analyzed application startup behavior and dependencies.
Root Cause:
The container was configured with a restart policy of "always" but had a dependency on a service that wasn't yet available. This created a rapid restart loop where the container would start, fail to connect to its dependency, crash, and immediately restart without any backoff.
Fix/Workaround:
• Modified the restart policy to include a delay:
# docker-compose.yml
services:
app:
restart: on-failure:5 # Limit to 5 restarts
# Or with Swarm/Kubernetes
deploy:
restart_policy:
condition: on-failure
delay: 10s
max_attempts: 5
window: 120s
• Implemented proper dependency checking in the application startup:
// In Go
func waitForDependencies(ctx context.Context) error {
    backoff := time.Second
    maxBackoff := time.Minute
    maxAttempts := 10
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err := checkDependencies(); err != nil {
            log.Printf("Dependencies not ready (attempt %d/%d): %v",
                attempt+1, maxAttempts, err)
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(backoff):
                // Exponential backoff, capped at maxBackoff
                backoff = time.Duration(float64(backoff) * 1.5)
                if backoff > maxBackoff {
                    backoff = maxBackoff
                }
                continue
            }
        }
        return nil
    }
    return errors.New("dependencies not available after maximum attempts")
}
• Added health checks to prevent premature dependency on the service.
Lessons Learned:
Container restart policies need careful configuration to prevent restart storms.
How to Avoid:
Use restart policies with appropriate delays and limits.
Implement graceful dependency checking with backoff in applications.
Add readiness probes/health checks to prevent premature service discovery.
Monitor container restart counts and alert on excessive restarts.
Consider using init containers (in Kubernetes) or entrypoint scripts to validate dependencies.
No summary provided
What Happened:
After migrating some production workloads to ARM64-based servers, containers failed to start with cryptic errors. The same containers worked fine on x86_64 servers.
Diagnosis Steps:
Examined container startup logs with `docker logs <container>`.
Inspected image details with `docker inspect <image>`.
Checked image manifest with `docker manifest inspect <image>`.
Verified architecture support with `docker buildx imagetools inspect <image>`.
Tested running the container with different runtime flags.
Root Cause:
The multi-architecture image was correctly built with both AMD64 and ARM64 variants, but the ARM64 variant contained native libraries that were compiled for the wrong ARM architecture variant (ARMv7 instead of ARMv8/ARM64).
Fix/Workaround:
• Updated the Dockerfile to use architecture-specific base images and build steps:
# syntax=docker/dockerfile:1.4
FROM --platform=$BUILDPLATFORM golang:1.20-alpine AS builder
ARG TARGETPLATFORM
ARG BUILDPLATFORM
RUN echo "Building on $BUILDPLATFORM for $TARGETPLATFORM"
WORKDIR /app
COPY . .
# Architecture-specific build steps
RUN case "$TARGETPLATFORM" in \
"linux/amd64") GOARCH=amd64 make build-amd64 ;; \
"linux/arm64") GOARCH=arm64 make build-arm64 ;; \
*) echo "Unsupported platform: $TARGETPLATFORM" && exit 1 ;; \
esac
# Use distroless for smaller, more secure images
FROM --platform=$TARGETPLATFORM gcr.io/distroless/static:nonroot
COPY --from=builder /app/bin/app /app
ENTRYPOINT ["/app"]
• Built the image with BuildKit's multi-architecture support:
docker buildx build --platform linux/amd64,linux/arm64 \
-t myorg/myapp:latest \
--push .
• Added architecture-specific testing in CI/CD:
# GitHub Actions example
jobs:
test-multiarch:
runs-on: ubuntu-latest
strategy:
matrix:
platform: [linux/amd64, linux/arm64]
steps:
- uses: actions/checkout@v3
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
- name: Build and test
run: |
docker buildx build --platform ${{ matrix.platform }} \
--load -t test-image .
docker run --rm --platform ${{ matrix.platform }} test-image ./run-tests.sh
Lessons Learned:
Multi-architecture images require testing on each target architecture.
How to Avoid:
Test multi-architecture images on all target platforms.
Use BuildKit's multi-platform capabilities for consistent builds.
Implement architecture-specific CI/CD testing.
Be cautious with native dependencies in multi-architecture images.
Document architecture-specific requirements and limitations.
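As a lightweight complement to the CI matrix above, a sketch of a per-platform smoke test run under QEMU emulation (it assumes the image accepts a `--version` flag; substitute any quick self-check your binary supports):

```bash
# Run each published architecture variant once; a wrong-architecture binary
# or mis-linked native library typically fails immediately here.
for platform in linux/amd64 linux/arm64; do
  docker run --rm --platform "$platform" myorg/myapp:latest --version
done
```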
No summary provided
What Happened:
A legacy application with a long history of incremental updates suddenly failed to build. The error message indicated that the maximum number of image layers had been exceeded.
Diagnosis Steps:
Checked container logs for error messages related to data access.
Inspected volume configurations with `docker volume ls` and `docker volume inspect`.
Reviewed the docker-compose.yml and Dockerfile for volume definitions.
Examined deployment scripts and found containers were being removed with the `-v` flag.
Verified network configurations using `docker network inspect`.
Tested connectivity between containers with `docker exec <container> ping <other-container>`.
Analyzed iptables rules and Docker network settings.
Discovered a custom network plugin that was incorrectly configured.
Verified registry status and connectivity.
Tested manual login with `docker login <registry>`.
Checked registry logs for authentication failures.
Reviewed recent changes to authentication configuration.
Found that the registry authentication certificate had expired.
Checked Docker daemon status with `systemctl status docker`.
Examined Docker daemon logs with `journalctl -u docker`.
Monitored system resources with `top`, `free`, and `df`.
Used `docker system df` to check Docker's resource usage.
Found excessive number of unused images and containers.
Verified file presence in the build context.
Tried building with the `--no-cache` flag and observed consistent success.
Examined Docker build cache with `docker builder prune -f` and observed improved reliability.
Analyzed layer cache in `/var/lib/docker/overlay2/`.
Reviewed security logs and found unusual process executions on the host.
Analyzed container configurations and discovered privileged mode was enabled.
Checked Docker version and found it had known CVEs related to container escapes.
Examined container runtime options and found dangerous capabilities granted.
Identified suspicious network connections using `netstat -tulpn`.
Used `ps aux` to find unusual processes consuming CPU resources.
Analyzed the Docker image layers with `docker history`.
Traced the malicious code to a dependency in a public base image.
Checked Docker logging configuration with `docker info | grep Logging`.
Verified logging driver settings in daemon.json.
Tested log generation with a test container.
Examined disk space and found the log partition was full.
Analyzed container CPU metrics using `docker stats`.
Checked for CPU throttling events with `docker inspect`.
Used `docker run --cpus=X` to test different CPU limit configurations.
Monitored CPU usage patterns with fine-grained metrics.
Tested connectivity with various packet sizes using ping with different payload sizes.
Analyzed network captures with tcpdump to observe packet fragmentation.
Compared MTU settings across different network paths.
Discovered MTU mismatch between overlay network and underlying physical network.
Examined the Dockerfile and found a multi-stage build configuration.
Checked the final stage and discovered it wasn't correctly copying artifacts from the build stage.
Verified the build stage was producing the expected artifacts.
Tested the build locally with `docker build --no-cache` to reproduce the issue.
Measured the build context size with `du -sh .`.
Examined the repository structure and found large binary assets and test data.
Checked for the presence of a .dockerignore file.
Tested build with various context sizes to identify the threshold.
Analyzed build times for each layer using `docker build --progress=plain`.
Compared the Dockerfile with previous versions to identify changes.
Examined the dependency installation step which was taking the most time.
Tested builds with and without the build cache.
Reviewed Docker Compose logs to identify the initial failing service.
Examined health check configurations in docker-compose.yml.
Analyzed the dependencies between services.
Tested the health check logic in isolation.
Used `docker history --no-trunc <image>` to examine layer commands.
Found API keys and database credentials passed as build arguments.
Checked the Dockerfile and found credentials directly in ENV and ARG instructions.
Reviewed access logs for the exposed services to identify unauthorized access.
Reviewed container configurations and found `/var/run/docker.sock` mounted in a Jenkins agent container.
Checked access logs and found suspicious container creations.
Analyzed audit logs to trace the attack path.
Verified which users and services had access to containers with the socket mounted.
Compared the deployed image with the expected version using `docker inspect`.
Checked image push history in the registry logs.
Reviewed CI/CD pipeline configurations across teams.
Found that two teams were using the same image name and tag convention.
Used `docker top <container>` to examine processes.
Found numerous zombie processes (marked with 'Z' status).
Examined the application code and found child processes being spawned but not properly reaped.
Checked the container's init system configuration.
Examined container logs for error messages.
Checked system resource usage with `docker stats`.
Used `docker exec <container> cat /proc/sys/kernel/pid_max` to check PID limits.
Monitored process count with `docker exec <container> ps -ef | wc -l`.
Conducted emergency vulnerability scanning of all production images.
Identified multiple critical CVEs in base images.
Reviewed the image update and scanning processes.
Checked when the vulnerable images were last rebuilt.
Examined container logs for specific error messages.
Tested the same containers in rootful mode to confirm the issue was rootless-specific.
Reviewed the application's resource requirements and permissions.
Checked Docker's rootless mode documentation for known limitations.
Monitored I/O performance using `iostat` and `docker stats`.
Checked the storage driver configuration with `docker info`.
Tested I/O performance directly on the host vs. inside containers.
Compared performance across different storage drivers.
Monitored system-wide resource usage with `top`, `iostat`, and `vmstat`.
Analyzed container resource usage with `docker stats`.
Correlated performance issues with activity from other containers.
Tested the application in isolation on a dedicated host.
Tested DNS resolution from inside containers using `nslookup` and `dig`.
Examined Docker daemon logs for DNS-related errors.
Checked `/etc/resolv.conf` inside containers and on the host.
Monitored DNS query patterns and timing.
Analyzed the Dockerfile and found numerous RUN, COPY, and ADD instructions.
Used `docker history <image>` to count the number of layers.
Compared with Docker's documented layer limits.
Reviewed the git history of the Dockerfile to see how it evolved.
Root Cause:
The Dockerfile had accumulated over 127 layers through years of incremental changes. Each developer had added new RUN commands without consolidating existing ones, eventually hitting Docker's layer limit.
Fix/Workaround:
• Restored data from the most recent backup.
• Modified docker-compose.yml to use named volumes instead of anonymous volumes:
volumes:
  app_data:
    name: app_data
• Updated deployment scripts to preserve volumes during container recreation.
• Added volume backup steps to the deployment process.
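Putting the pieces together, a sketch of how the named volume is attached to the service so data survives container recreation (the service name and mount path are illustrative):

```yaml
services:
  app:
    image: myorg/myapp:latest
    volumes:
      - app_data:/var/lib/app/data

volumes:
  app_data:
    name: app_data
```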
• Temporarily reverted to the default bridge driver with manual isolation.
• Reconfigured the network plugin with proper isolation rules.
• Implemented additional network policy enforcement with iptables.
• Added network security monitoring to detect cross-network traffic.
• Generated and installed a new TLS certificate for the registry.
• Temporarily allowed insecure registry access for critical pipelines:
"insecure-registries": ["registry.example.com:5000"]
• Updated Docker daemon configurations across all build agents.
• Restarted the registry service and verified authentication.
• Manually freed space by removing unused resources:
docker system prune -af --volumes
• Restarted the Docker daemon.
• Implemented proper resource cleanup in CI/CD jobs.
• Added monitoring for Docker resource usage.
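One possible shape for the recurring cleanup on build agents, a cron sketch (the schedule and retention window are assumptions; review what `-a` removes before adopting it, and prune volumes separately and deliberately):

```bash
# Nightly at 03:00: remove stopped containers and unused images/networks
# older than 24 hours, logging output for the monitoring side.
0 3 * * * docker system prune -af --filter "until=24h" >> /var/log/docker-prune.log 2>&1
```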
• Cleared the Docker build cache:
docker builder prune -a -f
• Implemented more graceful shutdown procedures for the Docker daemon.
• Added build cache validation steps to CI/CD pipelines.
• Used BuildKit for more reliable caching behavior.
• Immediately stopped and removed the compromised containers.
• Updated Docker to the latest version with security patches.
• Removed privileged mode and unnecessary capabilities from all containers.
• Implemented proper volume mounting with read-only access where possible.
• Added AppArmor/SELinux profiles for additional container isolation.
• Immediately removed and replaced the compromised containers.
• Built custom images from verified base images.
• Implemented image scanning in the CI/CD pipeline.
• Created an internal registry with vetted images only.
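For the vetted-image registry, a sketch of pinning a base image by digest in a Dockerfile (the registry host, tag, and digest placeholder are illustrative):

```dockerfile
# The digest makes the base image immutable even if the tag is later re-pushed.
FROM registry.example.com/base/alpine:3.18@sha256:<digest>
```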
• Freed up disk space by removing old log files.
• Configured log rotation and size limits:
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
}
• Restarted the Docker daemon to apply changes.
• Implemented monitoring for log disk usage.
• Increased CPU limits to accommodate burst workloads.
• Implemented CPU shares instead of hard limits for better resource sharing.
• Optimized the application to handle CPU throttling more gracefully.
• Added detailed monitoring for CPU throttling events.
• Adjusted the overlay network MTU to match the lowest MTU in the path:
docker network create --driver overlay --opt com.docker.network.driver.mtu=1400 my-network
• Updated existing networks by recreating them with the correct MTU.
• Implemented path MTU discovery monitoring.
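A quick way to confirm the usable MTU along a path is a sketch using Linux `ping` with the don't-fragment bit set (the peer hostname is illustrative; 1372 bytes of ICMP payload plus 28 bytes of headers exercises a 1400-byte MTU):

```bash
# Succeeds only if 1400-byte packets traverse the path unfragmented.
ping -c 3 -M do -s 1372 peer.example.com
```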
• Corrected the COPY instruction in the Dockerfile:
# Before
COPY --from=builder /go/bin/app /app
# After
COPY --from=builder /go/src/app/bin/app /app
• Added a verification step in the build process to check for the presence of critical files.
• Implemented a simple healthcheck in the container to validate the application's presence.
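A sketch of the verification step, added in the builder stage so a wrong artifact path fails the build immediately instead of producing an image without the binary (the path matches the corrected COPY above):

```dockerfile
# In the builder stage, after the build step:
RUN test -x /go/src/app/bin/app
```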
• Created a comprehensive .dockerignore file:
**/.git
**/node_modules
**/*.log
**/test-data
**/*.mp4
**/*.zip
**/dist
• Moved large binary assets to external storage.
• Restructured the repository to separate application code from data.
• Increased the Docker daemon timeout for larger builds.
• Reordered Dockerfile instructions to optimize caching:
# Before
COPY package.json .
WORKDIR /app
RUN npm install
# After
WORKDIR /app
COPY package.json .
RUN npm install
• Split the dependency installation into multiple layers for better granularity.
• Implemented a dependency cache volume in CI/CD pipelines.
• Adjusted health check parameters to be more tolerant:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 60s
• Implemented circuit breaker patterns in service dependencies.
• Added graceful degradation for non-critical service failures.
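One way to keep a flapping dependency from cascading is to gate startup on the dependency's health check; a Compose sketch (service names and the `pg_isready` probe are illustrative):

```yaml
services:
  api:
    depends_on:
      db:
        condition: service_healthy
  db:
    image: postgres:15
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
```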
• Immediately rotated all exposed credentials.
• Implemented Docker secrets for runtime secret management.
• Used multi-stage builds to prevent secrets from appearing in the final image.
• Added secret scanning to the CI/CD pipeline.
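A sketch of passing a secret at build time without it landing in any layer (the source path is an assumption; it pairs with a `RUN --mount=type=secret,id=api_key ...` step in the Dockerfile and requires BuildKit):

```bash
# BuildKit mounts the secret only for the duration of the RUN step,
# so it never appears in `docker history`.
DOCKER_BUILDKIT=1 docker build \
  --secret id=api_key,src="$HOME/.secrets/api_key" \
  -t myorg/myapp:latest .
```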
• Immediately removed the Docker socket mount from all containers.
• Implemented a more secure Docker-in-Docker solution using Docker-outside-of-Docker (DooD) pattern.
• Added socket access controls by placing an authorizing proxy in front of the Docker socket, exposing only the specific API endpoints the CI jobs need.
• Implemented least privilege principles for CI/CD containers.
• Implemented unique namespacing for each team's images:
registry.example.com/team-a/service-name:v1.2.0
registry.example.com/team-b/service-name:v1.2.0
• Added registry configuration to prevent tag overwrites.
• Implemented SHA digest pinning for critical deployments.
• Ran an incident response process to restore the correct image version.
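A sketch of what digest pinning looks like in a deployment manifest (the digest placeholder stands in for the real SHA-256 reported by the registry):

```yaml
services:
  service-name:
    # A digest reference cannot be silently re-pointed the way a tag can.
    image: registry.example.com/team-a/service-name@sha256:<digest>
```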
• Modified the application to properly handle child process termination.
• Implemented a proper init system in the container:
FROM node:16
# Add tini as a minimal init (PID 1) to reap child processes
RUN apt-get update && apt-get install -y --no-install-recommends tini \
    && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["node", "app.js"]
• Added regular container restarts as a temporary measure.
• Implemented monitoring for zombie process accumulation.
• Increased the PID limit for the container:
docker run --pids-limit=10000 <image>
• Optimized the application to use fewer threads.
• Implemented thread pooling with fixed maximum sizes.
• Added monitoring for process/thread count.
• Immediately rebuilt all images with the latest base images.
• Implemented automated weekly rebuilds of all images regardless of code changes.
• Added vulnerability scanning with blocking thresholds in CI/CD.
• Deployed runtime vulnerability scanning and container security monitoring.
• Reconfigured applications to use non-privileged ports (>1024).
• Implemented a reverse proxy for services that needed standard ports.
• Modified container networking to use user-defined networks exclusively.
• Updated application code to handle rootless constraints.
• Switched to using volume mounts for database data:
volumes:
  - /var/lib/postgresql/data:/var/lib/postgresql/data
• Considered alternative storage drivers like devicemapper in direct-lvm mode.
• Optimized the database configuration for containerized environments.
• Implemented I/O monitoring with alerts for performance degradation.
• Implemented I/O limits using the blkio cgroup:
docker run --device-write-bps /dev/sda:10mb <image>
• Separated critical workloads to dedicated hosts.
• Used the `--cpu-shares` flag to prioritize important containers.
• Implemented network traffic shaping for better isolation.
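A sketch combining the per-device I/O caps and CPU weighting mentioned above into a single `docker run` invocation (the device path, rates, and share value are illustrative):

```bash
docker run -d \
  --device-read-bps /dev/sda:20mb \
  --device-write-bps /dev/sda:10mb \
  --cpu-shares 2048 \
  myorg/critical-service:latest
```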
• Configured explicit DNS servers for the Docker daemon:
{
  "dns": ["8.8.8.8", "8.8.4.4"]
}
• Increased the DNS timeout settings.
• Implemented DNS caching at the host level.
• Added retry logic for DNS resolution in application code.
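Where only some services need the override, DNS can also be set per service in Compose instead of daemon-wide, for example (resolver addresses are the same public ones used above):

```yaml
services:
  app:
    dns:
      - 8.8.8.8
      - 8.8.4.4
```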
• Consolidated multiple RUN commands using && and \ for line continuation:
# Before
RUN apt-get update
RUN apt-get install -y package1
RUN apt-get install -y package2
# After
RUN apt-get update && \
    apt-get install -y package1 package2
• Combined multiple COPY operations.
• Implemented multi-stage builds to reduce final image layers.
• Added Dockerfile linting to the CI/CD pipeline.
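For the linting step, one common option is hadolint run as a container; a sketch of a CI invocation (pin the linter version in real pipelines):

```bash
# Fails the job if the Dockerfile violates hadolint rules, e.g. un-pinned packages
# or multiple consecutive RUN instructions that could be consolidated.
docker run --rm -i hadolint/hadolint < Dockerfile
```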
Lessons Learned:
Docker images have a hard limit on the number of layers, requiring careful Dockerfile design.
How to Avoid:
Document all security tools and their effects on networking.
Test container connectivity after system updates.
Use Docker's user-defined networks instead of the default bridge for better isolation and DNS resolution.
Scenario Summary: Deployments failing due to excessively large Docker images taking too long to pull.
Use multi-stage builds for compiled applications.
Implement and maintain a proper .dockerignore file.
Audit image sizes as part of CI/CD pipeline.
Use tools like DockerSlim or dive to analyze and optimize images.
Scenario Summary: Java application repeatedly crashing with OOMKilled errors despite having memory limits set.
Use container-aware JVM versions (Java 10+).
Always set explicit memory limits for JVM applications.
Test applications under memory pressure before production deployment.
Monitor container memory usage patterns to set appropriate limits.
Scenario Summary: Critical application data lost when containers were recreated during deployment.
Always use named volumes for persistent data.
Never use the `-v` flag with `docker rm` unless data loss is acceptable.
Include volume backup in deployment procedures.
Document volume architecture and persistence requirements.
Test data persistence during container recreation in staging environments.
Scenario Summary: Security incident where containers from different tenants could communicate despite network isolation.
Regularly test network isolation with security scans.
Implement defense in depth with multiple isolation mechanisms.
Validate custom network plugins thoroughly before deployment.
Use overlay networks with encryption for sensitive multi-tenant environments.
Consider using Kubernetes NetworkPolicy or service mesh for more robust isolation.
Scenario Summary: All builds failing due to inability to pull images from private Docker registry.
Implement certificate expiration monitoring and alerting.
Use automated certificate renewal with tools like cert-manager or Let's Encrypt.
Document certificate renewal procedures.
Maintain a backup registry or mirror for critical images.
Test registry authentication regularly as part of infrastructure validation.
Scenario Summary: Docker daemon became unresponsive, causing all container operations to hang.
Implement regular automated cleanup of unused Docker resources.
Add monitoring for Docker disk usage with alerting thresholds.
Configure build jobs to clean up after themselves.
Use separate volumes for Docker data to isolate from system partitions.
Consider using container image garbage collection policies.
Scenario Summary: Intermittent build failures with mysterious "file not found" errors despite files being present in the build context.
Periodically clear build cache in CI/CD environments.
Use BuildKit for improved cache handling.
Implement proper daemon shutdown procedures.
Add cache validation steps before critical builds.
Consider using remote build cache for consistent behavior across environments.
Scenario Summary: Security breach where an attacker escaped container isolation and gained access to the host system.
Never run containers in privileged mode unless absolutely necessary.
Keep Docker and container runtimes updated with security patches.
Use security scanning tools to detect container vulnerabilities.
Implement defense in depth with multiple security layers.
Follow the principle of least privilege for container capabilities.
Scenario Summary: Production system compromised through a malicious package in a public Docker image.
Use official images or build your own from scratch.
Implement image scanning for vulnerabilities and malware.
Pin image versions with SHA digests, not just tags.
Maintain an internal registry of verified images.
Regularly audit and update base images.
Scenario Summary: Container logs disappeared, hampering incident investigation and compliance requirements.
Always configure log rotation and size limits.
Monitor log storage usage.
Consider using centralized logging solutions like fluentd or logstash.
Test logging configuration under load.
Implement alerts for logging failures.
Scenario Summary: Application performance degraded significantly despite low average CPU utilization.
Size container CPU limits based on peak usage, not averages.
Use CPU shares for flexible resource allocation in multi-tenant environments.
Monitor for CPU throttling events, not just utilization.
Test application performance under CPU constraints.
Consider using horizontal scaling instead of vertical for handling load variations.
Scenario Summary: Intermittent network timeouts between containers in different data centers.
Always check MTU settings when spanning networks across different environments.
Test network connectivity with various packet sizes.
Document network MTU requirements for multi-datacenter deployments.
Consider using TCP MSS clamping as an alternative solution.
Implement monitoring for packet fragmentation and MTU-related issues.
Scenario Summary: Production deployment failed due to missing artifacts in multi-stage Docker build.
Test Docker builds with `--no-cache` to ensure reproducibility.
Add validation steps to verify the presence of critical artifacts.
Use consistent path conventions across build stages.
Implement container healthchecks to catch missing executables early.
Consider using Docker BuildKit which provides better error messages for multi-stage builds.
Scenario Summary: Docker builds failing with "context deadline exceeded" errors due to excessive build context size.
Always use a .dockerignore file, especially in large repositories.
Regularly audit repository size and content.
Store large binary assets outside the main repository.
Consider using multi-repo architecture for very large projects.
Use Docker BuildKit which has better handling of large contexts.
Scenario Summary: Build times increased dramatically despite layer caching, causing deployment delays.
Order Dockerfile instructions from least to most frequently changing.
Copy only necessary files for each step.
Use multi-stage builds to separate build dependencies from runtime.
Monitor build times and investigate sudden increases.
Consider using BuildKit's improved caching mechanisms.
Scenario Summary: Entire application stack crashed due to cascading failures triggered by a single service's health check.
Configure health checks with appropriate thresholds for the service's characteristics.
Implement start_period to allow services time to initialize.
Design services to degrade gracefully when dependencies are unavailable.
Use circuit breakers to prevent cascading failures.
Test failure scenarios as part of regular system validation.
Scenario Summary: Sensitive credentials exposed in Docker image history, leading to a security breach.
Never use build arguments (ARG) for secrets.
Use Docker BuildKit's secret mounting:
```dockerfile
RUN --mount=type=secret,id=api_key cat /run/secrets/api_key
```
For older Docker versions, use multi-stage builds with temporary files.
Implement secret scanning in CI/CD pipelines.
Use runtime secret injection rather than build-time secrets.
Scenario Summary: Security breach due to Docker socket being mounted in a container, allowing privilege escalation.
Never mount the Docker socket in containers unless absolutely necessary.
If the socket must be exposed, front it with an access-controlled socket proxy that allows only the specific API endpoints required.
Implement container security scanning to detect socket mounts.
Use rootless Docker or Podman for safer container operations.
Consider alternatives like Kaniko for building containers without Docker socket access.
Scenario Summary: Production deployment used the wrong image version due to tag collision in the shared registry.
Use namespaces to separate images by team or application.
Pin deployments to image digests (SHA256) instead of tags.
Configure registries to prevent tag overwrites.
Implement image signing and verification for critical applications.
Use semantic versioning consistently across all teams.
Scenario Summary: Container performance degraded over time due to accumulation of zombie processes.
Use a lightweight init system like tini or dumb-init in containers.
Properly handle child processes in application code.
Implement regular health checks that detect zombie processes.
Consider using Docker's `--init` flag when running containers.
Design applications to avoid spawning unnecessary child processes.
Scenario Summary: Application crashed with "cannot fork" errors due to hitting PID limits in the container.
Set appropriate PID limits based on application requirements.
Monitor thread/process creation patterns in containerized applications.
Use thread pools with reasonable maximum sizes.
Consider container orchestration platforms that allow PID limit configuration.
Test applications under load to identify potential resource constraints.
Scenario Summary: Critical vulnerability exploited in production due to inadequate image scanning and update processes.
Implement automated regular rebuilds of all images.
Use vulnerability scanning in CI/CD with appropriate blocking thresholds.
Subscribe to security advisories for base images.
Implement runtime container security monitoring.
Consider using minimal or distroless base images to reduce the attack surface.
Scenario Summary: Application deployment failed in rootless Docker environment due to unexpected limitations.
Test applications thoroughly in rootless environments before migration.
Design applications to work without privileged capabilities.
Document rootless mode limitations and workarounds.
Consider using Podman which has better rootless support for some use cases.
Implement proper port mapping and reverse proxies for services requiring privileged ports.
Scenario Summary: Container I/O performance degraded significantly, causing application timeouts and failures.
Use volume mounts for I/O-intensive data instead of storing in container layers.
Benchmark different storage drivers for your specific workload.
Consider using host volumes for databases and other high I/O applications.
Monitor I/O performance regularly to detect degradation.
Document storage driver choices and their performance characteristics.
Scenario Summary: Critical service degraded due to noisy neighbor containers consuming shared resources.
Implement comprehensive resource limits including I/O and network.
Use dedicated nodes for critical workloads.
Consider using Kubernetes with resource quotas and quality of service classes.
Monitor for resource contention across all subsystems.
Design applications to be resilient to resource variability.
Scenario Summary: Containers intermittently failed to resolve external domain names, causing application errors.
Configure explicit, reliable DNS servers for Docker.
Implement proper DNS caching.
Monitor DNS resolution performance and failures.
Consider using CoreDNS or other advanced DNS solutions for large deployments.
Add resilience to DNS failures in application code.
Scenario Summary: Build failed with "max depth exceeded" error due to excessive image layers.
Consolidate commands in Dockerfiles to minimize layer count.
Use multi-stage builds to reset layer count in the final image.
Implement Dockerfile linting and best practices in CI/CD.
Regularly audit and refactor Dockerfiles for long-lived projects.
Document layer usage and limits in development guidelines.
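To illustrate the layer-reset point, a minimal multi-stage sketch (a Node.js app is assumed purely for illustration): only the final stage's handful of layers ship, no matter how many RUN steps accumulate in the build stage.

```dockerfile
FROM node:16 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build   # any number of build-stage RUN steps stay out of the final image

FROM node:16-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
CMD ["node", "dist/app.js"]
```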