# Linux Administration Scenarios
Summary: The root filesystem filled to 100% with uncompressed rotated logs, leaving several production servers unresponsive.
What Happened:
Several production servers became unresponsive. Users reported application errors, and SSH access was intermittent. Investigation revealed that the root filesystem was 100% full.
Diagnosis Steps:
Attempted to SSH into affected servers with difficulty.
Ran df -h to check filesystem usage.
Used du -sh /* to identify large directories.
Found /var/log consuming most of the space with rotated but not compressed logs.
Root Cause:
Log rotation was configured to keep too many old logs, and compression was failing. Additionally, verbose debug logging had been enabled during troubleshooting and never disabled.
Fix/Workaround:
• Freed space immediately by removing old logs:
find /var/log -name "*.log.*" -mtime +7 -delete
• Disabled verbose logging in application configuration.
• Fixed the logrotate configuration to compress and expire old logs properly (see the example configuration after this list).
• Restarted affected services.
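A minimal logrotate policy along these lines keeps retention bounded and ensures old logs are compressed; the path, pattern, and values are illustrative rather than the actual production settings:
# /etc/logrotate.d/app (illustrative path and values)
/var/log/app/*.log {
    # rotate daily and keep one week of logs
    daily
    rotate 7
    # compress rotated logs; keep the most recent rotation uncompressed for quick inspection
    compress
    delaycompress
    missingok
    notifempty
}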
Lessons Learned:
Filesystem monitoring is critical, and log management needs careful attention.
How to Avoid:
Implement monitoring and alerting for filesystem usage (warn at 80%, critical at 90%); a simple cron-based check is sketched after this list.
Configure proper log rotation with compression and reasonable retention.
Separate logs to a dedicated partition or volume.
Implement centralized logging to reduce local storage requirements.
Create automated cleanup jobs for temporary files.
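Where a full monitoring stack is not yet in place, a simple cron-driven check along these lines can provide the 80%/90% alerting; the thresholds, excluded filesystem types, and recipient address are assumptions:
#!/bin/bash
# /usr/local/bin/check-fs-usage.sh (illustrative script)
WARN=80
CRIT=90
df -P -x tmpfs -x devtmpfs | awk 'NR>1 {gsub("%","",$5); print $5, $6}' | while read -r used mount; do
    if [ "$used" -ge "$CRIT" ]; then
        echo "CRITICAL: $mount is ${used}% full" | mail -s "Filesystem usage critical on $(hostname)" admin@example.com
    elif [ "$used" -ge "$WARN" ]; then
        echo "WARNING: $mount is ${used}% full" | mail -s "Filesystem usage warning on $(hostname)" admin@example.com
    fi
done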
Summary: A kernel security update conflicted with a custom storage driver, leaving a production database server stuck in a kernel-panic reboot loop.
What Happened:
After applying monthly security patches, a production database server failed to boot properly, experiencing kernel panic during startup. The system was stuck in a reboot loop.
Diagnosis Steps:
Accessed the server via out-of-band management console.
Reviewed kernel panic messages in the console output.
Booted into an older kernel version using GRUB menu.
Checked /var/log/yum.log to identify recently updated packages.
Root Cause:
A kernel security update conflicted with a custom storage driver used for database optimization. The driver wasn't compatible with the new kernel version.
Fix/Workaround:
• Booted using the previous kernel version by selecting it in GRUB (a sketch for making it the temporary default follows this list).
• Pinned the kernel package to prevent automatic updates:
echo "exclude=kernel*" >> /etc/yum.conf
• Contacted the storage driver vendor for an updated driver compatible with the new kernel.
• Tested the updated driver in a staging environment before applying to production.
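To avoid having to pick the older kernel by hand on every reboot, it can be made the GRUB default until the vendor driver is fixed. A sketch using grubby on a RHEL/CentOS-style system; the kernel version string is illustrative:
# List installed kernels and their boot entries
grubby --info=ALL | grep -E '^(index|kernel)'
# Make the known-good kernel the default until a compatible driver is available
grubby --set-default /boot/vmlinuz-3.10.0-1160.88.1.el7.x86_64
# Confirm the change
grubby --default-kernel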
Lessons Learned:
Kernel updates require careful testing, especially with custom drivers or modules.
How to Avoid:
Maintain a test environment that mirrors production configurations.
Test all kernel updates in staging before production deployment.
Keep a record of custom kernel modules and drivers.
Configure GRUB and the package manager to keep multiple kernel versions for fallback (see the sketch after this list).
Implement a gradual rollout strategy for kernel updates.
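On yum/dnf-based systems the number of kernels retained across updates is controlled by installonly_limit; keeping several installed guarantees a fallback GRUB entry. A minimal sketch (the exact count is a judgment call):
# /etc/yum.conf (or /etc/dnf/dnf.conf on newer systems)
# Keep the three most recent kernels installed so GRUB always offers a fallback
installonly_limit=3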
Summary: An application exhausted the filesystem's inodes with uncleaned temporary files, causing "No space left on device" errors despite plenty of free disk space.
What Happened:
A production web server started returning 500 errors for some requests. Initial investigation showed plenty of free disk space, but the application logs showed file creation errors.
Diagnosis Steps:
Checked disk space with df -h, which showed sufficient free space.
Examined application logs showing "No space left on device" errors.
Checked inode usage with df -i, revealing 100% inode usage.
Used find / -xdev -type f -printf '%h\n' | sort | uniq -c | sort -n to identify the directories containing the most files.
Discovered thousands of small temporary files in the application's cache directory.
Root Cause:
The application was creating many small temporary files in its cache directory but not cleaning them up properly. Over time, this exhausted all available inodes on the filesystem, even though plenty of disk space remained.
Fix/Workaround:
• Short-term: Manually cleaned up unnecessary temporary files:
# Find and remove old temporary files
find /var/www/app/cache -type f -name "temp_*" -mtime +7 -delete
• Implemented a proper cleanup script and added it to cron:
#!/bin/bash
# /usr/local/bin/cleanup-temp-files.sh
APP_DIR="/var/www/app"
LOG_FILE="/var/log/cleanup-temp.log"
echo "$(date): Starting cleanup" >> $LOG_FILE
# Clean files older than 1 day
find $APP_DIR/cache -type f -name "temp_*" -mtime +1 -delete
find $APP_DIR/logs -type f -name "*.log" -mtime +30 -delete
# Count remaining files
REMAINING=$(find $APP_DIR/cache -type f | wc -l)
echo "$(date): Cleanup complete. $REMAINING files remain in cache." >> $LOG_FILE
# Check inode usage
INODE_USAGE=$(df -i / | awk 'NR==2 {print $5}')
echo "$(date): Current inode usage: $INODE_USAGE" >> $LOG_FILE
# Alert if inode usage is still high
if [[ ${INODE_USAGE%\%} -gt 80 ]]; then
echo "WARNING: Inode usage is still high at $INODE_USAGE" | mail -s "High Inode Usage Alert" admin@example.com
fi
• Scheduled the script in root's crontab (crontab -e) to run nightly at 02:00:
0 2 * * * /usr/local/bin/cleanup-temp-files.sh
• Modified application code to use a proper temporary file management approach.
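The exact change depends on the application's language and framework, but the underlying idea is the same as this shell sketch: create temporary files with mktemp under a dedicated workspace and remove them on exit (paths are illustrative):
#!/bin/bash
# Sketch: self-cleaning temporary workspace (illustrative path)
WORKDIR=$(mktemp -d /var/www/app/cache/job.XXXXXX)
# Remove everything this job created, even if it exits early or fails
trap 'rm -rf "$WORKDIR"' EXIT
# ... write temporary files under "$WORKDIR" instead of scattering temp_* files ...
echo "working data" > "$WORKDIR/temp_example"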
Lessons Learned:
Disk space monitoring must include inode usage, not just space usage.
How to Avoid:
Monitor both disk space and inode usage.
Implement proper temporary file cleanup in applications.
Use appropriate filesystem types based on expected file patterns.
Consider using tmpfs for temporary files where appropriate (an example mount is shown after this list).
Implement log rotation with proper file count limits.
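As an illustration of the tmpfs option, the cache directory can be mounted in memory so short-lived files never consume blocks or inodes on the root filesystem; the path, size, and inode count are assumptions, and anything stored there is lost on reboot:
# /etc/fstab (illustrative entry)
tmpfs   /var/www/app/cache   tmpfs   size=512m,nr_inodes=100k,noatime   0 0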
Summary: The Linux OOM killer repeatedly terminated a production database during peak load because system and database memory settings were misaligned.
What Happened:
A production database server was experiencing random restarts during peak load periods. The database logs showed abrupt terminations without proper shutdown sequences, and system logs revealed that the Linux OOM killer was terminating the database process.
Diagnosis Steps:
Examined system logs with journalctl -k | grep -i "out of memory".
Checked memory usage patterns with free -m and vmstat.
Analyzed process memory consumption with ps aux --sort=-%mem.
Reviewed the OOM killer score for processes with cat /proc/<pid>/oom_score.
Monitored memory allocation with dmesg -w during peak loads.
Root Cause:
The database was configured with memory settings that didn't account for how Linux manages memory. The database was allocating large amounts of memory for its buffer pool, but not actually using it immediately. This caused the kernel to perceive memory pressure and trigger the OOM killer, which selected the database process due to its high memory usage.
Fix/Workaround:
• Short-term: Adjusted the OOM killer score to protect the database process:
# Create a systemd override file
mkdir -p /etc/systemd/system/postgresql.service.d/
cat > /etc/systemd/system/postgresql.service.d/oom.conf << EOF
[Service]
OOMScoreAdjust=-1000
EOF
# Reload systemd and restart the service
systemctl daemon-reload
systemctl restart postgresql
• Long-term: Properly tuned the system and database memory settings:
# Update kernel memory management parameters in /etc/sysctl.conf
vm.swappiness = 10
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
# Apply changes
sysctl -p
# Update PostgreSQL memory settings in postgresql.conf
shared_buffers = '8GB' # Reduced from 16GB
work_mem = '64MB' # Reduced from 128MB
maintenance_work_mem = '1GB' # Reduced from 2GB
• Implemented proper monitoring for memory usage and OOM events.
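For the OOM-event side of that monitoring, a small cron-friendly check can watch the kernel log; a minimal sketch with a placeholder recipient:
#!/bin/bash
# /usr/local/bin/check-oom-events.sh (illustrative)
# Alert if the kernel logged any OOM-killer activity in the last hour
EVENTS=$(journalctl -k --since "1 hour ago" | grep -i "out of memory")
if [ -n "$EVENTS" ]; then
    echo "$EVENTS" | tail -n 5 | mail -s "OOM killer activity on $(hostname)" admin@example.com
fi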
Lessons Learned:
Linux memory management and application memory usage patterns must be carefully aligned.
How to Avoid:
Understand how the Linux OOM killer selects processes.
Configure appropriate OOM score adjustments for critical services.
Tune application memory settings based on actual system resources.
Implement memory usage monitoring with alerts.
Consider using cgroups to enforce memory limits for applications.
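With systemd-managed services, cgroup memory limits can be set through the same drop-in mechanism used above for the OOM score adjustment; the unit name and limit values here are illustrative:
# /etc/systemd/system/example-app.service.d/memory.conf (hypothetical unit)
[Service]
# Soft limit: the kernel reclaims and throttles the service's memory above this point
MemoryHigh=3G
# Hard cgroup limit: the service is OOM-killed within its own cgroup if this is exceeded
MemoryMax=4G
# Apply with: systemctl daemon-reload && systemctl restart example-app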
Summary: Disk I/O contention caused by a poorly scheduled backup job degraded application performance even though CPU and memory utilization looked moderate.
What Happened:
Users reported significant slowdowns across multiple applications hosted on a Linux server. Initial monitoring showed CPU and memory usage at only 60-70%, leading to confusion about the cause. The operations team observed that the system was becoming increasingly unresponsive, with simple commands taking minutes to complete. The issue worsened during peak hours and temporarily improved during low-traffic periods.
Diagnosis Steps:
Analyzed system load averages and process states.
Examined disk I/O statistics and wait times.
Reviewed process resource consumption patterns.
Checked for resource limits and cgroup configurations.
Monitored system calls and kernel wait states.
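Representative commands for these steps (an illustrative set rather than a transcript of the actual session; the device name is an assumption):
uptime                                      # load averages relative to CPU count
vmstat 5                                    # blocked processes (b column) and I/O wait (wa)
iostat -x 5                                 # per-device utilization, await, and queue depth
iotop -oPa                                  # which processes are generating the I/O
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'   # tasks stuck in uninterruptible I/O sleep
cat /sys/block/sda/queue/scheduler          # currently selected I/O scheduler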
Root Cause:
The investigation revealed multiple resource contention issues:
1. A backup process was causing excessive disk I/O, creating contention.
2. Several processes were competing for the same disk resources.
3. The I/O scheduler was configured suboptimally for the workload.
4. No I/O limits were set for background processes.
5. The filesystem was heavily fragmented, exacerbating I/O issues.
Fix/Workaround:
• Implemented immediate fixes to restore performance.
• Rescheduled backup processes to low-traffic periods.
• Configured I/O limits for background processes (see the sketch after this list).
• Adjusted the I/O scheduler to better handle mixed workloads.
• Planned filesystem defragmentation during a maintenance window.
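A sketch of the kind of I/O limiting and scheduler changes involved; the device, unit, and job names are assumptions:
# Run the backup job at the idle I/O class so it yields to foreground I/O
# (ionice classes are honoured by schedulers such as bfq, selected below)
ionice -c 3 nice -n 19 /usr/local/bin/backup-job.sh
# For a systemd-managed backup service, lowering its I/O weight (cgroup v2) has a similar effect:
#   [Service]
#   IOWeight=10
# Check and, if the module is available, switch the device's I/O scheduler; persist the choice with a udev rule
cat /sys/block/sda/queue/scheduler
echo bfq > /sys/block/sda/queue/scheduler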
Lessons Learned:
System performance issues can be caused by resource contention even when overall utilization appears moderate.
How to Avoid:
Implement proper I/O scheduling and limits for all processes.
Monitor disk I/O statistics alongside CPU and memory.
Schedule resource-intensive background tasks during low-traffic periods.
Regularly defragment filesystems and optimize storage performance.
Use cgroups to manage resource allocation between applications.