# Configuration Management Scenarios

Summary: A new feature failed only in production because undocumented manual changes had caused configuration drift between environments.

What Happened:
A new feature passed all tests in development and staging but failed in production. Investigation revealed significant differences in server configurations between environments.
Diagnosis Steps:
Compared environment configurations using Ansible fact gathering (a sketch of this step follows this list).
Created a diff of configuration files across environments.
Reviewed recent manual changes to production servers.
Checked Ansible playbook execution history.
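As a rough illustration of the comparison step, a small playbook can dump each host's facts and fetch its application configuration so the results can be diffed on the control node. This is a sketch only; the playbook name, the /etc/app/config.yml path, and the output locations are assumptions rather than details from the actual investigation.

# playbooks/compare_environments.yml (illustrative)
---
- name: Collect facts and app config for offline comparison
  hosts: all
  gather_facts: true
  tasks:
    - name: Save gathered facts per host on the control node
      ansible.builtin.copy:
        content: "{{ ansible_facts | to_nice_json }}"
        dest: "/tmp/facts-{{ inventory_hostname }}.json"
      delegate_to: localhost

    - name: Fetch the application config file from each host
      ansible.builtin.fetch:
        src: /etc/app/config.yml
        dest: /tmp/configs/

The resulting files can then be compared with an ordinary diff tool to spot environment differences.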
Root Cause:
Production servers had been modified manually during incident response. These changes weren't documented or added to configuration management, causing drift between environments.
Fix/Workaround:
• Documented all configuration differences between environments.
• Updated Ansible playbooks to reflect the current state of production.
• Applied standardized configurations across all environments.
• Implemented configuration validation checks.
Lessons Learned:
Manual changes outside of configuration management lead to environment inconsistency and unpredictable behavior.
How to Avoid:
Implement immutable infrastructure where possible.
Use configuration management for all environment changes.
Add configuration drift detection to monitoring (a minimal sketch follows this list).
Create break-glass procedures that include documenting and incorporating emergency changes.
Regularly validate environment consistency with automated tests.
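One way to implement the drift detection mentioned above is to render the managed templates in check mode on a schedule and treat any reported change as drift. This is a minimal sketch, assuming the template and destination paths used elsewhere in this document; adapt the paths and alerting mechanism to the real environment.

# playbooks/detect_drift.yml (illustrative)
---
- name: Detect configuration drift
  hosts: all
  tasks:
    - name: Render the managed config without applying it
      ansible.builtin.template:
        src: templates/config.yml.j2
        dest: /etc/app/config.yml
      check_mode: true
      diff: true
      register: config_check

    - name: Fail the run so monitoring can alert on drift
      ansible.builtin.fail:
        msg: "Configuration drift detected on {{ inventory_hostname }}"
      when: config_check is changed

Running this from cron or a CI job and alerting on a non-zero exit code turns drift into a routine monitoring signal.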
Summary: Manual configuration changes on production servers were overwritten by an automated deployment, leaving the application with incorrect connection parameters and causing service failures.
What Happened:
After a successful deployment to development and staging environments, the same application deployment to production resulted in service failures. The application logs showed connection timeouts to dependent services despite all connectivity tests passing.
Diagnosis Steps:
Compared application configurations across environments.
Reviewed Ansible playbooks and inventory files.
Examined environment-specific variables and templates.
Checked for manual changes on production servers.
Analyzed application logs for specific connection parameters.
Root Cause:
Production environment had been manually modified with custom configuration settings that weren't captured in the configuration management system. These manual changes were overwritten during the automated deployment, causing the application to use incorrect connection parameters.
Fix/Workaround:
• Short-term: Restored the critical configuration settings manually:
# Restore critical settings
ssh prod-server "sudo sed -i 's/timeout: 30/timeout: 120/' /etc/app/config.yml"
ssh prod-server "sudo systemctl restart app-service"
• Long-term: Implemented proper configuration management with Ansible:
# inventory/group_vars/production.yml
---
app_config:
  timeout: 120
  retry_attempts: 5
  connection_pool_size: 50

# inventory/group_vars/staging.yml
---
app_config:
  timeout: 30
  retry_attempts: 3
  connection_pool_size: 20
# playbooks/templates/config.yml.j2
---
application:
  name: example-app
  environment: {{ environment_name }}
  connections:
    database:
      host: {{ db_host }}
      port: {{ db_port }}
      timeout: {{ app_config.timeout }}
      retry_attempts: {{ app_config.retry_attempts }}
    cache:
      host: {{ cache_host }}
      port: {{ cache_port }}
      connection_pool_size: {{ app_config.connection_pool_size }}
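To tie the group variables and template above together, a deployment task would render the environment-specific configuration onto each host. The following fragment is a sketch rather than the actual playbook; the play layout, destination path, and handler name are illustrative.

# playbooks/deploy_app.yml (illustrative)
---
- name: Deploy application configuration
  hosts: all
  vars:
    environment_name: "{{ group_names | first }}"
  tasks:
    - name: Render config.yml from the shared template
      ansible.builtin.template:
        src: templates/config.yml.j2
        dest: /etc/app/config.yml
      notify: Restart app-service
  handlers:
    - name: Restart app-service
      ansible.builtin.service:
        name: app-service
        state: restarted

Because the same template is rendered everywhere and only the group variables differ, any environment-specific setting has to live in version-controlled inventory rather than on the servers themselves.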
• Added configuration validation checks to the deployment process:
# playbooks/validate_config.yml
---
- name: Validate configuration
  hosts: all
  tasks:
    - name: Extract current configuration
      shell: grep -A 10 'connections:' /etc/app/config.yml
      register: current_config
      changed_when: false

    # expected_config_output is assumed to be defined in inventory (e.g. group_vars)
    - name: Display configuration differences
      debug:
        msg: "Current configuration differs from expected"
      when: current_config.stdout != expected_config_output

    - name: Fail if configuration validation is enabled and config differs
      fail:
        msg: "Configuration validation failed. Run with --extra-vars 'skip_validation=true' to bypass."
      when:
        - current_config.stdout != expected_config_output
        - skip_validation is not defined or not skip_validation
Lessons Learned:
Configuration management must capture all environment-specific settings and prevent manual drift.
How to Avoid:
Implement comprehensive configuration management with version control.
Document all environment-specific configurations.
Add configuration validation to deployment processes.
Implement immutable infrastructure where possible.
Use automated drift detection and remediation.
Summary: Production database credentials were printed in plain text in CI/CD build logs for months because the pipeline echoed all environment variables and stored secrets directly in job configuration.
What Happened:
A security audit discovered that production database credentials were visible in plain text in CI/CD build logs. The credentials had been exposed for several months, creating a significant security risk.
Diagnosis Steps:
Examined CI/CD build logs to confirm credential exposure.
Reviewed how credentials were being passed to the build process.
Checked application configuration files in the repository.
Analyzed Jenkins job configurations and pipeline scripts.
Traced the credential flow from source to build logs.
Root Cause:
The application used environment variables for configuration, including sensitive credentials. The CI/CD pipeline was configured to echo all environment variables for debugging purposes, which caused the credentials to be printed in the build logs. Additionally, the credentials were stored directly in the pipeline configuration rather than using a secure credential store.
Fix/Workaround:
• Short-term: Immediately rotated all exposed credentials.
• Removed debug statements that printed environment variables:
// Jenkinsfile - Before
stage('Debug') {
    steps {
        sh 'printenv | sort' // Dangerous!
    }
}

// Jenkinsfile - After
stage('Debug') {
    steps {
        sh '''
            # Print only safe environment variables
            printenv | grep -v "PASSWORD\\|SECRET\\|KEY\\|TOKEN" | sort
        '''
    }
}
• Implemented proper secret management using Jenkins Credentials:
// Jenkinsfile with proper credential handling
pipeline {
    agent any

    environment {
        // Credentials bound as environment variables
        DB_CREDS = credentials('production-db-credentials')
    }

    stages {
        stage('Deploy') {
            steps {
                // DB_CREDS_USR and DB_CREDS_PSW are automatically masked in logs
                sh '''
                    echo "Connecting to database as user: $DB_CREDS_USR"
                    ./deploy.sh
                '''
            }
        }
    }
}
• Added credential masking in build logs:
// Jenkins plugin configuration
// Illustrative extension: consult the credentials-binding plugin documentation
// for the exact SecretPatternFactory API before using this in a real plugin.
import java.util.regex.Pattern;

import hudson.util.Secret;
import org.jenkinsci.plugins.credentialsbinding.masking.SecretPatternFactory;

public class CustomSecretPatternFactory extends SecretPatternFactory {
    @Override
    public Pattern createPattern(Secret secret) {
        String plainText = secret.getPlainText();
        // Create a pattern that matches the secret even with surrounding characters
        return Pattern.compile("(?i).*" + Pattern.quote(plainText) + ".*");
    }
}
Lessons Learned:
Configuration management must include proper handling of secrets throughout the entire pipeline.
How to Avoid:
Use dedicated secret management solutions (HashiCorp Vault, AWS Secrets Manager, etc.); a sketch of one approach follows this list.
Implement credential masking in CI/CD systems.
Avoid printing environment variables in build logs.
Regularly audit build logs for exposed secrets.
Implement automated secret scanning in CI/CD pipelines.
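As one concrete example of the first point, an Ansible deployment can fetch credentials from HashiCorp Vault at run time instead of storing them in pipeline configuration. This sketch assumes the community.hashi_vault collection is installed and that a KV v2 secret exists at secret/app/db with username and password fields; check the collection documentation for the exact options and return structure.

# playbooks/deploy_with_vault.yml (illustrative)
---
- name: Deploy using credentials fetched from Vault
  hosts: production
  tasks:
    - name: Read database credentials from HashiCorp Vault
      community.hashi_vault.vault_kv2_get:
        url: https://vault.example.com:8200
        path: app/db                   # under the default 'secret' mount
        auth_method: token
        token: "{{ vault_token }}"
      register: db_secret
      no_log: true                     # keep the secret out of Ansible output
      delegate_to: localhost

    - name: Run the deployment with credentials passed via environment
      ansible.builtin.command: ./deploy.sh
      environment:
        DB_USER: "{{ db_secret.secret.username }}"
        DB_PASSWORD: "{{ db_secret.secret.password }}"
      no_log: true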
Summary: Configuration drift accumulated from manual hotfixes and emergency changes caused a new e-commerce feature to fail in production despite passing in development and staging.
What Happened:
A retail company deployed a new feature to their e-commerce platform. The feature worked perfectly in development and staging environments but failed in production. Investigation revealed that configuration drift had occurred between environments, with production having significantly different configurations than development and staging. The drift had accumulated over time due to manual hotfixes and emergency changes that bypassed the standard configuration management process.
Diagnosis Steps:
Compared configuration files across all environments.
Reviewed change history and deployment logs.
Analyzed application behavior differences.
Examined infrastructure provisioning processes.
Audited manual change procedures and emergency protocols.
Root Cause:
The investigation revealed multiple issues with configuration management:
1. Emergency changes were made directly in production without proper documentation.
2. Configuration validation was not part of the deployment pipeline.
3. No automated drift detection was implemented.
4. Configuration templates contained environment-specific hardcoding.
5. Different teams had different access patterns to configuration management.
Fix/Workaround:
• Implemented immediate fixes to align production with the other environments.
• Created comprehensive configuration validation in the CI/CD pipeline.
• Developed automated drift detection and alerting.
• Refactored configuration templates to remove hardcoding (see the before/after sketch following this list).
• Established clear protocols for emergency changes.
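The hardcoding removal can be illustrated with a before/after template fragment; the service endpoint and variable names below are invented for illustration rather than taken from the platform's actual templates.

# templates/app.properties.j2 -- before: environment details baked into the template
payment_service_url=https://payments.prod.internal:8443
feature_flags=checkout_v2,gift_cards

# templates/app.properties.j2 -- after: values come from per-environment group_vars
payment_service_url={{ payment_service_url }}
feature_flags={{ enabled_feature_flags | join(',') }}

# inventory/group_vars/production.yml (excerpt)
payment_service_url: https://payments.prod.internal:8443
enabled_feature_flags:
  - checkout_v2
  - gift_cards

With the template made environment-agnostic, promoting a change from staging to production no longer depends on remembering which values were hand-edited where.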
Lessons Learned:
Configuration drift can occur gradually and remain undetected until it causes significant issues.
How to Avoid:
Implement automated configuration validation as part of CI/CD.
Create regular drift detection and reporting.
Establish clear protocols for emergency changes with proper documentation.
Use infrastructure as code to manage all environment configurations.
Implement configuration auditing and compliance checking.