# Staggered Machine Update and Reboot System Research

## Overview

Research into implementing an automated system for updating and rebooting all lab machines in a staggered fashion using Nix, cronjobs, and our existing lab tool infrastructure.

## Goals

- Minimize downtime by updating machines in waves
- Ensure system stability with gradual rollouts
- Leverage Nix's atomic updates and rollback capabilities
- Integrate with the existing lab tool for orchestration
- Provide monitoring and failure recovery

## Architecture Components

### 1. Update Controller

- Central orchestrator running on the management node
- Maintains machine groups and update schedules
- Coordinates staggered execution
- Monitors update progress and health

### 2. Machine Groups

```
Group 1: Non-critical services (dev environments, testing)
Group 2: Infrastructure services (monitoring, logging)
Group 3: Critical services (databases, core applications)
Group 4: Management nodes (controllers, orchestrators)
```

### 3. Nix Integration

- Use `nixos-rebuild switch` for atomic updates
- Leverage Nix generations for rollback capability
- Update channels/flakes before rebuilding
- Validate the configuration before applying it

### 4. Lab Tool Integration

- Extend the lab tool with update management commands
- Machine inventory and grouping
- Health check integration
- Status reporting and logging

## Implementation Strategy

### Phase 1: Basic Staggered Updates

```bash
# Example workflow per machine group
lab update prepare --group=dev
lab update execute --group=dev --wait-for-completion
lab update verify --group=dev

lab update prepare --group=infrastructure
# Continue with next group...
```

### Phase 2: Enhanced Orchestration

- Dependency-aware scheduling
- Health checks before proceeding to the next group
- Automatic rollback on failures
- Notification system

### Phase 3: Advanced Features

- Blue-green deployments for critical services
- Canary releases
- Integration with monitoring systems
- Custom update policies per service

## Cronjob Design

### Master Cron Schedule

```cron
# Weekly full system update - Sundays at 2 AM
0 2 * * 0 /home/geir/Home-lab/scripts/staggered-update.sh

# Daily security updates for critical machines
0 3 * * * /home/geir/Home-lab/scripts/security-update.sh --group=critical

# Health check and cleanup
0 1 * * * /home/geir/Home-lab/scripts/update-health-check.sh
```

### Update Script Structure

```bash
#!/usr/bin/env bash
# staggered-update.sh
set -euo pipefail

# Configuration
GROUPS=("dev" "infrastructure" "critical" "management")
STAGGER_DELAY=30m
MAX_PARALLEL=3

# Log setup
LOG_DIR="/var/log/lab-updates"
LOG_FILE="$LOG_DIR/update-$(date +%Y%m%d-%H%M%S).log"
mkdir -p "$LOG_DIR"
exec > >(tee -a "$LOG_FILE") 2>&1

for group in "${GROUPS[@]}"; do
    echo "Starting update for group: $group"

    # Pre-update checks
    lab health-check --group="$group" || {
        echo "Health check failed for $group, skipping"
        continue
    }

    # Update Nix channels/flakes
    lab update prepare --group="$group"

    # Execute updates with parallelism control
    lab update execute --group="$group" --parallel="$MAX_PARALLEL"

    # Verify updates
    lab update verify --group="$group" || {
        echo "Verification failed for $group, initiating rollback"
        lab update rollback --group="$group"

        # Send alert
        lab notify --level=error --message="Update failed for $group, rolled back"
        exit 1
    }

    echo "Group $group updated successfully, waiting $STAGGER_DELAY"
    sleep "$STAGGER_DELAY"
done

echo "All groups updated successfully"
lab notify --level=info --message="Staggered update completed successfully"
```
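The script above delegates the actual Nix work to the lab tool. As a point of reference for the Nix Integration steps listed earlier (refresh flake inputs, validate, then switch atomically), here is a minimal sketch of what the per-machine portion of `lab update execute` could look like. The script name, host argument, flake location, and root SSH access are assumptions for illustration, not existing lab tool code.

```bash
#!/usr/bin/env bash
# per-machine-update.sh -- illustrative sketch, not part of the lab tool.
# Assumes a flake at /home/geir/Home-lab with one nixosConfiguration per host
# and root SSH access to the target machine.
set -euo pipefail

HOST="$1"                       # e.g. dev-01 (hypothetical host name)
FLAKE_DIR="/home/geir/Home-lab" # assumed flake location

# "Prepare": refresh flake inputs on the controller
(cd "$FLAKE_DIR" && nix flake update)

# Validate: build the new system closure without activating anything
nixos-rebuild build --flake "$FLAKE_DIR#$HOST"

# Activate atomically on the target; the previous generation stays available
# for `nixos-rebuild switch --rollback`
nixos-rebuild switch --flake "$FLAKE_DIR#$HOST" --target-host "root@$HOST"
```

Because `nixos-rebuild switch` creates a new generation rather than mutating the running system in place, a failed update can always be undone by switching back to the previous generation, which is what the rollback paths later in this document rely on.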
## Nix Configuration Management

### Centralized Configuration

```nix
# /home/geir/Home-lab/nix/update-config.nix
{
  updateGroups = {
    dev = {
      machines = [ "dev-01" "dev-02" "test-env" ];
      updatePolicy = "aggressive";
      maintenanceWindow = "02:00-06:00";
      allowReboot = true;
    };

    infrastructure = {
      machines = [ "monitor-01" "log-server" "backup-01" ];
      updatePolicy = "conservative";
      maintenanceWindow = "03:00-05:00";
      allowReboot = true;
      dependencies = [ "dev" ];
    };

    critical = {
      machines = [ "db-primary" "web-01" "web-02" ];
      updatePolicy = "manual-approval";
      maintenanceWindow = "04:00-05:00";
      allowReboot = false; # Requires manual reboot
      dependencies = [ "infrastructure" ];
    };
  };

  updateSettings = {
    maxParallel = 3;
    healthCheckTimeout = 300;
    rollbackOnFailure = true;
    notificationChannels = [ "email" "discord" ];
  };
}
```

### Machine-Specific Update Configurations

```nix
# On each machine: /etc/nixos/update-config.nix
{
  services.lab-updater = {
    enable = true;
    group = "infrastructure";

    preUpdateScript = ''
      # Stop non-critical services
      systemctl stop some-service
    '';

    postUpdateScript = ''
      # Restart services and verify
      systemctl start some-service
      curl -f http://localhost:8080/health
    '';

    rollbackScript = ''
      # Custom rollback procedures
      systemctl stop some-service
      nixos-rebuild switch --rollback
      systemctl start some-service
    '';
  };
}
```

## Lab Tool Extensions

### New Commands

```bash
# Update management
lab update prepare [--group=GROUP] [--machine=MACHINE]
lab update execute [--group=GROUP] [--parallel=N] [--dry-run]
lab update verify [--group=GROUP]
lab update rollback [--group=GROUP] [--to-generation=N]
lab update status [--group=GROUP]

# Health and monitoring
lab health-check [--group=GROUP] [--timeout=SECONDS]
lab update-history [--group=GROUP] [--days=N]
lab notify [--level=LEVEL] [--message=MSG] [--channel=CHANNEL]

# Configuration
lab update-config show [--group=GROUP]
lab update-config set [--group=GROUP] [--key=KEY] [--value=VALUE]
```

### Integration Points

```python
# lab/commands/update.py
class UpdateCommand:
    def prepare(self, group=None, machine=None):
        """Prepare machines for updates"""
        # Update Nix channels/flakes
        # Pre-update health checks
        # Download packages

    def execute(self, group=None, parallel=1, dry_run=False):
        """Execute updates on machines"""
        # Run nixos-rebuild switch
        # Monitor progress
        # Handle failures

    def verify(self, group=None):
        """Verify updates completed successfully"""
        # Check system health
        # Verify services
        # Compare generations
```

## Monitoring and Alerting

### Health Checks

- Service availability checks
- Resource usage monitoring
- System log analysis
- Network connectivity tests

### Alerting Triggers

- Update failures
- Health check failures
- Rollback events
- Long-running updates

### Notification Channels

- Email notifications
- Discord/Slack integration
- Dashboard updates
- Log aggregation

## Safety Mechanisms

### Pre-Update Validation

- Configuration syntax checking
- Dependency verification
- Resource availability checks
- Backup verification

### During Update

- Progress monitoring
- Timeout handling
- Partial failure recovery
- Emergency stop capability

### Post-Update

- Service verification
- Performance monitoring
- Automatic rollback triggers
- Success confirmation

## Rollback Strategy

### Automatic Rollback Triggers

- Health check failures
- Service startup failures
- Critical error detection
- Timeout exceeded
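A minimal sketch of how the first two triggers above could be wired to an automatic rollback, roughly what `lab update verify` might do per machine. It reuses the `http://localhost:8080/health` probe from the machine-specific configuration example; the script name, timeout value, and root SSH access are illustrative assumptions, not existing lab tool behaviour.

```bash
#!/usr/bin/env bash
# verify-or-rollback.sh -- illustrative sketch of an automatic rollback trigger.
# Assumes root SSH access to the target host; host name and timeout are
# placeholders for the example.
set -euo pipefail

HOST="$1"     # e.g. web-01 (hypothetical host name)
TIMEOUT=300   # mirrors updateSettings.healthCheckTimeout above

post_update_healthy() {
  # Fail if systemd settles into a degraded state (covers service startup failures)
  ssh "root@$HOST" "timeout $TIMEOUT systemctl is-system-running --wait" || return 1
  # Probe the same health endpoint the machine-specific config checks
  ssh "root@$HOST" "curl -fsS --max-time 10 http://localhost:8080/health" >/dev/null
}

if post_update_healthy; then
  echo "$HOST passed post-update checks"
else
  echo "$HOST failed post-update checks, rolling back to the previous generation"
  ssh "root@$HOST" "nixos-rebuild switch --rollback"
  lab notify --level=error --message="Rolled back $HOST after failed post-update checks"
  exit 1
fi
```

The outer `timeout` bounds how long `systemctl is-system-running --wait` may block, which also covers the "timeout exceeded" trigger; critical error detection (for example, scanning the journal) would layer onto the same pattern.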
### Manual Rollback

```bash
# Quick rollback to previous generation
lab update rollback --group=critical --immediate

# Rollback to specific generation
lab update rollback --group=infrastructure --to-generation=150

# Selective rollback (specific machines)
lab update rollback --machine=db-primary,web-01
```

## Testing Strategy

### Development Environment

- Test updates in isolated environment
- Validate scripts and configurations
- Performance testing
- Failure scenario testing

### Staging Rollout

- Deploy to staging group first
- Automated testing suite
- Manual verification
- Production deployment

## Security Considerations

- Secure communication channels
- Authentication for update commands
- Audit logging
- Access control for update scripts
- Encrypted configuration storage

## Future Enhancements

### Advanced Scheduling

- Maintenance window management
- Business hour awareness
- Holiday scheduling
- Emergency update capabilities

### Intelligence Features

- Machine learning for optimal timing
- Predictive failure detection
- Automatic dependency discovery
- Performance impact analysis

### Integration Expansions

- CI/CD pipeline integration
- Cloud provider APIs
- Container orchestration
- Configuration management systems

## Implementation Roadmap

### Phase 1 (2 weeks)

- Basic staggered update script
- Simple group management
- Nix integration
- Basic health checks

### Phase 2 (4 weeks)

- Lab tool integration
- Advanced scheduling
- Monitoring and alerting
- Rollback mechanisms

### Phase 3 (6 weeks)

- Advanced features
- Performance optimization
- Extended integrations
- Documentation and training

## Resources and References

- NixOS Manual: System Administration
- Cron Best Practices
- Blue-Green Deployment Patterns
- Infrastructure as Code Principles
- Monitoring and Observability Patterns

## Conclusion

A well-designed staggered update system will significantly improve lab maintenance efficiency while reducing risk. The combination of Nix's atomic updates, careful orchestration, and comprehensive monitoring provides a robust foundation for automated infrastructure management.