some research and loose thoughts

2025-06-20 15:32:34 +02:00 · 12fb56f35b
commit 12fb56f35b
parent 076c38d829
7 changed files with 2160 additions and 5 deletions
--- a/research/staggered-update-system.md
+++ b/research/staggered-update-system.md
@@ -0,0 +1,402 @@
+# Staggered Machine Update and Reboot System Research
+
+## Overview
+
+Research into an automated system that updates and reboots all lab machines in staggered waves, using Nix, cron jobs, and our existing lab tool.
+
+## Goals
+
+- Minimize downtime by updating machines in waves
+- Ensure system stability with gradual rollouts
+- Leverage Nix's atomic updates and rollback capabilities
+- Integrate with existing lab tool for orchestration
+- Provide monitoring and failure recovery
+
+## Architecture Components
+
+### 1. Update Controller
+
+- Central orchestrator running on management node
+- Maintains machine groups and update schedules
+- Coordinates staggered execution
+- Monitors update progress and health
+
+### 2. Machine Groups
+
+```
+Group 1 (dev): Non-critical services (dev environments, testing)
+Group 2 (infrastructure): Infrastructure services (monitoring, logging)
+Group 3 (critical): Critical services (databases, core applications)
+Group 4 (management): Management nodes (controllers, orchestrators)
+```
+
+### 3. Nix Integration
+
+- Use `nixos-rebuild switch` for atomic updates
+- Leverage Nix generations for rollback capability
+- Update channels/flakes before rebuilding
+- Validate configuration before applying
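
As a rough sketch of how the steps above could be sequenced for one machine, the helper here only builds the command lines and runs nothing. `rebuild_commands` is a hypothetical name, and using `nixos-rebuild dry-build` as the validation step is an assumption about how the lab would wire this up, not something the notes prescribe.

```python
from typing import List, Optional


def rebuild_commands(flake: Optional[str] = None) -> List[List[str]]:
    """Return the per-machine commands in order: refresh inputs, validate, switch.

    `flake` is a flake reference such as ".#dev-01" (hypothetical host name).
    When it is None, a channel-based setup is assumed.
    """
    if flake is not None:
        # Assumed to run from the flake's directory so flake.lock is refreshed.
        refresh = ["nix", "flake", "update"]
        target = ["--flake", flake]
    else:
        refresh = ["nix-channel", "--update"]
        target = []
    return [
        refresh,
        # Evaluate the configuration and list what would be built, changing nothing.
        ["nixos-rebuild", "dry-build", *target],
        # Build and activate the new generation.
        ["nixos-rebuild", "switch", *target],
    ]


if __name__ == "__main__":
    for command in rebuild_commands(".#dev-01"):
        print(" ".join(command))
```

Keeping command construction separate from execution makes the sequence easy to unit-test and to print for a dry run.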
+
+### 4. Lab Tool Integration
+
+- Extend lab tool with update management commands
+- Machine inventory and grouping
+- Health check integration
+- Status reporting and logging
+
+## Implementation Strategy
+
+### Phase 1: Basic Staggered Updates
+
+```bash
+# Example workflow per machine group
+lab update prepare --group=dev
+lab update execute --group=dev --wait-for-completion
+lab update verify --group=dev
+lab update prepare --group=infrastructure
+# Continue with next group...
+```
+
+### Phase 2: Enhanced Orchestration
+
+- Dependency-aware scheduling
+- Health checks before proceeding to next group
+- Automatic rollback on failures
+- Notification system
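
Dependency-aware scheduling can be sketched with a topological sort from the standard library. The dependency map here is an assumption that mirrors the dev, infrastructure, critical chain used elsewhere in these notes, and `update_order` is a hypothetical helper name.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List

# group -> groups that must be updated first (assumed layout)
DEPENDENCIES: Dict[str, List[str]] = {
    "dev": [],
    "infrastructure": ["dev"],
    "critical": ["infrastructure"],
}


def update_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return groups in an order where every dependency comes first.

    Raises graphlib.CycleError if the groups depend on each other in a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(update_order(DEPENDENCIES))
```

A cycle in the map raises `CycleError` before any machine is touched, which is a cheap pre-flight check.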
+
+### Phase 3: Advanced Features
+
+- Blue-green deployments for critical services
+- Canary releases
+- Integration with monitoring systems
+- Custom update policies per service
+
+## Cronjob Design
+
+### Master Cron Schedule
+
+```cron
+# Weekly full system update - Sundays at 2 AM
+0 2 * * 0 /home/geir/Home-lab/scripts/staggered-update.sh
+
+# Daily security updates for critical machines
+0 3 * * * /home/geir/Home-lab/scripts/security-update.sh --group=critical
+
+# Health check and cleanup
+0 1 * * * /home/geir/Home-lab/scripts/update-health-check.sh
+```
+
+### Update Script Structure
+
+```bash
+#!/usr/bin/env bash
+# staggered-update.sh
+
+set -euo pipefail
+
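+# Configuration
+# Note: GROUPS is a special Bash variable (the current user's group IDs);
+# assignments to it are silently ignored, so a different name is used here.
+UPDATE_GROUPS=("dev" "infrastructure" "critical" "management")
+STAGGER_DELAY=30m
+MAX_PARALLEL=3
+
+# Log setup
+LOG_DIR="/var/log/lab-updates"
+LOG_FILE="$LOG_DIR/update-$(date +%Y%m%d-%H%M%S).log"
+
+# Make sure the log directory exists before tee tries to write to it
+mkdir -p "$LOG_DIR"
+exec > >(tee -a "$LOG_FILE") 2>&1
+
+for group in "${UPDATE_GROUPS[@]}"; do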
+    echo "Starting update for group: $group"
+    
+    # Pre-update checks
+    lab health-check --group="$group" || {
+        echo "Health check failed for $group, skipping"
+        continue
+    }
+    
+    # Update Nix channels/flakes
+    lab update prepare --group="$group"
+    
+    # Execute updates with parallelism control
+    lab update execute --group="$group" --parallel="$MAX_PARALLEL"
+    
+    # Verify updates
+    lab update verify --group="$group" || {
+        echo "Verification failed for $group, initiating rollback"
+        lab update rollback --group="$group"
+        # Send alert
+        lab notify --level=error --message="Update failed for $group, rolled back"
+        exit 1
+    }
+    
+    echo "Group $group updated successfully, waiting $STAGGER_DELAY"
+    sleep "$STAGGER_DELAY"
+done
+
+echo "All groups updated successfully"
+lab notify --level=info --message="Staggered update completed successfully"
+```
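
If the orchestration later moves from the shell script into the lab tool, the stagger delay and parallelism limit could be handled roughly as follows. This is a minimal sketch: `delay_seconds` and `batches` are hypothetical helpers, and the duration format is assumed to match what `sleep` accepts for a single value such as `30m`.

```python
import re
from typing import Iterator, List, Sequence

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def delay_seconds(text: str) -> int:
    """Convert a sleep-style duration such as '30m' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd]?)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    number, unit = match.groups()
    # A bare number means seconds, as with sleep.
    return int(number) * _UNIT_SECONDS[unit or "s"]


def batches(machines: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield machines in chunks of at most `size` (the parallelism limit)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(machines), size):
        yield list(machines[start:start + size])


if __name__ == "__main__":
    print(delay_seconds("30m"))
    print(list(batches(["dev-01", "dev-02", "test-env", "dev-03"], 3)))
```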
+
+## Nix Configuration Management
+
+### Centralized Configuration
+
+```nix
+# /home/geir/Home-lab/nix/update-config.nix
+{
+  updateGroups = {
+    dev = {
+      machines = [ "dev-01" "dev-02" "test-env" ];
+      updatePolicy = "aggressive";
+      maintenanceWindow = "02:00-06:00";
+      allowReboot = true;
+    };
+    
+    infrastructure = {
+      machines = [ "monitor-01" "log-server" "backup-01" ];
+      updatePolicy = "conservative";
+      maintenanceWindow = "03:00-05:00";
+      allowReboot = true;
+      dependencies = [ "dev" ];
+    };
+    
+    critical = {
+      machines = [ "db-primary" "web-01" "web-02" ];
+      updatePolicy = "manual-approval";
+      maintenanceWindow = "04:00-05:00";
+      allowReboot = false;  # Requires manual reboot
+      dependencies = [ "infrastructure" ];
+    };
+  };
+  
+  updateSettings = {
+    maxParallel = 3;
+    healthCheckTimeout = 300;
+    rollbackOnFailure = true;
+    notificationChannels = [ "email" "discord" ];
+  };
+}
+```
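
The `maintenanceWindow` strings need to be checked against the clock before a group is started. A minimal sketch, assuming the window is always written as `HH:MM-HH:MM` in local time (`parse_window` and `in_window` are hypothetical names):

```python
from datetime import time
from typing import Tuple


def parse_window(window: str) -> Tuple[time, time]:
    """Split 'HH:MM-HH:MM' into (start, end) times."""
    start_text, end_text = window.split("-")
    return time.fromisoformat(start_text), time.fromisoformat(end_text)


def in_window(now: time, window: str) -> bool:
    """Return True if `now` falls inside the window (end is exclusive)."""
    start, end = parse_window(window)
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, e.g. "22:00-02:00".
    return now >= start or now < end


if __name__ == "__main__":
    print(in_window(time(3, 30), "03:00-05:00"))
```

With a 02:00 cron start and a 30 minute stagger, a check like this would show whether a later group is reached before its own window opens.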
+
+### Machine-Specific Update Configurations
+
+```nix
+# On each machine: /etc/nixos/update-config.nix
+{
+  services.lab-updater = {
+    enable = true;
+    group = "infrastructure";
+    preUpdateScript = ''
+      # Stop non-critical services
+      systemctl stop some-service
+    '';
+    postUpdateScript = ''
+      # Restart services and verify
+      systemctl start some-service
+      curl -f http://localhost:8080/health
+    '';
+    rollbackScript = ''
+      # Custom rollback procedures
+      systemctl stop some-service
+      nixos-rebuild switch --rollback
+      systemctl start some-service
+    '';
+  };
+}
+```
+
+## Lab Tool Extensions
+
+### New Commands
+
+```bash
+# Update management
+lab update prepare [--group=GROUP] [--machine=MACHINE]
+lab update execute [--group=GROUP] [--parallel=N] [--dry-run]
+lab update verify [--group=GROUP]
+lab update rollback [--group=GROUP] [--machine=MACHINE[,MACHINE...]] [--to-generation=N] [--immediate]
+lab update status [--group=GROUP]
+
+# Health and monitoring
+lab health-check [--group=GROUP] [--timeout=SECONDS]
+lab update-history [--group=GROUP] [--days=N]
+lab notify [--level=LEVEL] [--message=MSG] [--channel=CHANNEL]
+
+# Configuration
+lab update-config show [--group=GROUP]
+lab update-config set [--group=GROUP] [--key=KEY] [--value=VALUE]
+```
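
One way the `lab update` subcommands could be declared is with nested `argparse` subparsers. This sketch covers only a subset of the flags listed above, and it assumes the lab tool uses `argparse`, which these notes do not confirm.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for `lab update <action>` (illustrative subset)."""
    parser = argparse.ArgumentParser(prog="lab")
    commands = parser.add_subparsers(dest="command", required=True)

    update = commands.add_parser("update", help="update management")
    actions = update.add_subparsers(dest="action", required=True)

    prepare = actions.add_parser("prepare")
    prepare.add_argument("--group")
    prepare.add_argument("--machine")

    execute = actions.add_parser("execute")
    execute.add_argument("--group")
    execute.add_argument("--parallel", type=int, default=1)
    execute.add_argument("--dry-run", action="store_true")

    verify = actions.add_parser("verify")
    verify.add_argument("--group")

    rollback = actions.add_parser("rollback")
    rollback.add_argument("--group")
    rollback.add_argument("--to-generation", type=int)

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["update", "execute", "--group=dev", "--parallel=3"])
    print(args.action, args.group, args.parallel, args.dry_run)
```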
+
+### Integration Points
+
+```python
+# lab/commands/update.py
+class UpdateCommand:
+    def prepare(self, group=None, machine=None):
+        """Prepare machines for updates"""
+        # Update Nix channels/flakes
+        # Pre-update health checks
+        # Download packages
+        
+    def execute(self, group=None, parallel=1, dry_run=False):
+        """Execute updates on machines"""
+        # Run nixos-rebuild switch
+        # Monitor progress
+        # Handle failures
+        
+    def verify(self, group=None):
+        """Verify updates completed successfully"""
+        # Check system health
+        # Verify services
+        # Compare generations
+```
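
The `parallel` argument of `execute` could map onto a thread pool. In this sketch the per-machine work is passed in as a callable, so the real `nixos-rebuild` call stays out of the example; `execute_group`, `failed_machines`, and the `update_one` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def execute_group(
    machines: Sequence[str],
    update_one: Callable[[str], bool],
    parallel: int = 1,
) -> Dict[str, bool]:
    """Run update_one on each machine, at most `parallel` at a time.

    Returns a mapping of machine name to success flag.
    """
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # pool.map yields results in the same order as `machines`.
        results = list(pool.map(update_one, machines))
    return dict(zip(machines, results))


def failed_machines(results: Dict[str, bool]) -> List[str]:
    """List the machines whose update did not succeed."""
    return [name for name, ok in results.items() if not ok]


if __name__ == "__main__":
    outcome = execute_group(["dev-01", "dev-02"], lambda name: name != "dev-02", parallel=2)
    print(outcome, failed_machines(outcome))
```

Threads are enough here because each worker would mostly wait on a remote command rather than use the CPU.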
+
+## Monitoring and Alerting
+
+### Health Checks
+
+- Service availability checks
+- Resource usage monitoring
+- System log analysis
+- Network connectivity tests
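
The simplest of these checks, network connectivity, can be sketched as a TCP connect with a timeout. `port_open` is a hypothetical helper; which hosts and ports count as "healthy" for each machine is left open here.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Example: check that SSH answers on a machine (host name is illustrative).
    print(port_open("dev-01", 22))
```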
+
+### Alerting Triggers
+
+- Update failures
+- Health check failures
+- Rollback events
+- Long-running updates
+
+### Notification Channels
+
+- Email notifications
+- Discord/Slack integration
+- Dashboard updates
+- Log aggregation
+
+## Safety Mechanisms
+
+### Pre-Update Validation
+
+- Configuration syntax checking
+- Dependency verification
+- Resource availability checks
+- Backup verification
+
+### During Update
+
+- Progress monitoring
+- Timeout handling
+- Partial failure recovery
+- Emergency stop capability
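
Timeout handling during an update could look roughly like this. The sketch wraps an arbitrary command and classifies the result; `run_with_timeout` is a hypothetical name, and the three status strings are an assumption about what the orchestrator would want to report.

```python
import subprocess
import sys
from typing import Sequence


def run_with_timeout(command: Sequence[str], timeout: float) -> str:
    """Run a command and return 'ok', 'failed', or 'timeout'."""
    try:
        completed = subprocess.run(command, timeout=timeout, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return "timeout"
    return "ok" if completed.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in for a real update command: a child process that exits cleanly.
    print(run_with_timeout([sys.executable, "-c", "pass"], timeout=30))
```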
+
+### Post-Update
+
+- Service verification
+- Performance monitoring
+- Automatic rollback triggers
+- Success confirmation
+
+## Rollback Strategy
+
+### Automatic Rollback Triggers
+
+- Health check failures
+- Service startup failures
+- Critical error detection
+- Timeout exceeded
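
The four triggers above can be folded into one decision function. This is a sketch under assumptions: `UpdateOutcome`, its fields, and the 1800 second default are illustrative and not taken from the lab tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UpdateOutcome:
    """Observed result of updating one machine (hypothetical structure)."""
    health_ok: bool
    services_started: bool
    critical_errors: int
    elapsed_seconds: float


def rollback_reasons(outcome: UpdateOutcome, timeout_seconds: float = 1800.0) -> List[str]:
    """Return every trigger that fired; an empty list means no rollback."""
    reasons = []
    if not outcome.health_ok:
        reasons.append("health check failed")
    if not outcome.services_started:
        reasons.append("service startup failed")
    if outcome.critical_errors > 0:
        reasons.append("critical errors detected")
    if outcome.elapsed_seconds > timeout_seconds:
        reasons.append("timeout exceeded")
    return reasons


if __name__ == "__main__":
    print(rollback_reasons(UpdateOutcome(False, True, 0, 120.0)))
```

Returning the reasons instead of a bare boolean gives the notification step something concrete to report.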
+
+### Manual Rollback
+
+```bash
+# Quick rollback to previous generation
+lab update rollback --group=critical --immediate
+
+# Rollback to specific generation
+lab update rollback --group=infrastructure --to-generation=150
+
+# Selective rollback (specific machines)
+lab update rollback --machine=db-primary,web-01
+```
+
+## Testing Strategy
+
+### Development Environment
+
+- Test updates in isolated environment
+- Validate scripts and configurations
+- Performance testing
+- Failure scenario testing
+
+### Staging Rollout
+
+- Deploy to staging group first
+- Automated testing suite
+- Manual verification
+- Production deployment
+
+## Security Considerations
+
+- Secure communication channels
+- Authentication for update commands
+- Audit logging
+- Access control for update scripts
+- Encrypted configuration storage
+
+## Future Enhancements
+
+### Advanced Scheduling
+
+- Maintenance window management
+- Business hour awareness
+- Holiday scheduling
+- Emergency update capabilities
+
+### Intelligence Features
+
+- Machine learning for optimal timing
+- Predictive failure detection
+- Automatic dependency discovery
+- Performance impact analysis
+
+### Integration Expansions
+
+- CI/CD pipeline integration
+- Cloud provider APIs
+- Container orchestration
+- Configuration management systems
+
+## Implementation Roadmap
+
+### Phase 1 (2 weeks)
+
+- Basic staggered update script
+- Simple group management
+- Nix integration
+- Basic health checks
+
+### Phase 2 (4 weeks)
+
+- Lab tool integration
+- Advanced scheduling
+- Monitoring and alerting
+- Rollback mechanisms
+
+### Phase 3 (6 weeks)
+
+- Advanced features
+- Performance optimization
+- Extended integrations
+- Documentation and training
+
+## Conclusion
+
+A staggered update system should cut manual lab maintenance while limiting the impact of a bad update to one group at a time. Nix's atomic switches and generation rollbacks, combined with group-by-group orchestration and health checks between waves, give a workable foundation for automating this.