Geir Okkenhaug Jerstad 12fb56f35b some research and loose thoughts

2025-06-20 15:32:34 +02:00

9.4 KiB

Staggered Machine Update and Reboot System Research

Overview

This document explores an automated system that updates and reboots all lab machines in staggered waves, using Nix, cron jobs, and our existing lab tool infrastructure.

Goals

Minimize downtime by updating machines in waves
Ensure system stability with gradual rollouts
Leverage Nix's atomic updates and rollback capabilities
Integrate with existing lab tool for orchestration
Provide monitoring and failure recovery

Architecture Components

1. Update Controller

Central orchestrator running on management node
Maintains machine groups and update schedules
Coordinates staggered execution
Monitors update progress and health

2. Machine Groups

Group 1: Non-critical services (dev environments, testing)
Group 2: Infrastructure services (monitoring, logging)
Group 3: Critical services (databases, core applications)
Group 4: Management nodes (controllers, orchestrators)

3. Nix Integration

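Use nixos-rebuild switch to activate a new generation atomically (kernel and initrd changes still need a reboot to take effect)
Leverage Nix generations for rollback capability
Update channels or flake inputs before rebuilding
Validate the configuration before applying, for example with nixos-rebuild dry-build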
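
The steps above could be turned into a per-machine command plan roughly as follows. This is a minimal sketch that assumes a flake-based setup: build_update_plan, the Step type, the flake directory, and the root@<host> SSH target are hypothetical, while nix flake update, nixos-rebuild dry-build, and the --flake / --target-host options are standard Nix tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    argv: List[str]
    cwd: Optional[str] = None  # directory to run the command in, if any


def build_update_plan(host: str, flake_dir: str) -> List[Step]:
    """Return the commands to update one machine, in order.

    Assumes the flake exposes nixosConfigurations.<host> and that the
    controller can reach the machine over SSH as root@<host>.
    """
    flake_ref = f"{flake_dir}#{host}"
    target = f"root@{host}"
    return [
        # 1. Refresh flake inputs (run inside the flake directory).
        Step(["nix", "flake", "update"], cwd=flake_dir),
        # 2. Validate: evaluate and list what would be built, without building.
        Step(["nixos-rebuild", "dry-build", "--flake", flake_ref]),
        # 3. Build and activate the new generation on the target machine.
        Step(["nixos-rebuild", "switch", "--flake", flake_ref,
              "--target-host", target]),
    ]
```

Returning the plan as data rather than running it directly keeps --dry-run trivial: print the steps instead of executing them.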

4. Lab Tool Integration

Extend lab tool with update management commands
Machine inventory and grouping
Health check integration
Status reporting and logging

Implementation Strategy

Phase 1: Basic Staggered Updates

# Example workflow per machine group
lab update prepare --group=dev
lab update execute --group=dev --wait-for-completion
lab update verify --group=dev
lab update prepare --group=infrastructure
# Continue with next group...

Phase 2: Enhanced Orchestration

Dependency-aware scheduling
Health checks before proceeding to next group
Automatic rollback on failures
Notification system
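
A sketch of how the health gate and automatic rollback could fit together. The function run_waves and its callback parameters are hypothetical; the real lab tool would plug its own update, health-check, rollback, and notify implementations into them.

```python
from typing import Callable, Dict, List


def run_waves(
    groups: List[str],
    update: Callable[[str], bool],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    notify: Callable[[str], None],
) -> Dict[str, str]:
    """Update groups in order; stop at the first group that fails."""
    status: Dict[str, str] = {}
    for group in groups:
        if update(group) and healthy(group):
            status[group] = "updated"
            continue
        # Failure: undo this group and do not touch the remaining ones.
        rollback(group)
        notify(f"Update failed for {group}, rolled back")
        status[group] = "rolled-back"
        break
    # Groups that were never reached are reported explicitly.
    for group in groups:
        status.setdefault(group, "not-started")
    return status
```

Stopping at the first failure means a bad update never reaches the later, more critical groups.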

Phase 3: Advanced Features

Blue-green deployments for critical services
Canary releases
Integration with monitoring systems
Custom update policies per service

Cronjob Design

Master Cron Schedule

# Weekly full system update - Sundays at 2 AM
0 2 * * 0 /home/geir/Home-lab/scripts/staggered-update.sh

# Daily security updates for critical machines (Mon-Sat; on Sundays the
# weekly run above reaches the critical group at around the same time)
0 3 * * 1-6 /home/geir/Home-lab/scripts/security-update.sh --group=critical

# Health check and cleanup
0 1 * * * /home/geir/Home-lab/scripts/update-health-check.sh
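
Cron starts each job whether or not an earlier one is still running, and a staggered run with 30-minute delays can easily last longer than the gap between schedule entries. One way to guard against two update runs touching the same machines is an exclusive lock file. A minimal sketch, assuming Linux and a lock path chosen by the caller; update_lock and AlreadyRunning are hypothetical names.

```python
import fcntl
import os
from contextlib import contextmanager


class AlreadyRunning(RuntimeError):
    """Raised when another update run holds the lock."""


@contextmanager
def update_lock(path: str):
    """Hold an exclusive, non-blocking flock on `path` for the duration."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise AlreadyRunning(path) from exc
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

The kernel drops the lock when the process exits, so a crashed run cannot leave a stale lock behind.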

Update Script Structure

#!/usr/bin/env bash
# staggered-update.sh

set -euo pipefail

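# Configuration
# GROUPS is a special Bash variable (the current user's group IDs) and
# assignments to it are silently ignored, so the list needs another name.
UPDATE_GROUPS=("dev" "infrastructure" "critical" "management")
STAGGER_DELAY=30m
MAX_PARALLEL=3

# Log setup
LOG_DIR="/var/log/lab-updates"
LOG_FILE="$LOG_DIR/update-$(date +%Y%m%d-%H%M%S).log"

# Make sure the log directory exists before redirecting output into it
mkdir -p "$LOG_DIR"
exec > >(tee -a "$LOG_FILE") 2>&1

for group in "${UPDATE_GROUPS[@]}"; do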
    echo "Starting update for group: $group"
    
    # Pre-update checks
    lab health-check --group="$group" || {
        echo "Health check failed for $group, skipping"
        continue
    }
    
    # Update Nix channels/flakes
    lab update prepare --group="$group"
    
    # Execute updates with parallelism control
    lab update execute --group="$group" --parallel="$MAX_PARALLEL"
    
    # Verify updates
    lab update verify --group="$group" || {
        echo "Verification failed for $group, initiating rollback"
        lab update rollback --group="$group"
        # Send alert
        lab notify --level=error --message="Update failed for $group, rolled back"
        exit 1
    }
    
    echo "Group $group updated successfully, waiting $STAGGER_DELAY"
    sleep "$STAGGER_DELAY"
done

echo "All groups updated successfully"
lab notify --level=info --message="Staggered update completed successfully"
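
The --parallel option is where the parallelism control would live inside the lab tool. A minimal sketch of one way to cap concurrency with a thread pool; update_group and update_one are hypothetical names, and update_one stands in for whatever actually rebuilds a single machine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def update_group(
    machines: List[str],
    update_one: Callable[[str], bool],
    max_parallel: int = 3,
) -> Dict[str, bool]:
    """Update machines with at most `max_parallel` running at once."""

    def safe_update(machine: str) -> bool:
        try:
            return update_one(machine)
        except Exception:
            # One broken machine must not abort the rest of the group.
            return False

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(safe_update, machines))
    return dict(zip(machines, results))
```

Threads are enough here because the work is waiting on SSH and remote builds rather than local CPU.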

Nix Configuration Management

Centralized Configuration

# /home/geir/Home-lab/nix/update-config.nix
{
  updateGroups = {
    dev = {
      machines = [ "dev-01" "dev-02" "test-env" ];
      updatePolicy = "aggressive";
      maintenanceWindow = "02:00-06:00";
      allowReboot = true;
    };
    
    infrastructure = {
      machines = [ "monitor-01" "log-server" "backup-01" ];
      updatePolicy = "conservative";
      maintenanceWindow = "03:00-05:00";
      allowReboot = true;
      dependencies = [ "dev" ];
    };
    
    critical = {
      machines = [ "db-primary" "web-01" "web-02" ];
      updatePolicy = "manual-approval";
      maintenanceWindow = "04:00-05:00";
      allowReboot = false;  # Requires manual reboot
      dependencies = [ "infrastructure" ];
    };
    # TODO: define the "management" group, which staggered-update.sh also iterates over
  };
  
  updateSettings = {
    maxParallel = 3;
    healthCheckTimeout = 300;
    rollbackOnFailure = true;
    notificationChannels = [ "email" "discord" ];
  };
}
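
The dependencies attributes imply an update order. A sketch of deriving it with a topological sort, assuming the attribute set has been exported to a Python dict (for example via nix-instantiate --eval --strict --json); update_order is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List


def update_order(groups: Dict[str, dict]) -> List[str]:
    """Order groups so each one comes after everything it depends on."""
    graph = {
        name: set(cfg.get("dependencies", []))
        for name, cfg in groups.items()
    }
    # Raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(graph).static_order())


# Mirrors the dependency fields of updateGroups above.
UPDATE_GROUPS = {
    "critical": {"dependencies": ["infrastructure"]},
    "infrastructure": {"dependencies": ["dev"]},
    "dev": {},
}
```

With this data, update_order(UPDATE_GROUPS) yields dev, infrastructure, critical regardless of the order in which the groups are declared.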

Machine-Specific Update Configurations

# On each machine: /etc/nixos/update-config.nix
{
  services.lab-updater = {
    enable = true;
    group = "infrastructure";
    preUpdateScript = ''
      # Stop non-critical services
      systemctl stop some-service
    '';
    postUpdateScript = ''
      # Restart services and verify
      systemctl start some-service
      curl -f http://localhost:8080/health
    '';
    rollbackScript = ''
      # Custom rollback procedures
      systemctl stop some-service
      nixos-rebuild switch --rollback
      systemctl start some-service
    '';
  };
}
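
The intended ordering of the three hooks could look like this. A sketch only: update_with_hooks is a hypothetical name, and each callable stands in for running the corresponding script or the rebuild itself, raising an exception on failure.

```python
from typing import Callable


def update_with_hooks(
    pre: Callable[[], None],
    rebuild: Callable[[], None],
    post: Callable[[], None],
    rollback: Callable[[], None],
) -> bool:
    """Run pre -> rebuild -> post; run rollback if rebuild or post fails."""
    pre()  # a failure here propagates: nothing has been rebuilt yet
    try:
        rebuild()
        post()  # includes the health probe, so a failed probe triggers rollback
    except Exception:
        rollback()
        return False
    return True
```

Treating a failed postUpdateScript the same as a failed rebuild is what lets the curl health probe act as a rollback trigger.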

Lab Tool Extensions

New Commands

# Update management
lab update prepare [--group=GROUP] [--machine=MACHINE]
lab update execute [--group=GROUP] [--parallel=N] [--dry-run] [--wait-for-completion]
lab update verify [--group=GROUP]
lab update rollback [--group=GROUP] [--machine=MACHINE[,MACHINE...]] [--to-generation=N] [--immediate]
lab update status [--group=GROUP]

# Health and monitoring
lab health-check [--group=GROUP] [--timeout=SECONDS]
lab update-history [--group=GROUP] [--days=N]
lab notify [--level=LEVEL] [--message=MSG] [--channel=CHANNEL]

# Configuration
lab update-config show [--group=GROUP]
lab update-config set [--group=GROUP] [--key=KEY] [--value=VALUE]
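
A possible argparse layout for part of the lab update command tree. This is a sketch under the assumption that the lab tool uses argparse (the notes do not say); build_parser is a hypothetical name and only a subset of the options is wired up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="lab")
    commands = parser.add_subparsers(dest="command", required=True)

    update = commands.add_parser("update", help="update management")
    actions = update.add_subparsers(dest="action", required=True)

    prepare = actions.add_parser("prepare")
    prepare.add_argument("--group")
    prepare.add_argument("--machine")

    execute = actions.add_parser("execute")
    execute.add_argument("--group")
    execute.add_argument("--parallel", type=int, default=1)
    execute.add_argument("--dry-run", action="store_true")

    rollback = actions.add_parser("rollback")
    rollback.add_argument("--group")
    rollback.add_argument("--to-generation", type=int)

    # verify and status only take an optional group.
    for name in ("verify", "status"):
        actions.add_parser(name).add_argument("--group")

    return parser
```

For example, build_parser().parse_args(["update", "execute", "--group=dev", "--parallel=3"]) gives a namespace with action "execute", group "dev", and parallel 3.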

Integration Points

# lab/commands/update.py
class UpdateCommand:
    """Skeleton for the `lab update` subcommands; bodies are still to be written."""

    def prepare(self, group=None, machine=None):
        """Prepare machines for updates."""
        # Update Nix channels/flakes
        # Pre-update health checks
        # Download packages
        raise NotImplementedError

    def execute(self, group=None, parallel=1, dry_run=False):
        """Execute updates on machines."""
        # Run nixos-rebuild switch
        # Monitor progress
        # Handle failures
        raise NotImplementedError

    def verify(self, group=None):
        """Verify updates completed successfully."""
        # Check system health
        # Verify services
        # Compare generations
        raise NotImplementedError
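
Part of verification is deciding whether a machine still needs a reboot after nixos-rebuild switch. On NixOS, /run/booted-system points at the system that was booted and /run/current-system at the one that is currently activated, so comparing their kernel-related entries is a common check. A minimal sketch; needs_reboot is a hypothetical helper and the run_dir parameter exists only to make it testable.

```python
from pathlib import Path


def needs_reboot(run_dir: Path = Path("/run")) -> bool:
    """True if the activated system's kernel bits differ from the booted ones."""
    booted = run_dir / "booted-system"
    current = run_dir / "current-system"
    for part in ("kernel", "initrd", "kernel-modules"):
        # Each entry is a symlink into the Nix store; compare the targets.
        if (booted / part).resolve() != (current / part).resolve():
            return True
    return False
```

Combined with the allowReboot setting, this lets the controller reboot only the machines that actually picked up a new kernel.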

Monitoring and Alerting

Health Checks

Service availability checks
Resource usage monitoring
System log analysis
Network connectivity tests
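
The simplest form of a service availability check is a TCP connect with a timeout. A minimal sketch using only the standard library; port_is_open is a hypothetical helper, and real checks would likely add HTTP probes on top of it.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```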

Alerting Triggers

Update failures
Health check failures
Rollback events
Long-running updates

Notification Channels

Email notifications
Discord/Slack integration
Dashboard updates
Log aggregation

Safety Mechanisms

Pre-Update Validation

Configuration syntax checking
Dependency verification
Resource availability checks
Backup verification

During Update

Progress monitoring
Timeout handling
Partial failure recovery
Emergency stop capability
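
Timeout handling for a single update command could be as small as this. A sketch that assumes the update runs as a local subprocess (for example an ssh or nixos-rebuild invocation); run_with_timeout is a hypothetical name.

```python
import subprocess
from typing import List, Optional


def run_with_timeout(argv: List[str], timeout_s: float) -> Optional[int]:
    """Run a command; return its exit code, or None if it timed out."""
    try:
        return subprocess.run(argv, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return None
```

Returning None for a timeout keeps it distinct from a non-zero exit code, so the caller can treat a hung update differently from a failed one.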

Post-Update

Service verification
Performance monitoring
Automatic rollback triggers
Success confirmation

Rollback Strategy

Automatic Rollback Triggers

Health check failures
Service startup failures
Critical error detection
Timeout exceeded

Manual Rollback

# Quick rollback to previous generation
lab update rollback --group=critical --immediate

# Rollback to specific generation
lab update rollback --group=infrastructure --to-generation=150

# Selective rollback (specific machines)
lab update rollback --machine=db-primary,web-01
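
Under the hood, the two rollback variants map onto different Nix commands. A sketch of what lab update rollback might run on each target machine as root; rollback_commands is a hypothetical helper, while the commands it returns are the usual NixOS ways to go back one generation or switch to a specific one.

```python
from typing import List, Optional

SYSTEM_PROFILE = "/nix/var/nix/profiles/system"


def rollback_commands(generation: Optional[int] = None) -> List[List[str]]:
    """Commands to run as root on the target machine."""
    if generation is None:
        # Previous generation: built into nixos-rebuild.
        return [["nixos-rebuild", "switch", "--rollback"]]
    return [
        # Point the system profile at the requested generation...
        ["nix-env", "--profile", SYSTEM_PROFILE,
         "--switch-generation", str(generation)],
        # ...then activate whatever the profile now points to.
        [f"{SYSTEM_PROFILE}/bin/switch-to-configuration", "switch"],
    ]
```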

Testing Strategy

Development Environment

Test updates in isolated environment
Validate scripts and configurations
Performance testing
Failure scenario testing

Staging Rollout

Deploy to staging group first
Automated testing suite
Manual verification
Production deployment

Security Considerations

Secure communication channels
Authentication for update commands
Audit logging
Access control for update scripts
Encrypted configuration storage

Future Enhancements

Advanced Scheduling

Maintenance window management
Business hour awareness
Holiday scheduling
Emergency update capabilities
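
Maintenance window management starts with a check like the one below. A minimal sketch that assumes windows are written as "HH:MM-HH:MM" in local time, as in the maintenanceWindow settings; in_window is a hypothetical helper.

```python
from datetime import datetime, time


def in_window(window: str, now: datetime) -> bool:
    """Check whether `now` falls inside a "HH:MM-HH:MM" window (end exclusive)."""
    start_s, end_s = window.split("-")
    start = time.fromisoformat(start_s)
    end = time.fromisoformat(end_s)
    current = now.time()
    if start <= end:
        return start <= current < end
    # Window wraps past midnight, e.g. "23:00-02:00".
    return current >= start or current < end
```

The controller could call this before each group and defer the group to the next run when its window has already closed.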

Intelligence Features

Machine learning for optimal timing
Predictive failure detection
Automatic dependency discovery
Performance impact analysis

Integration Expansions

CI/CD pipeline integration
Cloud provider APIs
Container orchestration
Configuration management systems

Implementation Roadmap

Phase 1 (2 weeks)

Basic staggered update script
Simple group management
Nix integration
Basic health checks

Phase 2 (4 weeks)

Lab tool integration
Advanced scheduling
Monitoring and alerting
Rollback mechanisms

Phase 3 (6 weeks)

Advanced features
Performance optimization
Extended integrations
Documentation and training

Conclusion

A staggered update system should make lab maintenance less manual while limiting how far a bad update can spread. Nix's atomic generation switching, group-by-group orchestration, and health-gated rollback together give a sound basis for automating it.
