some research and loose thoughts

Geir Okkenhaug Jerstad 2025-06-20 15:32:34 +02:00
parent 076c38d829
commit 12fb56f35b
7 changed files with 2160 additions and 5 deletions


@ -12,6 +12,7 @@ This roadmap outlines the complete integration of Retrieval Augmented Generation
### ✅ **Completed Components**
#### Task Master AI Core
- **Installation**: Claude Task Master AI successfully packaged for NixOS
- **Local Binary**: Available at `/home/geir/Home-lab/result/bin/task-master-ai`
- **Ollama Integration**: Configured to use local models (qwen3:4b, deepseek-r1:1.5b, gemma3:4b-it-qat)
@ -19,8 +20,9 @@ This roadmap outlines the complete integration of Retrieval Augmented Generation
- **VS Code Integration**: Configured for Cursor/VS Code with MCP protocol
#### Infrastructure Components
- **NixOS Service Module**: `rag-taskmaster.nix` implemented with full configuration options
- **Active Projects**:
- Home lab (deploy-rs integration): 90% complete (9/10 tasks done)
- Guile tooling migration: 12% complete (3/25 tasks done)
- **Documentation**: Comprehensive technical documentation in `/research/`
@ -28,23 +30,27 @@ This roadmap outlines the complete integration of Retrieval Augmented Generation
### 🔄 **In Progress**
#### RAG System Implementation
- **Status**: Planned but not yet deployed
- **Dependencies**: Need to implement RAG core components
- **Module Ready**: NixOS service module exists but needs RAG implementation
#### MCP Integration for RAG
- **Status**: Bridge architecture designed
- **Requirements**: Need to implement RAG MCP server alongside existing Task Master MCP
### 📋 **Outstanding Requirements**
#### Phase 1-3 Implementation Needed
1. **RAG Foundation** - Core RAG system with document indexing
2. **MCP RAG Server** - Separate MCP server for document queries
3. **Production Deployment** - Deploy services to grey-area server
4. **Cross-Service Integration** - Connect RAG and Task Master systems
### 🎯 **Current Active Focus**
- Deploy-rs integration project (nearly complete)
- Guile home lab tooling migration (early phase)
@ -55,6 +61,7 @@ This roadmap outlines the complete integration of Retrieval Augmented Generation
### ✅ **Completed Components**
#### Task Master AI Core
- **Installation**: Claude Task Master AI successfully packaged for NixOS
- **Local Binary**: Available at `/home/geir/Home-lab/result/bin/task-master-ai`
- **Ollama Integration**: Configured to use local models (qwen3:4b, deepseek-r1:1.5b, gemma3:4b-it-qat)
@ -62,8 +69,9 @@ This roadmap outlines the complete integration of Retrieval Augmented Generation
- **VS Code Integration**: Configured for Cursor/VS Code with MCP protocol
#### Infrastructure Components
- **NixOS Service Module**: `rag-taskmaster.nix` implemented with full configuration options
- **Active Projects**:
- Home lab (deploy-rs integration): 90% complete (9/10 tasks done)
- Guile tooling migration: 12% complete (3/25 tasks done)
- **Documentation**: Comprehensive technical documentation in `/research/`
@ -71,23 +79,27 @@ This roadmap outlines the complete integration of Retrieval Augmented Generation
### 🔄 **In Progress**
#### RAG System Implementation
- **Status**: Planned but not yet deployed
- **Dependencies**: Need to implement RAG core components
- **Module Ready**: NixOS service module exists but needs RAG implementation
#### MCP Integration for RAG
- **Status**: Bridge architecture designed
- **Requirements**: Need to implement RAG MCP server alongside existing Task Master MCP
### 📋 **Outstanding Requirements**
#### Phase 1-3 Implementation Needed
1. **RAG Foundation** - Core RAG system with document indexing
2. **MCP RAG Server** - Separate MCP server for document queries
3. **Production Deployment** - Deploy services to grey-area server
4. **Cross-Service Integration** - Connect RAG and Task Master systems
### 🎯 **Current Active Focus**
- Deploy-rs integration project (nearly complete)
- Guile home lab tooling migration (early phase)
@ -256,6 +268,7 @@ graph TB
**Status**: NixOS module exists, needs deployment and testing
**Completed Tasks**:
- ✅ NixOS module development (`rag-taskmaster.nix`)
- ✅ Service configuration templates
- ✅ User isolation and security configuration
@ -294,6 +307,7 @@ graph TB
**Status**: Core functionality complete, bridge integration needed
**Completed Tasks**:
- ✅ Task Master installation and packaging
- ✅ Ollama integration configuration
- ✅ MCP server with 25+ tools


@ -0,0 +1,535 @@
# Guile-Based Programmatic Configuration Strategy
## Overview
Research into implementing a robust, programmatic configuration system using Guile's strengths for the lab tool, moving beyond simple YAML to leverage Scheme's expressiveness and composability.
## Why Guile for Configuration?
### Advantages Over YAML
- **Programmable**: Logic, conditionals, functions in configuration
- **Composable**: Reusable configuration snippets and inheritance
- **Type Safety**: Scheme's type system prevents configuration errors
- **Extensible**: Custom DSL capabilities for lab-specific concepts
- **Dynamic**: Runtime configuration generation and validation
- **Functional**: Pure functions for configuration transformation
### Guile-Specific Benefits
- **S-expressions**: Natural data structure representation
- **Modules**: Clean separation of configuration concerns
- **Macros**: Custom syntax for common patterns
- **GOOPS**: Object-oriented configuration when needed
- **Records**: Structured data with validation
## Configuration Architecture
### Hierarchical Structure
```
config/
├── machines/ # Machine-specific configurations
│ ├── sleeper-service.scm
│ ├── grey-area.scm
│ ├── reverse-proxy.scm
│ └── orchestrator.scm
├── groups/ # Machine group definitions
│ ├── infrastructure.scm
│ ├── services.scm
│ └── development.scm
├── environments/ # Environment-specific configs
│ ├── production.scm
│ ├── staging.scm
│ └── development.scm
├── templates/ # Reusable configuration templates
│ ├── web-server.scm
│ ├── database.scm
│ └── monitoring.scm
└── base.scm # Core configuration framework
```
## Implementation Plan
### 1. Configuration Framework Module
```scheme
;; config/base.scm - Core configuration framework
(define-module (config base)
#:use-module (srfi srfi-9) ; Records
#:use-module (srfi srfi-1) ; Lists
#:use-module (ice-9 match)
#:use-module (ice-9 format)
#:export (define-machine
define-group
define-environment
machine?
group?
get-machine-config
get-group-machines
validate-config
merge-configs
resolve-inheritance))
;; Machine record type with validation
(define-record-type <machine>
(make-machine name hostname user services groups environment metadata)
machine?
(name machine-name)
(hostname machine-hostname)
(user machine-user)
(services machine-services)
(groups machine-groups)
(environment machine-environment)
(metadata machine-metadata))
;; Group record type
(define-record-type <group>
(make-group name machines deployment-order dependencies metadata)
group?
(name group-name)
(machines group-machines)
(deployment-order group-deployment-order)
(dependencies group-dependencies)
(metadata group-metadata))
;; Environment record type
(define-record-type <environment>
(make-environment name settings overrides)
environment?
(name environment-name)
(settings environment-settings)
(overrides environment-overrides))
;; Configuration DSL macros
(define-syntax define-machine
  (syntax-rules ()
    ((_ name hostname config ...)
     (let ((parsed (parse-machine-config (list config ...))))
       (make-machine 'name hostname
                     (assoc-ref parsed 'user)
                     (assoc-ref parsed 'services)
                     (assoc-ref parsed 'groups)
                     (assoc-ref parsed 'environment)
                     parsed)))))
(define-syntax define-group
  (syntax-rules ()
    ((_ name config ...)
     (let ((parsed (parse-machine-config (list config ...))))
       (make-group 'name
                   (assoc-ref parsed 'machines)
                   (assoc-ref parsed 'deployment-order)
                   (assoc-ref parsed 'dependencies)
                   parsed)))))
;; Pure function: Parse machine configuration
(define (parse-machine-config config-list)
"Parse machine configuration from keyword-value pairs"
(let loop ((config config-list)
(result '()))
(match config
(() result)
((#:user user . rest)
(loop rest (cons `(user . ,user) result)))
((#:services services . rest)
(loop rest (cons `(services . ,services) result)))
((#:groups groups . rest)
(loop rest (cons `(groups . ,groups) result)))
((#:environment env . rest)
(loop rest (cons `(environment . ,env) result)))
      ((key value . rest)
       (loop rest (cons (cons (if (keyword? key) (keyword->symbol key) key) value)
                        result))))))
;; Configuration inheritance resolver
(define (resolve-inheritance machine-config template-configs)
"Resolve configuration inheritance from templates"
(fold merge-configs machine-config template-configs))
;; Configuration merger
(define (merge-configs base-config override-config)
"Merge two configurations, with override taking precedence"
(append override-config
(filter (lambda (item)
(not (assoc (car item) override-config)))
base-config)))
;; Configuration validator
(define (validate-config config)
"Validate configuration completeness and consistency"
(and (assoc 'hostname config)
(assoc 'user config)
(string? (assoc-ref config 'hostname))
(string? (assoc-ref config 'user))))
```
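A quick REPL-style illustration of how `merge-configs` and `validate-config` behave; the alist values here are made up for the example, not taken from the real machines:

```scheme
;; Hypothetical base and override alists, for illustration only
(define base-config
  '((user . "root")
    (hostname . "web-01.local")
    (services . (nginx))))

(define override-config
  '((services . (nginx postgresql))))

(merge-configs base-config override-config)
;; => ((services . (nginx postgresql)) (user . "root") (hostname . "web-01.local"))
;;    override entries win, remaining base entries are kept

(validate-config (merge-configs base-config override-config))
;; => #t  ; hostname and user are present and are strings
```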
### 2. Machine Configuration Examples
```scheme
;; config/machines/sleeper-service.scm
(define-module (config machines sleeper-service)
#:use-module (config base)
#:use-module (config templates web-server)
#:export (sleeper-service-config))
(define sleeper-service-config
(define-machine sleeper-service "sleeper-service.local"
#:user "root"
#:services '(nginx postgresql redis)
#:groups '(infrastructure database)
#:environment 'production
#:ssh-port 22
#:deploy-strategy 'rolling
#:health-checks '((http "http://localhost:80/health")
(tcp 5432)
(tcp 6379))
#:dependencies '()
#:reboot-delay 0 ; First to reboot
#:backup-required #t
#:monitoring-enabled #t
#:metadata `((description . "Main application server")
(maintainer . "geir")
(criticality . high))))
;; config/machines/grey-area.scm
(define-module (config machines grey-area)
#:use-module (config base)
#:use-module (config templates monitoring)
#:export (grey-area-config))
(define grey-area-config
(define-machine grey-area "grey-area.local"
#:user "root"
#:services '(prometheus grafana alertmanager)
#:groups '(infrastructure monitoring)
#:environment 'production
#:ssh-port 22
#:deploy-strategy 'blue-green
#:health-checks '((http "http://localhost:3000/health")
(http "http://localhost:9090/-/healthy"))
#:dependencies '(sleeper-service)
#:reboot-delay 600 ; 10 minutes after sleeper-service
#:backup-required #f
#:monitoring-enabled #t
#:metadata `((description . "Monitoring and observability")
(maintainer . "geir")
(criticality . medium))))
;; config/machines/reverse-proxy.scm
(define-module (config machines reverse-proxy)
#:use-module (config base)
#:use-module (config templates proxy)
#:export (reverse-proxy-config))
(define reverse-proxy-config
(define-machine reverse-proxy "reverse-proxy.local"
#:user "root"
#:services '(nginx traefik)
#:groups '(infrastructure edge)
#:environment 'production
#:ssh-port 22
#:deploy-strategy 'rolling
#:health-checks '((http "http://localhost:80/health")
(tcp 443))
#:dependencies '(sleeper-service grey-area)
#:reboot-delay 1200 ; 20 minutes after sleeper-service
#:backup-required #f
#:monitoring-enabled #t
#:public-facing #t
#:ssl-certificates '("homelab.local" "*.homelab.local")
#:metadata `((description . "Edge proxy and load balancer")
(maintainer . "geir")
(criticality . high))))
```
### 3. Group Configuration
```scheme
;; config/groups/infrastructure.scm
(define-module (config groups infrastructure)
#:use-module (config base)
#:export (infrastructure-group))
(define infrastructure-group
(define-group infrastructure
#:machines '(sleeper-service grey-area reverse-proxy)
#:deployment-order '(sleeper-service grey-area reverse-proxy)
#:reboot-sequence '((sleeper-service . 0)
(grey-area . 600)
(reverse-proxy . 1200))
#:update-strategy 'staggered
#:rollback-strategy 'reverse-order
#:health-check-required #t
#:maintenance-window '(02:00 . 06:00)
#:notification-channels '(email discord)
#:metadata `((description . "Core infrastructure services")
(owner . "platform-team")
(sla . "99.9%"))))
;; config/groups/services.scm
(define-module (config groups services)
#:use-module (config base)
#:export (services-group))
(define services-group
(define-group services
#:machines '(app-server-01 app-server-02 worker-01)
#:deployment-order '(worker-01 app-server-01 app-server-02)
#:update-strategy 'rolling
#:canary-percentage 25
#:health-check-required #t
#:dependencies '(infrastructure)
#:metadata `((description . "Application services")
(owner . "application-team"))))
```
### 4. Template System
```scheme
;; config/templates/web-server.scm
(define-module (config templates web-server)
#:use-module (config base)
#:export (web-server-template))
(define web-server-template
'((services . (nginx))
(ports . (80 443))
(health-checks . ((http "http://localhost:80/health")))
(deploy-strategy . rolling)
(backup-required . #f)
(monitoring-enabled . #t)
(firewall-rules . ((allow 80 tcp)
(allow 443 tcp)))))
;; config/templates/database.scm
(define-module (config templates database)
#:use-module (config base)
#:export (database-template))
(define database-template
'((services . (postgresql))
(ports . (5432))
(health-checks . ((tcp 5432)
(pg-isready)))
(deploy-strategy . blue-green)
(backup-required . #t)
(backup-schedule . "0 2 * * *")
(monitoring-enabled . #t)
(replication-enabled . #f)
(firewall-rules . ((allow 5432 tcp internal)))))
```
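To make the inheritance mechanics concrete, here is a small sketch of a machine-specific alist inheriting from `database-template` via `resolve-inheritance`; the machine values are illustrative, not an existing host:

```scheme
;; Hypothetical db-01 overrides; everything not listed here is inherited
;; from database-template defined above.
(define db-01-overrides
  '((hostname . "db-01.local")
    (user . "root")
    (backup-schedule . "0 3 * * *")))  ; overrides the template default

(define db-01-config
  (resolve-inheritance db-01-overrides (list database-template)))

;; Machine-specific keys take precedence, the template fills in the rest:
;; (assoc-ref db-01-config 'backup-schedule) => "0 3 * * *"
;; (assoc-ref db-01-config 'deploy-strategy) => blue-green
```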
### 5. Configuration Loader Integration
```scheme
;; lab/config-loader.scm - Integration with existing lab tool
(define-module (lab config-loader)
  #:use-module (config base)
  #:use-module (config machines sleeper-service)
  #:use-module (config machines grey-area)
  #:use-module (config machines reverse-proxy)
  #:use-module (config groups infrastructure)
  #:use-module (utils logging)
  #:use-module (srfi srfi-1)        ; find
  #:export (load-lab-config
            get-all-machines-from-config
            get-machine-info
            get-reboot-sequence
            get-deployment-order))
;; Global configuration registry
(define *lab-config*
`((machines . ,(list sleeper-service-config
grey-area-config
reverse-proxy-config))
(groups . ,(list infrastructure-group))
(environments . ())))
;; Pure function: Get all machine configurations
(define (get-all-machines-from-config)
"Get all machine configurations"
(assoc-ref *lab-config* 'machines))
;; Pure function: Find machine by name
(define (find-machine-by-name name machines)
"Find machine configuration by name"
(find (lambda (machine)
(eq? (machine-name machine) name))
machines))
;; Integration function: Get machine info for existing lab tool
(define (get-machine-info machine-name)
"Get machine information in format expected by existing lab tool"
(let* ((machines (get-all-machines-from-config))
(machine (find-machine-by-name machine-name machines)))
(if machine
`((hostname . ,(machine-hostname machine))
(user . ,(machine-user machine))
(ssh-port . ,(assoc-ref (machine-metadata machine) 'ssh-port))
(is-local . ,(string=? (machine-hostname machine) "localhost")))
#f)))
;; Get reboot sequence for orchestrator
(define (get-reboot-sequence)
"Get the ordered reboot sequence with delays"
(let ((infra-group (car (assoc-ref *lab-config* 'groups))))
(assoc-ref (group-metadata infra-group) 'reboot-sequence)))
;; Get deployment order
(define (get-deployment-order name)
  "Get deployment order for a group"
  (let* ((groups (assoc-ref *lab-config* 'groups))
         (group (find (lambda (g) (eq? (group-name g) name)) groups)))
    (if group
        (group-deployment-order group)
        '())))
```
### 6. Integration with Existing Lab Tool
```scheme
;; Update lab/machines.scm to use new config system
(define-module (lab machines)
#:use-module (lab config-loader)
;; ...existing modules...
#:export (;; ...existing exports...
get-ssh-config
validate-machine-name
list-machines))
;; Update existing functions to use new config system
(define (get-ssh-config machine-name)
"Get SSH configuration for machine - updated to use new config"
(get-machine-info machine-name))
(define (validate-machine-name machine-name)
"Validate machine name exists in configuration"
(let ((machine-info (get-machine-info machine-name)))
(not (eq? machine-info #f))))
(define (list-machines)
"List all configured machines"
(map machine-name (get-all-machines-from-config)))
;; New function: Get machines by group
(define (get-machines-in-group group-name)
"Get all machines in a specific group"
(let ((deployment-order (get-deployment-order group-name)))
(if deployment-order
deployment-order
'())))
```
## Advanced Configuration Features
### 1. Environment-Specific Overrides
```scheme
;; config/environments/production.scm
(define production-environment
(make-environment 'production
;; Base settings
'((log-level . info)
(debug-mode . #f)
(monitoring-enabled . #t)
(backup-enabled . #t))
;; Machine-specific overrides
'((sleeper-service . ((log-level . warn)
(max-connections . 1000)))
(grey-area . ((retention-days . 90))))))
;; config/environments/development.scm
(define development-environment
(make-environment 'development
'((log-level . debug)
(debug-mode . #t)
(monitoring-enabled . #f)
(backup-enabled . #f))
'()))
```
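The records above store overrides but nothing applies them yet; below is a minimal sketch (the function name and the machine-config alist shape are assumptions, not existing code) of how an environment could be folded into a machine's resolved configuration, reusing `merge-configs`:

```scheme
;; Assumed helper: per-machine overrides from the environment take highest
;; precedence, then the machine's own config, then the environment's
;; general settings.
(define (apply-environment-overrides machine-name machine-config env)
  "Return MACHINE-CONFIG with ENV settings filled in and the
   machine-specific overrides from ENV applied on top."
  (let* ((settings  (environment-settings env))
         (overrides (or (assoc-ref (environment-overrides env) machine-name)
                        '())))
    (merge-configs (merge-configs settings machine-config) overrides)))

;; Example (hypothetical machine alist):
;; (apply-environment-overrides 'sleeper-service
;;                              '((log-level . info) (hostname . "sleeper-service.local"))
;;                              production-environment)
;; => log-level becomes warn (per-machine override), backup-enabled is #t (env setting)
```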
### 2. Dynamic Configuration Generation
```scheme
;; config/generators/auto-scaling.scm
(define (generate-web-server-configs count)
"Dynamically generate web server configurations"
(map (lambda (i)
(define-machine (string->symbol (format #f "web-~2,'0d" i))
(format #f "web-~2,'0d.local" i)
#:user "root"
#:services '(nginx)
#:groups '(web-servers)
#:template web-server-template))
(iota count 1)))
;; Usage in configuration
(define web-servers (generate-web-server-configs 3))
```
### 3. Configuration Validation
```scheme
;; config/validation.scm
(define-module (config validation)
  #:use-module (config base)
  #:use-module (srfi srfi-1)        ; every
  #:export (validate-lab-config
            check-dependencies
            validate-network-topology))
(define (validate-lab-config config)
"Comprehensive configuration validation"
(and (validate-machine-configs config)
(validate-group-dependencies config)
(validate-network-topology config)
(validate-reboot-sequence config)))
(define (validate-machine-configs config)
"Validate all machine configurations"
(every validate-config
(map machine-metadata
(assoc-ref config 'machines))))
(define (validate-reboot-sequence config)
"Validate reboot sequence dependencies"
(let ((sequence (get-reboot-sequence)))
(check-dependency-order sequence)))
```
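`check-dependency-order` is referenced above but not defined; one possible minimal implementation is sketched below. The argument shapes, an alist of `(machine . delay-seconds)` and an alist of machine dependencies, are assumptions for this sketch:

```scheme
(use-modules (srfi srfi-1))  ; every

;; A reboot sequence is consistent if every machine's delay is at least as
;; large as the delay of each machine it depends on.
(define (check-dependency-order sequence dependencies)
  (define (delay-of machine)
    (or (assoc-ref sequence machine) 0))
  (every (lambda (entry)
           (let ((machine (car entry)))
             (every (lambda (dep) (>= (delay-of machine) (delay-of dep)))
                    (or (assoc-ref dependencies machine) '()))))
         sequence))

;; Example:
;; (check-dependency-order
;;  '((sleeper-service . 0) (grey-area . 600) (reverse-proxy . 1200))
;;  '((grey-area . (sleeper-service))
;;    (reverse-proxy . (sleeper-service grey-area))))
;; => #t
```

Note that `validate-reboot-sequence` above calls it with only the sequence; with this sketch it would also need the dependency alist (or derive it from the machine records).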
## Migration Strategy
### Phase 1: Parallel Configuration System
1. Implement new config modules alongside existing YAML
2. Add config-loader integration layer
3. Update lab tool to optionally use new system
4. Validate equivalent behavior
### Phase 2: Feature Enhancement
1. Add dynamic configuration capabilities
2. Implement validation and error checking
3. Add environment-specific overrides
4. Enhance orchestrator with new features
### Phase 3: Full Migration
1. Migrate all existing configurations
2. Remove YAML dependency
3. Add advanced features (templates, inheritance)
4. Optimize performance
## Benefits of This Approach
### Developer Experience
- **Rich Configuration**: Logic and computation in config
- **Type Safety**: Catch errors at config load time
- **Reusability**: Templates and inheritance reduce duplication
- **Composability**: Mix and match configuration components
- **Validation**: Comprehensive consistency checking
### Operational Benefits
- **Dynamic Scaling**: Generate configurations programmatically
- **Environment Management**: Seamless dev/staging/prod handling
- **Dependency Tracking**: Automatic dependency resolution
- **Extensibility**: Easy to add new machine types and features
### Integration Advantages
- **Native Guile**: No external dependencies or parsers
- **Performance**: Compiled configuration, fast access
- **Debugging**: Full Guile debugging tools available
- **Flexibility**: Can mix declarative and imperative approaches
## File Structure Summary
```
/home/geir/Home-lab/
├── packages/lab-tool/
│ ├── config/
│ │ ├── base.scm # Configuration framework
│ │ ├── machines/ # Machine definitions
│ │ ├── groups/ # Group definitions
│ │ ├── environments/ # Environment configs
│ │ ├── templates/ # Reusable templates
│ │ └── validation.scm # Configuration validation
│ ├── lab/
│ │ ├── config-loader.scm # Integration layer
│ │ ├── machines.scm # Updated machine management
│ │ └── ...existing modules...
│ └── main.scm # Updated main entry point
```
This approach leverages Guile's strengths to create a powerful, flexible configuration system that grows with your homelab while maintaining the K.I.S.S principles of your current tool.


@ -0,0 +1,455 @@
# Lab-Wide Auto-Update Service with Staggered Reboots
## Overview
A NixOS service that runs on this machine (the orchestrator) updates the entire homelab using existing lab tool commands, then performs staggered reboots so you wake up to a freshly updated lab every morning.
## Service Architecture
### Central Orchestrator Approach
- Runs on this machine (the controller)
- Uses existing `lab update` and `lab deploy-all` commands
- Orchestrates staggered reboots: sleeper-service → grey-area → reverse-proxy → self
- 10-minute delays between each machine reboot
## Implementation
### 1. Nix Service Module
```nix
# /home/geir/Home-lab/nix/modules/lab-orchestrator.nix
{ config, lib, pkgs, ... }:
with lib;
let
cfg = config.services.lab-orchestrator;
labPath = "/home/geir/Home-lab";
# Machine reboot order with delays
rebootSequence = [
{ machine = "sleeper-service"; delay = 0; }
{ machine = "grey-area"; delay = 600; } # 10 minutes
{ machine = "reverse-proxy"; delay = 1200; } # 20 minutes total
{ machine = "self"; delay = 1800; } # 30 minutes total
];
orchestratorScript = pkgs.writeShellScript "lab-orchestrator" ''
#!/usr/bin/env bash
set -euo pipefail
LOG_FILE="/var/log/lab-orchestrator.log"
LAB_TOOL="${labPath}/result/bin/lab"
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S'): $1" | tee -a "$LOG_FILE"
}
# Ensure lab tool is available
if [[ ! -x "$LAB_TOOL" ]]; then
log "ERROR: Lab tool not found at $LAB_TOOL"
log "Building lab tool first..."
cd "${labPath}"
if ! nix build .#lab-tool; then
log "ERROR: Failed to build lab tool"
exit 1
fi
fi
log "=== Starting Lab-Wide Update Orchestration ==="
# Step 1: Update flake inputs
log "Updating flake inputs..."
cd "${labPath}"
if ! $LAB_TOOL update; then
log "ERROR: Failed to update flake inputs"
exit 1
fi
log "Flake inputs updated successfully"
# Step 2: Deploy to all machines
log "Deploying to all machines..."
if ! $LAB_TOOL deploy-all; then
log "ERROR: Failed to deploy to all machines"
exit 1
fi
log "Deployment completed successfully"
# Step 3: Staggered reboots
log "Starting staggered reboot sequence..."
# Reboot sleeper-service immediately
log "Rebooting sleeper-service..."
if $LAB_TOOL reboot sleeper-service; then
log "✓ sleeper-service reboot initiated"
else
log "WARNING: Failed to reboot sleeper-service"
fi
# Wait 10 minutes, then reboot grey-area
log "Waiting 10 minutes before rebooting grey-area..."
sleep 600
log "Rebooting grey-area..."
if $LAB_TOOL reboot grey-area; then
log "✓ grey-area reboot initiated"
else
log "WARNING: Failed to reboot grey-area"
fi
# Wait 10 minutes, then reboot reverse-proxy
log "Waiting 10 minutes before rebooting reverse-proxy..."
sleep 600
log "Rebooting reverse-proxy..."
if $LAB_TOOL reboot reverse-proxy; then
log "✓ reverse-proxy reboot initiated"
else
log "WARNING: Failed to reboot reverse-proxy"
fi
# Wait 10 minutes, then reboot self
log "Waiting 10 minutes before rebooting self..."
sleep 600
log "Rebooting this machine (orchestrator)..."
log "=== Lab Update Orchestration Completed ==="
# Reboot this machine
sudo systemctl reboot
'';
in
{
options.services.lab-orchestrator = {
enable = mkEnableOption "Lab orchestrator auto-update service";
schedule = mkOption {
type = types.str;
default = "02:00";
description = "Time to start lab update (HH:MM format)";
};
user = mkOption {
type = types.str;
default = "geir";
description = "User to run the lab tool as";
};
};
config = mkIf cfg.enable {
systemd.services.lab-orchestrator = {
description = "Lab-Wide Update Orchestrator";
serviceConfig = {
Type = "oneshot";
User = cfg.user;
Group = "users";
WorkingDirectory = labPath;
ExecStart = "${orchestratorScript}";
# Give it plenty of time (2 hours)
TimeoutStartSec = 7200;
};
# Ensure network is ready
after = [ "network-online.target" ];
wants = [ "network-online.target" ];
};
systemd.timers.lab-orchestrator = {
description = "Lab-Wide Update Orchestrator Timer";
timerConfig = {
OnCalendar = "*-*-* ${cfg.schedule}:00";
Persistent = true;
# No randomization - we want predictable timing
};
wantedBy = [ "timers.target" ];
};
# Ensure log directory and file exist with proper permissions
systemd.tmpfiles.rules = [
"f /var/log/lab-orchestrator.log 0644 ${cfg.user} users -"
];
};
}
```
### 2. Lab Tool Reboot Command Extension
Add reboot capability to the existing Guile lab tool:
```scheme
;; lab/reboot.scm - New module for machine reboots
(define-module (lab reboot)
  #:use-module (ice-9 format)
  #:use-module (ice-9 popen)
  #:use-module (ice-9 rdelim)       ; read-string
  #:use-module (utils logging)
  #:use-module (lab machines)
  #:export (reboot-machine))
(define (execute-ssh-command hostname command)
"Execute command on remote machine via SSH"
(let* ((ssh-cmd (format #f "ssh root@~a '~a'" hostname command))
(port (open-input-pipe ssh-cmd))
(output (read-string port)))
(close-pipe port)
output))
(define (reboot-machine machine-name)
"Reboot a specific machine via SSH"
(log-info "Attempting to reboot machine: ~a" machine-name)
(if (validate-machine-name machine-name)
(let* ((ssh-config (get-ssh-config machine-name))
(hostname (if ssh-config
(assoc-ref ssh-config 'hostname)
machine-name))
(is-local (if ssh-config
(assoc-ref ssh-config 'is-local)
#f)))
(cond
(is-local
(log-info "Rebooting local machine...")
(system "sudo systemctl reboot")
#t)
(hostname
(log-info "Rebooting ~a via SSH..." hostname)
(catch #t
(lambda ()
;; Send reboot command - connection will drop
(execute-ssh-command hostname "sudo systemctl reboot")
(log-success "Reboot command sent to ~a" machine-name)
#t)
(lambda (key . args)
;; SSH connection drop is expected during reboot
(if (string-contains (format #f "~a" args) "Connection")
(begin
(log-info "Connection dropped (expected during reboot)")
#t)
(begin
(log-error "Failed to reboot ~a: ~a" machine-name args)
#f)))))
(else
(log-error "No hostname found for machine: ~a" machine-name)
#f)))
(begin
(log-error "Invalid machine name: ~a" machine-name)
#f)))
```
### 3. CLI Integration
Update the main.scm dispatcher to include reboot command:
```scheme
;; main.scm (additions to command dispatcher)
(use-modules ;; ...existing modules...
(lab reboot))
;; Add to dispatch-command function
(define (dispatch-command command args)
"Dispatch command with appropriate handler"
(match command
;; ...existing cases...
('reboot
(if (null? args)
(begin
(log-error "reboot command requires machine name")
(format #t "Usage: lab reboot <machine>\n"))
(let ((result (reboot-machine (car args))))
(if result
(log-success "Reboot initiated")
(log-error "Reboot failed")))))
;; ...rest of existing cases...
))
;; Update help text to include reboot command
(define (get-help-text)
"Pure function returning help text"
"Home Lab Tool - K.I.S.S Refactored Edition
USAGE: lab <command> [args...]
COMMANDS:
status Show infrastructure status
machines List all machines
deploy <machine> Deploy configuration to machine
deploy-all Deploy to all machines
update Update flake inputs
health [machine] Check machine health (all if no machine specified)
ssh <machine> SSH to machine
reboot <machine> Reboot machine via SSH
test-modules Test modular implementation
help Show this help
EXAMPLES:
lab status
lab machines
lab deploy congenital-optimist
lab deploy-all
lab update
lab health
lab health sleeper-service
lab ssh sleeper-service
lab reboot sleeper-service
lab test-modules
")
### 4. Configuration
Enable the service on this machine (the orchestrator):
```nix
# hosts/this-machine/configuration.nix
{
imports = [
../../nix/modules/lab-orchestrator.nix
];
services.lab-orchestrator = {
enable = true;
schedule = "02:00"; # 2 AM start
user = "geir";
};
}
```
## Timeline Breakdown
### Nightly Execution (Starting 2:00 AM)
```
02:00 - Start orchestration
02:00-02:15 - Update flake inputs (lab update)
02:15-02:45 - Deploy to all machines (lab deploy-all)
02:45 - Reboot sleeper-service
02:55 - Reboot grey-area (10 min later)
03:05 - Reboot reverse-proxy (10 min later)
03:15 - Reboot orchestrator machine (10 min later)
03:20 - All machines back online and updated
```
### Total Duration: ~1 hour 20 minutes
- Deployment: ~30 minutes
- Staggered reboots: ~50 minutes
- Everything done by 3:20 AM
## Safety Features
### Logging and Monitoring
```bash
# Check orchestrator logs
sudo journalctl -u lab-orchestrator.service -f
# Check orchestrator log file
tail -f /var/log/lab-orchestrator.log
# Check timer status
systemctl status lab-orchestrator.timer
```
### Manual Controls
```bash
# Start update manually
sudo systemctl start lab-orchestrator.service
# Disable automatic updates
sudo systemctl disable lab-orchestrator.timer
# Check when next run is scheduled
systemctl list-timers lab-orchestrator.timer
```
### Recovery Options
```bash
# If orchestration fails, machines can be individually managed
lab deploy sleeper-service
lab deploy grey-area
lab deploy reverse-proxy
# Emergency reboot sequence
lab reboot sleeper-service
sleep 600
lab reboot grey-area
sleep 600
lab reboot reverse-proxy
```
## Machine Configuration Requirements
### SSH Key Setup
Ensure this machine can SSH to all target machines:
```bash
# Test connectivity
ssh root@sleeper-service "echo 'Connection OK'"
ssh root@grey-area "echo 'Connection OK'"
ssh root@reverse-proxy "echo 'Connection OK'"
```
### Lab Tool Configuration
Ensure lab.yaml includes all machines:
```yaml
machines:
sleeper-service:
host: sleeper-service.local
user: root
grey-area:
host: grey-area.local
user: root
reverse-proxy:
host: reverse-proxy.local
user: root
```
## Deployment Steps
### 1. Create the Service Module
Add the Nix module file and import it
### 2. Extend Lab Tool
Add reboot command functionality
### 3. Test Components
```bash
# Build the lab tool first
cd /home/geir/Home-lab
nix build .#lab-tool
# Test lab commands work
./result/bin/lab update
./result/bin/lab deploy-all
./result/bin/lab machines
./result/bin/lab reboot sleeper-service # Test reboot (be careful!)
```
### 4. Enable Service
```bash
# Add to configuration and rebuild
nixos-rebuild switch
# Verify timer is active
systemctl status lab-orchestrator.timer
```
### 5. Monitor First Run
```bash
# Watch the logs during first execution
sudo journalctl -u lab-orchestrator.service -f
```
## Benefits
### Morning Routine
- Wake up to fully updated homelab
- All services running latest versions
- No manual intervention needed
- Predictable update schedule
### Reliability
- Uses existing, tested lab tool commands
- Proper error handling and logging
- Graceful degradation if individual reboots fail
- Easy to disable or modify timing
### Visibility
- Comprehensive logging of entire process
- Clear timestamps for each phase
- Easy troubleshooting if issues occur
This gives you the "wake up to fresh lab" experience with minimal complexity, leveraging your existing infrastructure!


@ -325,6 +325,7 @@ Netdata provides a comprehensive REST API that makes it perfect for integrating
**Base URL**: `http://localhost:19999/api/v1/`
**Primary Endpoints**:
- `/api/v1/data` - Query time-series data
- `/api/v1/charts` - Get available charts
- `/api/v1/allmetrics` - Get all metrics in shell-friendly format
@ -354,6 +355,7 @@ Netdata provides a comprehensive REST API that makes it perfect for integrating
### API Query Examples
#### Basic Data Query
```bash
# Get CPU system data for the last 60 seconds
curl "http://localhost:19999/api/v1/data?chart=system.cpu&after=-60&dimensions=system"
@ -386,6 +388,7 @@ curl "http://localhost:19999/api/v1/data?chart=system.cpu&after=-60&dimensions=s
```
#### Available Charts Discovery
```bash
# Get all available charts
curl "http://localhost:19999/api/v1/charts"
@ -398,18 +401,21 @@ curl "http://localhost:19999/api/v1/charts"
```
#### Memory Usage Example
```bash
# Get memory usage data with specific grouping
curl "http://localhost:19999/api/v1/data?chart=system.ram&after=-300&points=60&group=average"
```
#### Network Interface Metrics
```bash
# Get network traffic for specific interface
curl "http://localhost:19999/api/v1/data?chart=net.eth0&after=-60&dimensions=received,sent"
```
#### All Metrics in Shell Format
```bash
# Perfect for scripting and automation
curl "http://localhost:19999/api/v1/allmetrics"
@ -438,6 +444,7 @@ NETDATA_SYSTEM_RAM_USED=4096
### Web Dashboard Integration Strategies
#### 1. Direct AJAX Calls
```javascript
// Fetch CPU data for dashboard widget
fetch('http://localhost:19999/api/v1/data?chart=system.cpu&after=-60&points=60')
@ -449,6 +456,7 @@ fetch('http://localhost:19999/api/v1/data?chart=system.cpu&after=-60&points=60')
```
#### 2. Server-Side Proxy
```javascript
// Proxy through your web server to avoid CORS issues
fetch('/api/netdata/system.cpu?after=-60')
@ -457,6 +465,7 @@ fetch('/api/netdata/system.cpu?after=-60')
```
#### 3. Real-Time Updates
```javascript
// Poll for updates every second
setInterval(() => {
@ -537,23 +546,27 @@ setInterval(() => {
### Integration Considerations
#### 1. **CORS Handling**
- Netdata allows cross-origin requests by default
- For production, consider proxying through your web server
- Use server-side API calls for sensitive environments
#### 2. **Performance Optimization**
- Cache frequently accessed chart definitions
- Use appropriate `points` parameter to limit data transfer
- Implement efficient polling strategies
- Consider WebSocket connections for real-time updates
#### 3. **Data Processing**
- Netdata returns timestamps and values as arrays
- Convert to your chart library's expected format (see the sketch after this list)
- Handle missing data points gracefully
- Implement data aggregation for longer time ranges
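As a concrete example of that conversion, here is a minimal Guile sketch (not existing code, e.g. for server-side processing) that fetches a chart and reshapes the response into `(timestamp . value)` pairs. It assumes the guile-json library with its defaults (string keys, vectors for arrays) and that `curl` is available:

```scheme
(use-modules (json)            ; json-string->scm (guile-json, assumed available)
             (ice-9 popen)
             (ice-9 rdelim)
             (srfi srfi-1))    ; filter-map

(define (fetch-chart-data chart seconds)
  "Query the local Netdata API for CHART over the last SECONDS and parse the JSON."
  (let* ((url  (format #f "http://localhost:19999/api/v1/data?chart=~a&after=-~a"
                       chart seconds))
         (port (open-input-pipe (format #f "curl -s '~a'" url)))
         (body (read-string port)))
    (close-pipe port)
    (json-string->scm body)))

(define (chart->points response)
  "Turn Netdata's [timestamp v1 v2 ...] rows into (timestamp . first-value)
   pairs, dropping rows whose value is missing."
  (filter-map (lambda (row)
                (let ((ts (vector-ref row 0))
                      (v  (vector-ref row 1)))
                  (and (number? v) (cons ts v))))
              (vector->list (assoc-ref response "data"))))

;; (chart->points (fetch-chart-data "system.cpu" 60))
```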
#### 4. **Error Handling**
```javascript
async function safeNetdataFetch(endpoint) {
try {
@ -589,9 +602,9 @@ class MultiNodeDashboard {
### API Documentation Resources
- **Swagger Documentation**: <https://learn.netdata.cloud/api>
- **OpenAPI Spec**: <https://raw.githubusercontent.com/netdata/netdata/master/src/web/api/netdata-swagger.yaml>
- **Query Documentation**: <https://learn.netdata.cloud/docs/developer-and-contributor-corner/rest-api/queries/>
### Conclusion


@ -0,0 +1,382 @@
# Leveraging NixOS Configuration vs Custom Implementation
## Current Situation Analysis
We're at risk of reimplementing significant functionality that NixOS already provides:
### What NixOS Already Handles
- **Machine Configuration**: Complete system configuration as code
- **Service Management**: Declarative service definitions
- **Deployment**: `nixos-rebuild` with atomic updates
- **Validation**: Configuration validation at build time
- **Dependencies**: Service dependency management
- **Environments**: Multiple configurations per machine
- **Templates**: NixOS modules for reusable configuration
- **Type Safety**: Nix language type system
- **Inheritance**: Module imports and overrides
### What We're Duplicating
- Machine metadata and properties
- Service definitions and health checks
- Deployment strategies and validation
- Configuration inheritance and composition
- Environment-specific overrides
## Better Approach: NixOS-Native Strategy
### Core Principle
**Let NixOS handle configuration, let lab tool handle orchestration**
### Revised Architecture
#### 1. NixOS Handles Configuration
```nix
# hosts/sleeper-service/configuration.nix
{ config, lib, pkgs, ... }:
{
# NixOS handles all the configuration
services.nginx.enable = true;
services.postgresql.enable = true;
# Lab-specific metadata as NixOS options
lab.machine = {
role = "application-server";
groups = [ "infrastructure" "database" ];
rebootOrder = 1;
dependencies = [ ];
healthChecks = [
{ type = "http"; url = "http://localhost:80/health"; }
{ type = "tcp"; port = 5432; }
];
orchestration = {
deployStrategy = "rolling";
rebootDelay = 0;
criticalityLevel = "high";
};
};
}
```
#### 2. Lab Tool Handles Orchestration
```scheme
;; lab tool queries NixOS configuration, doesn't define it
(define (get-machine-metadata machine-name)
"Extract lab metadata from NixOS configuration"
(let ((config-path (format #f "hosts/~a/configuration.nix" machine-name)))
(extract-lab-metadata-from-nix-config config-path)))
(define (get-reboot-sequence)
"Get reboot sequence from NixOS configurations"
(let ((machines (get-all-machines)))
(sort machines
(lambda (a b)
(< (get-reboot-order a) (get-reboot-order b))))))
```
### Implementation Strategy
#### 1. Create NixOS Lab Module
```nix
# nix/modules/lab-machine.nix
{ config, lib, pkgs, ... }:
with lib;
let
cfg = config.lab.machine;
in
{
options.lab.machine = {
role = mkOption {
type = types.str;
description = "Machine role in the lab";
example = "web-server";
};
groups = mkOption {
type = types.listOf types.str;
default = [];
description = "Groups this machine belongs to";
};
rebootOrder = mkOption {
type = types.int;
description = "Order in reboot sequence (lower = earlier)";
};
dependencies = mkOption {
type = types.listOf types.str;
default = [];
description = "Machines this depends on";
};
healthChecks = mkOption {
type = types.listOf (types.submodule {
options = {
type = mkOption {
type = types.enum [ "http" "tcp" "command" ];
description = "Type of health check";
};
url = mkOption {
type = types.nullOr types.str;
default = null;
description = "URL for HTTP health checks";
};
port = mkOption {
type = types.nullOr types.int;
default = null;
description = "Port for TCP health checks";
};
command = mkOption {
type = types.nullOr types.str;
default = null;
description = "Command for command-based health checks";
};
};
});
default = [];
description = "Health check configurations";
};
orchestration = mkOption {
type = types.submodule {
options = {
deployStrategy = mkOption {
type = types.enum [ "rolling" "blue-green" "recreate" ];
default = "rolling";
description = "Deployment strategy";
};
rebootDelay = mkOption {
type = types.int;
default = 600; # 10 minutes
description = "Delay in seconds before this machine reboots";
};
criticalityLevel = mkOption {
type = types.enum [ "low" "medium" "high" "critical" ];
default = "medium";
description = "Service criticality level";
};
};
};
default = {};
description = "Orchestration configuration";
};
};
config = {
# Generate machine metadata file for lab tool consumption
environment.etc."lab-machine-metadata.json".text = builtins.toJSON {
inherit (cfg) role groups rebootOrder dependencies healthChecks orchestration;
hostname = config.networking.hostName;
services = builtins.attrNames (lib.filterAttrs (n: v: v.enable or false) config.services);
};
};
}
```
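The module writes its metadata to `/etc/lab-machine-metadata.json` on every machine; below is a minimal sketch (the helper name and guile-json usage are assumptions) of how the lab tool could read that file over SSH instead of evaluating Nix expressions:

```scheme
(use-modules (ice-9 popen)
             (ice-9 rdelim)
             (json))   ; guile-json, assumed available

(define (read-machine-metadata hostname)
  "Fetch and parse /etc/lab-machine-metadata.json from HOSTNAME via SSH.
   Returns the parsed JSON (alists with string keys, vectors for arrays)
   or #f if nothing came back."
  (let* ((cmd  (format #f "ssh root@~a cat /etc/lab-machine-metadata.json" hostname))
         (port (open-input-pipe cmd))
         (body (read-string port)))
    (close-pipe port)
    (if (string-null? body)
        #f
        (json-string->scm body))))

;; Example:
;; (assoc-ref (read-machine-metadata "sleeper-service.local") "rebootOrder") => 1
```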
#### 2. Simplified Lab Tool
```scheme
;; lab/nix-integration.scm - NixOS integration module
(define-module (lab nix-integration)
  #:use-module (ice-9 format)
  #:use-module (ice-9 popen)
  #:use-module (ice-9 rdelim)       ; read-string
  #:use-module (json)
  #:export (get-machine-metadata-from-nix
            get-all-nix-machines
            get-reboot-sequence-from-nix
            build-nix-config
            evaluate-nix-expr))
(define (evaluate-nix-expr expr)
"Evaluate a Nix expression and return the result"
(let* ((cmd (format #f "nix eval --json --expr '~a'" expr))
(port (open-input-pipe cmd))
(output (read-string port)))
(close-pipe port)
(if (string-null? output)
#f
(json-string->scm output))))
(define (get-machine-metadata-from-nix machine-name)
"Get machine metadata from NixOS configuration"
(let* ((expr (format #f
"(import ./hosts/~a/configuration.nix {}).lab.machine // { hostname = \"~a\"; }"
machine-name machine-name))
(metadata (evaluate-nix-expr expr)))
metadata))
(define (get-all-nix-machines)
"Get all machines by scanning hosts directory"
(let* ((hosts-expr "(builtins.attrNames (builtins.readDir ./hosts))")
(hosts (evaluate-nix-expr hosts-expr)))
(if hosts hosts '())))
(define (get-reboot-sequence-from-nix)
"Get reboot sequence from NixOS configurations"
(let* ((machines (get-all-nix-machines))
(machine-data (map (lambda (machine)
(cons machine (get-machine-metadata-from-nix machine)))
machines)))
(sort machine-data
(lambda (a b)
(< (assoc-ref (cdr a) 'rebootOrder)
(assoc-ref (cdr b) 'rebootOrder))))))
```
#### 3. Updated Machine Configurations
```nix
# hosts/sleeper-service/configuration.nix
{ config, lib, pkgs, ... }:
{
imports = [
../../nix/modules/lab-machine.nix
# ... other imports
];
# Standard NixOS configuration
services.nginx = {
enable = true;
# ... nginx config
};
services.postgresql = {
enable = true;
# ... postgresql config
};
# Lab orchestration metadata
lab.machine = {
role = "application-server";
groups = [ "infrastructure" "backend" ];
rebootOrder = 1;
dependencies = [ ];
healthChecks = [
{
type = "http";
url = "http://localhost:80/health";
}
{
type = "tcp";
port = 5432;
}
];
orchestration = {
deployStrategy = "rolling";
rebootDelay = 0;
criticalityLevel = "high";
};
};
}
```
#### 4. Lab Tool Integration
```scheme
;; Update main.scm to use NixOS integration
(use-modules ;; ...existing modules...
(lab nix-integration))
(define (cmd-machines)
"List all configured machines from NixOS"
(log-info "Listing machines from NixOS configurations...")
(let ((machines (get-all-nix-machines)))
(format #t "Configured Machines (from NixOS):\n")
(for-each (lambda (machine)
(let ((metadata (get-machine-metadata-from-nix machine)))
(format #t " ~a (~a) - ~a\n"
machine
(assoc-ref metadata 'role)
(string-join (assoc-ref metadata 'groups) ", "))))
machines)))
(define (cmd-orchestrator-sequence)
"Show the orchestrated reboot sequence"
(log-info "Getting reboot sequence from NixOS configurations...")
(let ((sequence (get-reboot-sequence-from-nix)))
(format #t "Reboot Sequence:\n")
(for-each (lambda (machine-data)
(let ((machine (car machine-data))
(metadata (cdr machine-data)))
(format #t " ~a. ~a (delay: ~a seconds)\n"
(assoc-ref metadata 'rebootOrder)
machine
                      (assoc-ref (assoc-ref metadata 'orchestration) 'rebootDelay))))
sequence)))
```
### Benefits of This Approach
#### 1. Leverage NixOS Strengths
- **Configuration Management**: NixOS handles all system configuration
- **Validation**: Nix language validates configuration at build time
- **Atomic Updates**: `nixos-rebuild` provides atomic system updates
- **Rollbacks**: Nix generations for automatic rollback
- **Reproducibility**: Identical configurations across environments
#### 2. Lab Tool Focus
- **Orchestration**: Coordinate updates across multiple machines
- **Sequencing**: Handle reboot ordering and dependencies
- **Monitoring**: Health checks and status reporting
- **Communication**: SSH coordination and logging
#### 3. Reduced Complexity
- **No Duplication**: Don't reimplement what NixOS provides
- **Native Integration**: Work with NixOS's natural patterns
- **Maintainability**: Less custom code to maintain
- **Ecosystem**: Leverage existing NixOS modules and community
### Migration Strategy
#### Phase 1: Add Lab Module to NixOS
1. Create `lab-machine.nix` module
2. Add to each machine configuration
3. Test metadata extraction
#### Phase 2: Update Lab Tool
1. Replace custom config with NixOS integration
2. Update commands to read from NixOS configs
3. Test orchestration with new metadata
#### Phase 3: Enhanced Features
1. Add more sophisticated orchestration
2. Integrate with NixOS deployment tools
3. Add monitoring and alerting
### Example: Simplified Orchestrator
```nix
# The orchestrator service becomes much simpler
systemd.services.lab-orchestrator = {
script = ''
# Update flake
nix flake update
# Get reboot sequence from NixOS configs
SEQUENCE=$(lab get-reboot-sequence)
# Deploy to all machines
lab deploy-all
# Execute reboot sequence
for machine_delay in $SEQUENCE; do
machine=$(echo $machine_delay | cut -d: -f1)
delay=$(echo $machine_delay | cut -d: -f2)
sleep $delay
lab reboot $machine
done
'';
};
```
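The shell loop above expects `$SEQUENCE` to contain one `machine:delay` token per machine; here is a small sketch (the output format is an assumption, and the symbol-keyed metadata convention follows the integration module above) of a `lab get-reboot-sequence` handler that prints the sequence in that shape:

```scheme
;; Assumed subcommand handler, reusing get-reboot-sequence-from-nix from
;; (lab nix-integration) above.
(define (cmd-get-reboot-sequence)
  "Print the reboot sequence as machine:delay lines for shell consumption."
  (for-each
   (lambda (machine-data)
     (let* ((machine  (car machine-data))
            (metadata (cdr machine-data))
            (orch     (assoc-ref metadata 'orchestration))
            (delay    (or (and orch (assoc-ref orch 'rebootDelay)) 0)))
       (format #t "~a:~a\n" machine delay)))
   (get-reboot-sequence-from-nix)))

;; Example output:
;;   sleeper-service:0
;;   grey-area:600
;;   reverse-proxy:1200
```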
## Conclusion
By leveraging NixOS's existing configuration system instead of reinventing it, we get:
- **Less code to maintain**
- **Better integration with the Nix ecosystem**
- **Validation and type safety from Nix**
- **Standard NixOS patterns and practices**
- **Focus on actual orchestration needs**
The lab tool becomes a **coordination layer** rather than a configuration management system, which is exactly what you need for homelab orchestration.


@ -0,0 +1,354 @@
# Simple Lab Auto-Update Service Plan
## Overview
A simple automated update service for the homelab that runs nightly via cron, updates Nix flakes, rebuilds systems, and reboots machines. Designed for homelab environments where uptime is only critical during day hours.
## Current Lab Tool Analysis
Based on the existing lab tool structure, we need to integrate with:
- Command structure and CLI interface
- Machine inventory and management
- Configuration handling
- Logging and status reporting
## Simple Architecture
### Core Components
1. **Nix Service Module** - NixOS service definition for the auto-updater
2. **Lab Tool Integration** - New commands in the existing lab tool
3. **Cron Scheduling** - Simple nightly execution
4. **Update Script** - Core logic for update/reboot cycle
## Implementation Plan
### 1. Nix Service Module
Create a NixOS service that integrates with the lab tool:
```nix
# /home/geir/Home-lab/nix/modules/lab-auto-update.nix
{ config, lib, pkgs, ... }:
with lib;
let
cfg = config.services.lab-auto-update;
labTool = pkgs.writeShellScript "lab-auto-update" ''
#!/usr/bin/env bash
set -euo pipefail
LOG_FILE="/var/log/lab-auto-update.log"
echo "$(date): Starting auto-update" >> "$LOG_FILE"
# Update flake
lab update system --self 2>&1 | tee -a "$LOG_FILE"
# Reboot if configured
if [[ "${lib.boolToString cfg.autoReboot}" == "true" ]]; then
echo "$(date): Rebooting system" >> "$LOG_FILE"
systemctl reboot
fi
'';
in
{
options.services.lab-auto-update = {
enable = mkEnableOption "Lab auto-update service";
schedule = mkOption {
type = types.str;
default = "02:00";
description = "Time to run updates (HH:MM format)";
};
autoReboot = mkOption {
type = types.bool;
default = true;
description = "Whether to automatically reboot after updates";
};
flakePath = mkOption {
type = types.str;
default = "/home/geir/Home-lab";
description = "Path to the lab flake";
};
};
config = mkIf cfg.enable {
systemd.services.lab-auto-update = {
description = "Lab Auto-Update Service";
serviceConfig = {
Type = "oneshot";
User = "root";
ExecStart = "${labTool}";
};
};
systemd.timers.lab-auto-update = {
description = "Lab Auto-Update Timer";
timerConfig = {
OnCalendar = "daily";
Persistent = true;
RandomizedDelaySec = "30m";
};
wantedBy = [ "timers.target" ];
};
# Ensure log directory exists
systemd.tmpfiles.rules = [
"d /var/log 0755 root root -"
];
};
}
```
### 2. Lab Tool Commands
Add new commands to the existing lab tool:
```python
# lab/commands/update_system.py
class UpdateSystemCommand:
def __init__(self, lab_config):
self.lab_config = lab_config
self.flake_path = lab_config.get('flake_path', '/home/geir/Home-lab')
def update_self(self):
"""Update the current system using Nix flake"""
try:
# Update flake inputs
self._run_command(['nix', 'flake', 'update'], cwd=self.flake_path)
# Rebuild system
hostname = self._get_hostname()
self._run_command([
'nixos-rebuild', 'switch',
'--flake', f'{self.flake_path}#{hostname}'
])
print("System updated successfully")
return True
except Exception as e:
print(f"Update failed: {e}")
return False
def schedule_reboot(self, delay_minutes=1):
"""Schedule a system reboot"""
self._run_command(['shutdown', '-r', f'+{delay_minutes}'])
def _get_hostname(self):
import socket
return socket.gethostname()
def _run_command(self, cmd, cwd=None):
import subprocess
result = subprocess.run(cmd, cwd=cwd, check=True,
capture_output=True, text=True)
return result.stdout
```
### 3. CLI Integration
Extend the main lab tool CLI:
```python
# lab/cli.py (additions)
@cli.group()
def update():
"""System update commands"""
pass
@update.command('system')
@click.option('--self', 'update_self', is_flag=True,
help='Update the current system')
@click.option('--reboot', is_flag=True,
help='Reboot after update')
def update_system(update_self, reboot):
"""Update system using Nix flake"""
if update_self:
updater = UpdateSystemCommand(config)
success = updater.update_self()
if success and reboot:
updater.schedule_reboot()
```
### 4. Simple Configuration
Add update settings to lab configuration:
```yaml
# lab.yaml (additions)
auto_update:
enabled: true
schedule: "02:00"
auto_reboot: true
flake_path: "/home/geir/Home-lab"
log_retention_days: 30
```
## Deployment Strategy
### Per-Machine Setup
Each machine gets the service enabled in its Nix configuration:
```nix
# hosts/<hostname>/configuration.nix
{
imports = [
../../nix/modules/lab-auto-update.nix
];
services.lab-auto-update = {
enable = true;
schedule = "02:00";
autoReboot = true;
};
}
```
### Staggered Scheduling
Different machines can have different update times to avoid all rebooting simultaneously:
```nix
# Example configurations
# db-server.nix
services.lab-auto-update.schedule = "02:00";
# web-servers.nix
services.lab-auto-update.schedule = "02:30";
# dev-machines.nix
services.lab-auto-update.schedule = "03:00";
```
## Implementation Steps
### Step 1: Create Nix Module
- Create the service module file
- Add to common imports
- Test on single machine
### Step 2: Extend Lab Tool
- Add UpdateSystemCommand class
- Integrate CLI commands
- Test update functionality
### Step 3: Deploy Gradually
- Enable on non-critical machines first
- Monitor logs and behavior
- Roll out to all machines
### Step 4: Monitoring Setup
- Log rotation configuration
- Status reporting
- Alert on failures
## Safety Features
### Pre-Update Checks
```bash
# Basic health check before update
if ! systemctl is-system-running --quiet; then
echo "System not healthy, skipping update"
exit 1
fi
# Check disk space
if [[ $(df / | tail -1 | awk '{print $5}' | sed 's/%//') -gt 90 ]]; then
echo "Low disk space, skipping update"
exit 1
fi
```
### Rollback on Boot Failure
```nix
# Enable automatic rollback
boot.loader.grub.configurationLimit = 10;
systemd.services."rollback-on-failure" = {
description = "Rollback on boot failure";
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
};
script = ''
# This runs if we successfully boot
# Clear any failure flags
rm -f /var/lib/update-failed
'';
wantedBy = [ "multi-user.target" ];
};
```
## Monitoring and Logging
### Log Management
```nix
# Add to service configuration
services.logrotate.settings.lab-auto-update = {
files = "/var/log/lab-auto-update.log";
rotate = 30;
daily = true;
compress = true;
missingok = true;
notifempty = true;
};
```
### Status Reporting
```python
# lab/commands/status.py additions
import os
import subprocess

def update_status():
"""Show auto-update status"""
log_file = "/var/log/lab-auto-update.log"
if os.path.exists(log_file):
# Parse last update attempt
with open(log_file, 'r') as f:
lines = f.readlines()
# Show last few entries
for line in lines[-10:]:
print(line.strip())
# Show service status
result = subprocess.run(['systemctl', 'status', 'lab-auto-update.timer'],
capture_output=True, text=True)
print(result.stdout)
```
## Testing Plan
### Local Testing
1. Test lab tool commands manually
2. Test service creation and timer
3. Verify logging works
4. Test with dry-run options
### Gradual Rollout
1. Enable on development machine first
2. Monitor for one week
3. Enable on infrastructure machines
4. Finally enable on critical services
## Future Enhancements
### Simple Additions
- Email notifications on failure
- Webhook status reporting
- Update statistics tracking
- Configuration validation
### Advanced Features
- Update coordination between machines
- Dependency-aware scheduling
- Emergency update capabilities
- Integration with monitoring systems
## File Structure
```
/home/geir/Home-lab/
├── nix/modules/lab-auto-update.nix
├── lab/commands/update_system.py
├── lab/cli.py (modified)
└── scripts/
├── update-health-check.sh
└── emergency-rollback.sh
```
This plan provides a simple, reliable auto-update system that leverages the existing lab tool infrastructure while keeping complexity minimal for a homelab environment.


@ -0,0 +1,402 @@
# Staggered Machine Update and Reboot System Research
## Overview
Research into implementing an automated system for updating and rebooting all lab machines in a staggered fashion using Nix, cronjobs, and our existing lab tool infrastructure.
## Goals
- Minimize downtime by updating machines in waves
- Ensure system stability with gradual rollouts
- Leverage Nix's atomic updates and rollback capabilities
- Integrate with existing lab tool for orchestration
- Provide monitoring and failure recovery
## Architecture Components
### 1. Update Controller
- Central orchestrator running on management node
- Maintains machine groups and update schedules
- Coordinates staggered execution
- Monitors update progress and health
### 2. Machine Groups
```
Group 1: Non-critical services (dev environments, testing)
Group 2: Infrastructure services (monitoring, logging)
Group 3: Critical services (databases, core applications)
Group 4: Management nodes (controllers, orchestrators)
```
### 3. Nix Integration
- Use `nixos-rebuild switch` for atomic updates
- Leverage Nix generations for rollback capability
- Update channels/flakes before rebuilding
- Validate configuration before applying
### 4. Lab Tool Integration
- Extend lab tool with update management commands
- Machine inventory and grouping
- Health check integration
- Status reporting and logging
## Implementation Strategy
### Phase 1: Basic Staggered Updates
```bash
# Example workflow per machine group
lab update prepare --group=dev
lab update execute --group=dev --wait-for-completion
lab update verify --group=dev
lab update prepare --group=infrastructure
# Continue with next group...
```
### Phase 2: Enhanced Orchestration
- Dependency-aware scheduling
- Health checks before proceeding to next group
- Automatic rollback on failures
- Notification system
### Phase 3: Advanced Features
- Blue-green deployments for critical services
- Canary releases
- Integration with monitoring systems
- Custom update policies per service
## Cronjob Design
### Master Cron Schedule
```cron
# Weekly full system update - Sundays at 2 AM
0 2 * * 0 /home/geir/Home-lab/scripts/staggered-update.sh
# Daily security updates for critical machines
0 3 * * * /home/geir/Home-lab/scripts/security-update.sh --group=critical
# Health check and cleanup
0 1 * * * /home/geir/Home-lab/scripts/update-health-check.sh
```
### Update Script Structure
```bash
#!/usr/bin/env bash
# staggered-update.sh
set -euo pipefail
# Configuration
GROUPS=("dev" "infrastructure" "critical" "management")
STAGGER_DELAY=30m
MAX_PARALLEL=3
# Log setup
LOG_DIR="/var/log/lab-updates"
LOG_FILE="$LOG_DIR/update-$(date +%Y%m%d-%H%M%S).log"
exec > >(tee -a "$LOG_FILE") 2>&1
for group in "${GROUPS[@]}"; do
echo "Starting update for group: $group"
# Pre-update checks
lab health-check --group="$group" || {
echo "Health check failed for $group, skipping"
continue
}
# Update Nix channels/flakes
lab update prepare --group="$group"
# Execute updates with parallelism control
lab update execute --group="$group" --parallel="$MAX_PARALLEL"
# Verify updates
lab update verify --group="$group" || {
echo "Verification failed for $group, initiating rollback"
lab update rollback --group="$group"
# Send alert
lab notify --level=error --message="Update failed for $group, rolled back"
exit 1
}
echo "Group $group updated successfully, waiting $STAGGER_DELAY"
sleep "$STAGGER_DELAY"
done
echo "All groups updated successfully"
lab notify --level=info --message="Staggered update completed successfully"
```
## Nix Configuration Management
### Centralized Configuration
```nix
# /home/geir/Home-lab/nix/update-config.nix
{
updateGroups = {
dev = {
machines = [ "dev-01" "dev-02" "test-env" ];
updatePolicy = "aggressive";
maintenanceWindow = "02:00-06:00";
allowReboot = true;
};
infrastructure = {
machines = [ "monitor-01" "log-server" "backup-01" ];
updatePolicy = "conservative";
maintenanceWindow = "03:00-05:00";
allowReboot = true;
dependencies = [ "dev" ];
};
critical = {
machines = [ "db-primary" "web-01" "web-02" ];
updatePolicy = "manual-approval";
maintenanceWindow = "04:00-05:00";
allowReboot = false; # Requires manual reboot
dependencies = [ "infrastructure" ];
};
};
updateSettings = {
maxParallel = 3;
healthCheckTimeout = 300;
rollbackOnFailure = true;
notificationChannels = [ "email" "discord" ];
};
}
```
### Machine-Specific Update Configurations
```nix
# On each machine: /etc/nixos/update-config.nix
{
services.lab-updater = {
enable = true;
group = "infrastructure";
preUpdateScript = ''
# Stop non-critical services
systemctl stop some-service
'';
postUpdateScript = ''
# Restart services and verify
systemctl start some-service
curl -f http://localhost:8080/health
'';
rollbackScript = ''
# Custom rollback procedures
systemctl stop some-service
nixos-rebuild switch --rollback
systemctl start some-service
'';
};
}
```
## Lab Tool Extensions
### New Commands
```bash
# Update management
lab update prepare [--group=GROUP] [--machine=MACHINE]
lab update execute [--group=GROUP] [--parallel=N] [--dry-run]
lab update verify [--group=GROUP]
lab update rollback [--group=GROUP] [--to-generation=N]
lab update status [--group=GROUP]
# Health and monitoring
lab health-check [--group=GROUP] [--timeout=SECONDS]
lab update-history [--group=GROUP] [--days=N]
lab notify [--level=LEVEL] [--message=MSG] [--channel=CHANNEL]
# Configuration
lab update-config show [--group=GROUP]
lab update-config set [--group=GROUP] [--key=KEY] [--value=VALUE]
```
### Integration Points
```python
# lab/commands/update.py
class UpdateCommand:
def prepare(self, group=None, machine=None):
"""Prepare machines for updates"""
# Update Nix channels/flakes
# Pre-update health checks
# Download packages
def execute(self, group=None, parallel=1, dry_run=False):
"""Execute updates on machines"""
# Run nixos-rebuild switch
# Monitor progress
# Handle failures
def verify(self, group=None):
"""Verify updates completed successfully"""
# Check system health
# Verify services
# Compare generations
```
## Monitoring and Alerting
### Health Checks
- Service availability checks (see the sketch after this list)
- Resource usage monitoring
- System log analysis
- Network connectivity tests
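These checks gate each step of the staggered rollout; below is a minimal sketch of a per-machine check runner (helper names are assumptions, `curl` and `nc` are assumed available), using the `(http URL)` / `(tcp PORT)` / `(command CMD)` list shape from the configuration notes earlier in these documents:

```scheme
(use-modules (ice-9 match)
             (srfi srfi-1))   ; every

;; Assumed helper: run a shell command and report success.
(define (command-succeeds? cmd)
  (zero? (status:exit-val (system cmd))))

(define (check-passes? check)
  "Evaluate one health-check form."
  (match check
    (('http url)    (command-succeeds?
                     (format #f "curl -fsS --max-time 5 '~a' > /dev/null" url)))
    (('tcp port)    (command-succeeds? (format #f "nc -z localhost ~a" port)))
    (('command cmd) (command-succeeds? cmd))
    (_ #f)))

(define (machine-healthy? checks)
  "All checks must pass before the updater moves on to the next group."
  (every check-passes? checks))

;; (machine-healthy? '((http "http://localhost:80/health") (tcp 5432))) => #t/#f
```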
### Alerting Triggers
- Update failures
- Health check failures
- Rollback events
- Long-running updates
### Notification Channels
- Email notifications
- Discord/Slack integration
- Dashboard updates
- Log aggregation
## Safety Mechanisms
### Pre-Update Validation
- Configuration syntax checking
- Dependency verification
- Resource availability checks
- Backup verification
### During Update
- Progress monitoring
- Timeout handling
- Partial failure recovery
- Emergency stop capability
### Post-Update
- Service verification
- Performance monitoring
- Automatic rollback triggers
- Success confirmation
## Rollback Strategy
### Automatic Rollback Triggers
- Health check failures
- Service startup failures
- Critical error detection
- Timeout exceeded
### Manual Rollback
```bash
# Quick rollback to previous generation
lab update rollback --group=critical --immediate
# Rollback to specific generation
lab update rollback --group=infrastructure --to-generation=150
# Selective rollback (specific machines)
lab update rollback --machine=db-primary,web-01
```
## Testing Strategy
### Development Environment
- Test updates in isolated environment
- Validate scripts and configurations
- Performance testing
- Failure scenario testing
### Staging Rollout
- Deploy to staging group first
- Automated testing suite
- Manual verification
- Production deployment
## Security Considerations
- Secure communication channels
- Authentication for update commands
- Audit logging
- Access control for update scripts
- Encrypted configuration storage
## Future Enhancements
### Advanced Scheduling
- Maintenance window management
- Business hour awareness
- Holiday scheduling
- Emergency update capabilities
### Intelligence Features
- Machine learning for optimal timing
- Predictive failure detection
- Automatic dependency discovery
- Performance impact analysis
### Integration Expansions
- CI/CD pipeline integration
- Cloud provider APIs
- Container orchestration
- Configuration management systems
## Implementation Roadmap
### Phase 1 (2 weeks)
- Basic staggered update script
- Simple group management
- Nix integration
- Basic health checks
### Phase 2 (4 weeks)
- Lab tool integration
- Advanced scheduling
- Monitoring and alerting
- Rollback mechanisms
### Phase 3 (6 weeks)
- Advanced features
- Performance optimization
- Extended integrations
- Documentation and training
## Resources and References
- NixOS Manual: System Administration
- Cron Best Practices
- Blue-Green Deployment Patterns
- Infrastructure as Code Principles
- Monitoring and Observability Patterns
## Conclusion
A well-designed staggered update system will significantly improve lab maintenance efficiency while reducing risk. The combination of Nix's atomic updates, careful orchestration, and comprehensive monitoring provides a robust foundation for automated infrastructure management.