some research and loose thoughts

parent 076c38d829
commit 12fb56f35b
7 changed files with 2160 additions and 5 deletions
@@ -12,6 +12,7 @@ This roadmap outlines the complete integration of Retrieval Augmented Generation

### ✅ **Completed Components**

#### Task Master AI Core

- **Installation**: Claude Task Master AI successfully packaged for NixOS
- **Local Binary**: Available at `/home/geir/Home-lab/result/bin/task-master-ai`
- **Ollama Integration**: Configured to use local models (qwen3:4b, deepseek-r1:1.5b, gemma3:4b-it-qat)

@@ -19,6 +20,7 @@ This roadmap outlines the complete integration of Retrieval Augmented Generation

- **VS Code Integration**: Configured for Cursor/VS Code with MCP protocol

#### Infrastructure Components

- **NixOS Service Module**: `rag-taskmaster.nix` implemented with full configuration options
- **Active Projects**:
  - Home lab (deploy-rs integration): 90% complete (9/10 tasks done)

@@ -28,23 +30,27 @@ This roadmap outlines the complete integration of Retrieval Augmented Generation

### 🔄 **In Progress**

#### RAG System Implementation

- **Status**: Planned but not yet deployed
- **Dependencies**: Need to implement RAG core components
- **Module Ready**: NixOS service module exists but needs RAG implementation

#### MCP Integration for RAG

- **Status**: Bridge architecture designed
- **Requirements**: Need to implement RAG MCP server alongside existing Task Master MCP

### 📋 **Outstanding Requirements**

#### Phase 1-3 Implementation Needed

1. **RAG Foundation** - Core RAG system with document indexing
2. **MCP RAG Server** - Separate MCP server for document queries
3. **Production Deployment** - Deploy services to grey-area server
4. **Cross-Service Integration** - Connect RAG and Task Master systems

### 🎯 **Current Active Focus**

- Deploy-rs integration project (nearly complete)
- Guile home lab tooling migration (early phase)
@@ -256,6 +268,7 @@ graph TB

**Status**: NixOS module exists, needs deployment and testing

**Completed Tasks**:

- ✅ NixOS module development (`rag-taskmaster.nix`)
- ✅ Service configuration templates
- ✅ User isolation and security configuration

@@ -294,6 +307,7 @@ graph TB

**Status**: Core functionality complete, bridge integration needed

**Completed Tasks**:

- ✅ Task Master installation and packaging
- ✅ Ollama integration configuration
- ✅ MCP server with 25+ tools
535 research/guile-configuration-strategy.md Normal file

@@ -0,0 +1,535 @@
# Guile-Based Programmatic Configuration Strategy

## Overview

Research into implementing a robust, programmatic configuration system using Guile's strengths for the lab tool, moving beyond simple YAML to leverage Scheme's expressiveness and composability.

## Why Guile for Configuration?

### Advantages Over YAML

- **Programmable**: Logic, conditionals, functions in configuration
- **Composable**: Reusable configuration snippets and inheritance
- **Type Safety**: Scheme's type system prevents configuration errors
- **Extensible**: Custom DSL capabilities for lab-specific concepts
- **Dynamic**: Runtime configuration generation and validation
- **Functional**: Pure functions for configuration transformation

### Guile-Specific Benefits

- **S-expressions**: Natural data structure representation
- **Modules**: Clean separation of configuration concerns
- **Macros**: Custom syntax for common patterns
- **GOOPS**: Object-oriented configuration when needed
- **Records**: Structured data with validation

## Configuration Architecture

### Hierarchical Structure

```
config/
├── machines/         # Machine-specific configurations
│   ├── sleeper-service.scm
│   ├── grey-area.scm
│   ├── reverse-proxy.scm
│   └── orchestrator.scm
├── groups/           # Machine group definitions
│   ├── infrastructure.scm
│   ├── services.scm
│   └── development.scm
├── environments/     # Environment-specific configs
│   ├── production.scm
│   ├── staging.scm
│   └── development.scm
├── templates/        # Reusable configuration templates
│   ├── web-server.scm
│   ├── database.scm
│   └── monitoring.scm
└── base.scm          # Core configuration framework
```
## Implementation Plan

### 1. Configuration Framework Module

```scheme
;; config/base.scm - Core configuration framework
(define-module (config base)
  #:use-module (srfi srfi-9)   ; Records
  #:use-module (srfi srfi-1)   ; Lists
  #:use-module (ice-9 match)
  #:use-module (ice-9 format)
  #:export (define-machine
            define-group
            define-environment
            machine?
            group?
            get-machine-config
            get-group-machines
            validate-config
            merge-configs
            resolve-inheritance
            ;; needed by the define-machine / define-group expansions
            parse-machine-config
            alist->machine
            alist->group))

;; Machine record type with validation
(define-record-type <machine>
  (make-machine name hostname user services groups environment metadata)
  machine?
  (name machine-name)
  (hostname machine-hostname)
  (user machine-user)
  (services machine-services)
  (groups machine-groups)
  (environment machine-environment)
  (metadata machine-metadata))

;; Group record type
(define-record-type <group>
  (make-group name machines deployment-order dependencies metadata)
  group?
  (name group-name)
  (machines group-machines)
  (deployment-order group-deployment-order)
  (dependencies group-dependencies)
  (metadata group-metadata))

;; Environment record type
(define-record-type <environment>
  (make-environment name settings overrides)
  environment?
  (name environment-name)
  (settings environment-settings)
  (overrides environment-overrides))
;; Configuration DSL macros
;; The DSL parses keyword arguments into an alist, then fills the record
;; fields from that alist via the alist->machine / alist->group helpers.
(define-syntax define-machine
  (syntax-rules ()
    ((_ name hostname config ...)
     (alist->machine 'name hostname
                     (parse-machine-config (list config ...))))))

(define-syntax define-group
  (syntax-rules ()
    ((_ name config ...)
     (alist->group 'name
                   (parse-machine-config (list config ...))))))

;; Pure function: Parse machine configuration
(define (parse-machine-config config-list)
  "Parse configuration from keyword-value pairs into an alist"
  (let loop ((config config-list)
             (result '()))
    (match config
      (() result)
      ((key value . rest)
       (loop rest (cons (cons (keyword->symbol key) value) result))))))

;; Build records from parsed alists
(define (alist->machine name hostname config)
  "Build a <machine> record from a parsed configuration alist"
  (make-machine name hostname
                (assoc-ref config 'user)
                (assoc-ref config 'services)
                (assoc-ref config 'groups)
                (assoc-ref config 'environment)
                config))

(define (alist->group name config)
  "Build a <group> record from a parsed configuration alist"
  (make-group name
              (assoc-ref config 'machines)
              (assoc-ref config 'deployment-order)
              (assoc-ref config 'dependencies)
              config))
;; Configuration inheritance resolver
(define (resolve-inheritance machine-config template-configs)
  "Resolve configuration inheritance from templates"
  (fold merge-configs machine-config template-configs))

;; Configuration merger
(define (merge-configs base-config override-config)
  "Merge two configurations, with override taking precedence"
  (append override-config
          (filter (lambda (item)
                    (not (assoc (car item) override-config)))
                  base-config)))

;; Configuration validator
(define (validate-config config)
  "Validate configuration completeness and consistency"
  (and (assoc 'hostname config)
       (assoc 'user config)
       (string? (assoc-ref config 'hostname))
       (string? (assoc-ref config 'user))))
```
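
For example, `merge-configs` keeps every pair from the override list and only the non-shadowed pairs from the base; a quick sketch of the expected REPL behavior:

```scheme
;; Expected behavior of merge-configs at the REPL
(merge-configs '((user . "root") (ssh-port . 22))   ; base
               '((ssh-port . 2222)))                ; override
;; => ((ssh-port . 2222) (user . "root"))
```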
### 2. Machine Configuration Examples

```scheme
;; config/machines/sleeper-service.scm
(define-module (config machines sleeper-service)
  #:use-module (config base)
  #:use-module (config templates web-server)
  #:export (sleeper-service-config))

(define sleeper-service-config
  (define-machine sleeper-service "sleeper-service.local"
    #:user "root"
    #:services '(nginx postgresql redis)
    #:groups '(infrastructure database)
    #:environment 'production
    #:ssh-port 22
    #:deploy-strategy 'rolling
    #:health-checks '((http "http://localhost:80/health")
                      (tcp 5432)
                      (tcp 6379))
    #:dependencies '()
    #:reboot-delay 0          ; First to reboot
    #:backup-required #t
    #:monitoring-enabled #t
    #:metadata `((description . "Main application server")
                 (maintainer . "geir")
                 (criticality . high))))

;; config/machines/grey-area.scm
(define-module (config machines grey-area)
  #:use-module (config base)
  #:use-module (config templates monitoring)
  #:export (grey-area-config))

(define grey-area-config
  (define-machine grey-area "grey-area.local"
    #:user "root"
    #:services '(prometheus grafana alertmanager)
    #:groups '(infrastructure monitoring)
    #:environment 'production
    #:ssh-port 22
    #:deploy-strategy 'blue-green
    #:health-checks '((http "http://localhost:3000/health")
                      (http "http://localhost:9090/-/healthy"))
    #:dependencies '(sleeper-service)
    #:reboot-delay 600        ; 10 minutes after sleeper-service
    #:backup-required #f
    #:monitoring-enabled #t
    #:metadata `((description . "Monitoring and observability")
                 (maintainer . "geir")
                 (criticality . medium))))

;; config/machines/reverse-proxy.scm
(define-module (config machines reverse-proxy)
  #:use-module (config base)
  #:use-module (config templates proxy)
  #:export (reverse-proxy-config))

(define reverse-proxy-config
  (define-machine reverse-proxy "reverse-proxy.local"
    #:user "root"
    #:services '(nginx traefik)
    #:groups '(infrastructure edge)
    #:environment 'production
    #:ssh-port 22
    #:deploy-strategy 'rolling
    #:health-checks '((http "http://localhost:80/health")
                      (tcp 443))
    #:dependencies '(sleeper-service grey-area)
    #:reboot-delay 1200       ; 20 minutes after sleeper-service
    #:backup-required #f
    #:monitoring-enabled #t
    #:public-facing #t
    #:ssl-certificates '("homelab.local" "*.homelab.local")
    #:metadata `((description . "Edge proxy and load balancer")
                 (maintainer . "geir")
                 (criticality . high))))
```
### 3. Group Configuration

```scheme
;; config/groups/infrastructure.scm
(define-module (config groups infrastructure)
  #:use-module (config base)
  #:export (infrastructure-group))

(define infrastructure-group
  (define-group infrastructure
    #:machines '(sleeper-service grey-area reverse-proxy)
    #:deployment-order '(sleeper-service grey-area reverse-proxy)
    #:reboot-sequence '((sleeper-service . 0)
                        (grey-area . 600)
                        (reverse-proxy . 1200))
    #:update-strategy 'staggered
    #:rollback-strategy 'reverse-order
    #:health-check-required #t
    #:maintenance-window '("02:00" . "06:00")  ; strings, not raw 02:00 tokens
    #:notification-channels '(email discord)
    #:metadata `((description . "Core infrastructure services")
                 (owner . "platform-team")
                 (sla . "99.9%"))))

;; config/groups/services.scm
(define-module (config groups services)
  #:use-module (config base)
  #:export (services-group))

(define services-group
  (define-group services
    #:machines '(app-server-01 app-server-02 worker-01)
    #:deployment-order '(worker-01 app-server-01 app-server-02)
    #:update-strategy 'rolling
    #:canary-percentage 25
    #:health-check-required #t
    #:dependencies '(infrastructure)
    #:metadata `((description . "Application services")
                 (owner . "application-team"))))
```
### 4. Template System

```scheme
;; config/templates/web-server.scm
(define-module (config templates web-server)
  #:use-module (config base)
  #:export (web-server-template))

(define web-server-template
  '((services . (nginx))
    (ports . (80 443))
    (health-checks . ((http "http://localhost:80/health")))
    (deploy-strategy . rolling)
    (backup-required . #f)
    (monitoring-enabled . #t)
    (firewall-rules . ((allow 80 tcp)
                       (allow 443 tcp)))))

;; config/templates/database.scm
(define-module (config templates database)
  #:use-module (config base)
  #:export (database-template))

(define database-template
  '((services . (postgresql))
    (ports . (5432))
    (health-checks . ((tcp 5432)
                      (pg-isready)))
    (deploy-strategy . blue-green)
    (backup-required . #t)
    (backup-schedule . "0 2 * * *")
    (monitoring-enabled . #t)
    (replication-enabled . #f)
    (firewall-rules . ((allow 5432 tcp internal)))))
```
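
Templates are plain alists, so a machine's settings can be layered over them with `resolve-inheritance`; a small sketch of the intended use (the `effective-config` name is illustrative):

```scheme
;; Layer the web-server template under machine-specific settings.
;; fold applies merge-configs with the machine alist as the accumulator,
;; so machine-specific pairs take precedence over template defaults.
(define effective-config
  (resolve-inheritance '((deploy-strategy . blue-green)        ; machine override
                         (hostname . "sleeper-service.local"))
                       (list web-server-template)))
;; => deploy-strategy stays blue-green; ports, health-checks and
;;    firewall-rules come from web-server-template
```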
### 5. Configuration Loader Integration

```scheme
;; lab/config-loader.scm - Integration with existing lab tool
(define-module (lab config-loader)
  #:use-module (srfi srfi-1)   ; find
  #:use-module (config base)
  #:use-module (config machines sleeper-service)
  #:use-module (config machines grey-area)
  #:use-module (config machines reverse-proxy)
  #:use-module (config groups infrastructure)
  #:use-module (utils logging)
  #:export (load-lab-config
            get-all-machines
            get-machine-info
            get-reboot-sequence
            get-deployment-order))

;; Global configuration registry
(define *lab-config*
  `((machines . ,(list sleeper-service-config
                       grey-area-config
                       reverse-proxy-config))
    (groups . ,(list infrastructure-group))
    (environments . ())))

;; Pure function: Get all machine configurations
(define (get-all-machines-from-config)
  "Get all machine configurations"
  (assoc-ref *lab-config* 'machines))

;; Pure function: Find machine by name
(define (find-machine-by-name name machines)
  "Find machine configuration by name"
  (find (lambda (machine)
          (eq? (machine-name machine) name))
        machines))

;; Integration function: Get machine info for existing lab tool
(define (get-machine-info machine-name)
  "Get machine information in format expected by existing lab tool"
  (let* ((machines (get-all-machines-from-config))
         (machine (find-machine-by-name machine-name machines)))
    (if machine
        `((hostname . ,(machine-hostname machine))
          (user . ,(machine-user machine))
          (ssh-port . ,(assoc-ref (machine-metadata machine) 'ssh-port))
          (is-local . ,(string=? (machine-hostname machine) "localhost")))
        #f)))

;; Get reboot sequence for orchestrator
(define (get-reboot-sequence)
  "Get the ordered reboot sequence with delays"
  (let ((infra-group (car (assoc-ref *lab-config* 'groups))))
    (assoc-ref (group-metadata infra-group) 'reboot-sequence)))

;; Get deployment order
(define (get-deployment-order group-name)
  "Get deployment order for a group"
  (let* ((groups (assoc-ref *lab-config* 'groups))
         (group (find (lambda (g) (eq? (group-name g) group-name)) groups)))
    (if group
        (group-deployment-order group)
        '())))
```
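
Given the registry above, the integration function should behave like this at the REPL (values taken from the sleeper-service definition; a sketch of expected output, not captured output):

```scheme
(get-machine-info 'sleeper-service)
;; => ((hostname . "sleeper-service.local")
;;     (user . "root")
;;     (ssh-port . 22)
;;     (is-local . #f))
```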
### 6. Integration with Existing Lab Tool

```scheme
;; Update lab/machines.scm to use new config system
(define-module (lab machines)
  #:use-module (lab config-loader)
  ;; ...existing modules...
  #:export (;; ...existing exports...
            get-ssh-config
            validate-machine-name
            list-machines))

;; Update existing functions to use new config system
(define (get-ssh-config machine-name)
  "Get SSH configuration for machine - updated to use new config"
  (get-machine-info machine-name))

(define (validate-machine-name machine-name)
  "Validate machine name exists in configuration"
  (let ((machine-info (get-machine-info machine-name)))
    (not (eq? machine-info #f))))

(define (list-machines)
  "List all configured machines"
  (map machine-name (get-all-machines-from-config)))

;; New function: Get machines by group
(define (get-machines-in-group group-name)
  "Get all machines in a specific group"
  (let ((deployment-order (get-deployment-order group-name)))
    (if deployment-order
        deployment-order
        '())))
```
## Advanced Configuration Features

### 1. Environment-Specific Overrides

```scheme
;; config/environments/production.scm
(define production-environment
  (make-environment 'production
    ;; Base settings
    '((log-level . info)
      (debug-mode . #f)
      (monitoring-enabled . #t)
      (backup-enabled . #t))
    ;; Machine-specific overrides
    '((sleeper-service . ((log-level . warn)
                          (max-connections . 1000)))
      (grey-area . ((retention-days . 90))))))

;; config/environments/development.scm
(define development-environment
  (make-environment 'development
    '((log-level . debug)
      (debug-mode . #t)
      (monitoring-enabled . #f)
      (backup-enabled . #f))
    '()))
```
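
The environment records store settings and overrides, but nothing above applies them. A minimal sketch of a hypothetical `apply-environment` helper (not part of the modules above), reusing `merge-configs` precedence:

```scheme
;; Hypothetical helper: environment base settings sit under the machine's
;; own config, while the environment's machine-specific overrides win.
(define (apply-environment env machine-name machine-config)
  "Apply an <environment>'s settings and overrides to one machine's alist"
  (let* ((with-settings (merge-configs (environment-settings env)
                                       machine-config))
         (overrides (assoc-ref (environment-overrides env) machine-name)))
    (if overrides
        (merge-configs with-settings overrides)
        with-settings)))
```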
### 2. Dynamic Configuration Generation

```scheme
;; config/generators/auto-scaling.scm
(define (generate-web-server-configs count)
  "Dynamically generate web server configurations"
  ;; Call the underlying constructor directly: define-machine quotes its
  ;; name argument, so the macro cannot be used with computed names.
  (map (lambda (i)
         (alist->machine (string->symbol (format #f "web-~2,'0d" i))
                         (format #f "web-~2,'0d.local" i)
                         (parse-machine-config
                          (list #:user "root"
                                #:services '(nginx)
                                #:groups '(web-servers)
                                #:template web-server-template))))
       (iota count 1)))

;; Usage in configuration
(define web-servers (generate-web-server-configs 3))
```
### 3. Configuration Validation

```scheme
;; config/validation.scm
(define-module (config validation)
  #:use-module (srfi srfi-1)   ; every
  #:use-module (config base)
  #:export (validate-lab-config
            check-dependencies
            validate-network-topology))

(define (validate-lab-config config)
  "Comprehensive configuration validation"
  (and (validate-machine-configs config)
       (validate-group-dependencies config)
       (validate-network-topology config)
       (validate-reboot-sequence config)))

(define (validate-machine-configs config)
  "Validate all machine configurations"
  (every validate-config
         (map machine-metadata
              (assoc-ref config 'machines))))

(define (validate-reboot-sequence config)
  "Validate reboot sequence dependencies"
  (let ((sequence (get-reboot-sequence)))
    (check-dependency-order sequence)))
```
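
`check-dependency-order` is referenced above but never defined; one possible sketch, assuming the sequence is the `((machine . delay) ...)` alist from the infrastructure group:

```scheme
;; Hypothetical: a reboot sequence is well-ordered when delays never
;; decrease along the listed order.
(define (check-dependency-order sequence)
  "Check that reboot delays are non-decreasing across the sequence"
  (let loop ((prev 0) (rest sequence))
    (cond ((null? rest) #t)
          ((< (cdar rest) prev) #f)
          (else (loop (cdar rest) (cdr rest))))))
```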
## Migration Strategy

### Phase 1: Parallel Configuration System

1. Implement new config modules alongside existing YAML
2. Add config-loader integration layer
3. Update lab tool to optionally use new system
4. Validate equivalent behavior

### Phase 2: Feature Enhancement

1. Add dynamic configuration capabilities
2. Implement validation and error checking
3. Add environment-specific overrides
4. Enhance orchestrator with new features

### Phase 3: Full Migration

1. Migrate all existing configurations
2. Remove YAML dependency
3. Add advanced features (templates, inheritance)
4. Optimize performance
## Benefits of This Approach

### Developer Experience

- **Rich Configuration**: Logic and computation in config
- **Type Safety**: Catch errors at config load time
- **Reusability**: Templates and inheritance reduce duplication
- **Composability**: Mix and match configuration components
- **Validation**: Comprehensive consistency checking

### Operational Benefits

- **Dynamic Scaling**: Generate configurations programmatically
- **Environment Management**: Seamless dev/staging/prod handling
- **Dependency Tracking**: Automatic dependency resolution
- **Extensibility**: Easy to add new machine types and features

### Integration Advantages

- **Native Guile**: No external dependencies or parsers
- **Performance**: Compiled configuration, fast access
- **Debugging**: Full Guile debugging tools available
- **Flexibility**: Can mix declarative and imperative approaches
## File Structure Summary

```
/home/geir/Home-lab/
├── packages/lab-tool/
│   ├── config/
│   │   ├── base.scm           # Configuration framework
│   │   ├── machines/          # Machine definitions
│   │   ├── groups/            # Group definitions
│   │   ├── environments/      # Environment configs
│   │   ├── templates/         # Reusable templates
│   │   └── validation.scm     # Configuration validation
│   ├── lab/
│   │   ├── config-loader.scm  # Integration layer
│   │   ├── machines.scm       # Updated machine management
│   │   └── ...existing modules...
│   └── main.scm               # Updated main entry point
```

This approach leverages Guile's strengths to create a powerful, flexible configuration system that grows with your homelab while maintaining the K.I.S.S principles of your current tool.
455 research/lab-orchestrator-service.md Normal file

@@ -0,0 +1,455 @@
# Lab-Wide Auto-Update Service with Staggered Reboots

## Overview

A NixOS service that runs on this machine (the orchestrator), updates the entire homelab using existing lab tool commands, then performs staggered reboots so you wake up to a freshly updated lab every morning.

## Service Architecture

### Central Orchestrator Approach

- Runs on this machine (the controller)
- Uses existing `lab update` and `lab deploy-all` commands
- Orchestrates staggered reboots: sleeper-service → grey-area → reverse-proxy → self
- 10-minute delays between each machine reboot

## Implementation
### 1. Nix Service Module

```nix
# /home/geir/Home-lab/nix/modules/lab-orchestrator.nix
{ config, lib, pkgs, ... }:

with lib;

let
  cfg = config.services.lab-orchestrator;

  labPath = "/home/geir/Home-lab";

  # Machine reboot order with delays
  rebootSequence = [
    { machine = "sleeper-service"; delay = 0; }
    { machine = "grey-area"; delay = 600; }      # 10 minutes
    { machine = "reverse-proxy"; delay = 1200; } # 20 minutes total
    { machine = "self"; delay = 1800; }          # 30 minutes total
  ];

  # writeShellScript already provides the shebang
  orchestratorScript = pkgs.writeShellScript "lab-orchestrator" ''
    set -euo pipefail

    LOG_FILE="/var/log/lab-orchestrator.log"
    LAB_TOOL="${labPath}/result/bin/lab"

    log() {
      echo "$(date '+%Y-%m-%d %H:%M:%S'): $1" | tee -a "$LOG_FILE"
    }

    # Ensure lab tool is available
    if [[ ! -x "$LAB_TOOL" ]]; then
      log "ERROR: Lab tool not found at $LAB_TOOL"
      log "Building lab tool first..."
      cd "${labPath}"
      if ! nix build .#lab-tool; then
        log "ERROR: Failed to build lab tool"
        exit 1
      fi
    fi

    log "=== Starting Lab-Wide Update Orchestration ==="

    # Step 1: Update flake inputs
    log "Updating flake inputs..."
    cd "${labPath}"
    if ! $LAB_TOOL update; then
      log "ERROR: Failed to update flake inputs"
      exit 1
    fi
    log "Flake inputs updated successfully"

    # Step 2: Deploy to all machines
    log "Deploying to all machines..."
    if ! $LAB_TOOL deploy-all; then
      log "ERROR: Failed to deploy to all machines"
      exit 1
    fi
    log "Deployment completed successfully"

    # Step 3: Staggered reboots
    log "Starting staggered reboot sequence..."

    # Reboot sleeper-service immediately
    log "Rebooting sleeper-service..."
    if $LAB_TOOL reboot sleeper-service; then
      log "✓ sleeper-service reboot initiated"
    else
      log "WARNING: Failed to reboot sleeper-service"
    fi

    # Wait 10 minutes, then reboot grey-area
    log "Waiting 10 minutes before rebooting grey-area..."
    sleep 600
    log "Rebooting grey-area..."
    if $LAB_TOOL reboot grey-area; then
      log "✓ grey-area reboot initiated"
    else
      log "WARNING: Failed to reboot grey-area"
    fi

    # Wait 10 minutes, then reboot reverse-proxy
    log "Waiting 10 minutes before rebooting reverse-proxy..."
    sleep 600
    log "Rebooting reverse-proxy..."
    if $LAB_TOOL reboot reverse-proxy; then
      log "✓ reverse-proxy reboot initiated"
    else
      log "WARNING: Failed to reboot reverse-proxy"
    fi

    # Wait 10 minutes, then reboot self
    log "Waiting 10 minutes before rebooting self..."
    sleep 600
    log "Rebooting this machine (orchestrator)..."
    log "=== Lab Update Orchestration Completed ==="

    # Reboot this machine
    systemctl reboot
  '';

in
{
  options.services.lab-orchestrator = {
    enable = mkEnableOption "Lab orchestrator auto-update service";

    schedule = mkOption {
      type = types.str;
      default = "02:00";
      description = "Time to start lab update (HH:MM format)";
    };

    user = mkOption {
      type = types.str;
      default = "geir";
      description = "User to run the lab tool as";
    };
  };

  config = mkIf cfg.enable {
    systemd.services.lab-orchestrator = {
      description = "Lab-Wide Update Orchestrator";
      serviceConfig = {
        Type = "oneshot";
        User = cfg.user;
        Group = "users";
        WorkingDirectory = labPath;
        ExecStart = "${orchestratorScript}";
        # Give it plenty of time (2 hours)
        TimeoutStartSec = 7200;
      };
      # Ensure network is ready
      after = [ "network-online.target" ];
      wants = [ "network-online.target" ];
    };

    systemd.timers.lab-orchestrator = {
      description = "Lab-Wide Update Orchestrator Timer";
      timerConfig = {
        OnCalendar = "*-*-* ${cfg.schedule}:00";
        Persistent = true;
        # No randomization - we want predictable timing
      };
      wantedBy = [ "timers.target" ];
    };

    # Ensure log directory and file exist with proper permissions
    systemd.tmpfiles.rules = [
      "f /var/log/lab-orchestrator.log 0644 ${cfg.user} users -"
    ];
  };
}
```
### 2. Lab Tool Reboot Command Extension

Add reboot capability to the existing Guile lab tool:

```scheme
;; lab/reboot.scm - New module for machine reboots
(define-module (lab reboot)
  #:use-module (ice-9 format)
  #:use-module (ice-9 popen)
  #:use-module (ice-9 textual-ports)  ; get-string-all
  #:use-module (utils logging)
  #:use-module (lab machines)
  #:export (reboot-machine))

(define (execute-ssh-command hostname command)
  "Execute command on remote machine via SSH"
  (let* ((ssh-cmd (format #f "ssh root@~a '~a'" hostname command))
         (port (open-input-pipe ssh-cmd))
         (output (get-string-all port)))
    (close-pipe port)
    output))

(define (reboot-machine machine-name)
  "Reboot a specific machine via SSH"
  (log-info "Attempting to reboot machine: ~a" machine-name)

  (if (validate-machine-name machine-name)
      (let* ((ssh-config (get-ssh-config machine-name))
             (hostname (if ssh-config
                           (assoc-ref ssh-config 'hostname)
                           machine-name))
             (is-local (if ssh-config
                           (assoc-ref ssh-config 'is-local)
                           #f)))

        (cond
         (is-local
          (log-info "Rebooting local machine...")
          (system "sudo systemctl reboot")
          #t)

         (hostname
          (log-info "Rebooting ~a via SSH..." hostname)
          (catch #t
            (lambda ()
              ;; Send reboot command - connection will drop
              (execute-ssh-command hostname "sudo systemctl reboot")
              (log-success "Reboot command sent to ~a" machine-name)
              #t)
            (lambda (key . args)
              ;; SSH connection drop is expected during reboot
              (if (string-contains (format #f "~a" args) "Connection")
                  (begin
                    (log-info "Connection dropped (expected during reboot)")
                    #t)
                  (begin
                    (log-error "Failed to reboot ~a: ~a" machine-name args)
                    #f)))))

         (else
          (log-error "No hostname found for machine: ~a" machine-name)
          #f)))

      (begin
        (log-error "Invalid machine name: ~a" machine-name)
        #f)))
```
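
Intended interactive use, assuming the lab modules are on the Guile load path (a sketch of expected behavior, not captured output):

```scheme
(use-modules (lab reboot))

;; Sends "sudo systemctl reboot" over SSH; a dropped connection is
;; treated as success since the host goes down immediately.
(reboot-machine 'sleeper-service)   ; => #t on success
```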
### 3. CLI Integration

Update the main.scm dispatcher to include the reboot command:

```scheme
;; main.scm (additions to command dispatcher)
(use-modules ;; ...existing modules...
             (lab reboot))

;; Add to dispatch-command function
(define (dispatch-command command args)
  "Dispatch command with appropriate handler"
  (match command
    ;; ...existing cases...

    ('reboot
     (if (null? args)
         (begin
           (log-error "reboot command requires machine name")
           (format #t "Usage: lab reboot <machine>\n"))
         (let ((result (reboot-machine (car args))))
           (if result
               (log-success "Reboot initiated")
               (log-error "Reboot failed")))))

    ;; ...rest of existing cases...
    ))

;; Update help text to include reboot command
(define (get-help-text)
  "Pure function returning help text"
  "Home Lab Tool - K.I.S.S Refactored Edition

USAGE: lab <command> [args...]

COMMANDS:
  status            Show infrastructure status
  machines          List all machines
  deploy <machine>  Deploy configuration to machine
  deploy-all        Deploy to all machines
  update            Update flake inputs
  health [machine]  Check machine health (all if no machine specified)
  ssh <machine>     SSH to machine
  reboot <machine>  Reboot machine via SSH
  test-modules      Test modular implementation
  help              Show this help

EXAMPLES:
  lab status
  lab machines
  lab deploy congenital-optimist
  lab deploy-all
  lab update
  lab health
  lab health sleeper-service
  lab ssh sleeper-service
  lab reboot sleeper-service
  lab test-modules
")
```
### 4. Configuration

Enable the service on this machine (the orchestrator):

```nix
# hosts/this-machine/configuration.nix
{
  imports = [
    ../../nix/modules/lab-orchestrator.nix
  ];

  services.lab-orchestrator = {
    enable = true;
    schedule = "02:00"; # 2 AM start
    user = "geir";
  };
}
```
## Timeline Breakdown

### Nightly Execution (Starting 2:00 AM)

```
02:00       - Start orchestration
02:00-02:15 - Update flake inputs (lab update)
02:15-02:45 - Deploy to all machines (lab deploy-all)
02:45       - Reboot sleeper-service
02:55       - Reboot grey-area (10 min later)
03:05       - Reboot reverse-proxy (10 min later)
03:15       - Reboot orchestrator machine (10 min later)
03:20       - All machines back online and updated
```

### Total Duration: ~1 hour 20 minutes

- Deployment: ~30 minutes
- Staggered reboots: ~50 minutes
- Everything done by 3:20 AM
## Safety Features

### Logging and Monitoring

```bash
# Check orchestrator logs
sudo journalctl -u lab-orchestrator.service -f

# Check orchestrator log file
tail -f /var/log/lab-orchestrator.log

# Check timer status
systemctl status lab-orchestrator.timer
```

### Manual Controls

```bash
# Start update manually
sudo systemctl start lab-orchestrator.service

# Disable automatic updates
sudo systemctl disable lab-orchestrator.timer

# Check when next run is scheduled
systemctl list-timers lab-orchestrator.timer
```

### Recovery Options

```bash
# If orchestration fails, machines can be individually managed
lab deploy sleeper-service
lab deploy grey-area
lab deploy reverse-proxy

# Emergency reboot sequence
lab reboot sleeper-service
sleep 600
lab reboot grey-area
sleep 600
lab reboot reverse-proxy
```
## Machine Configuration Requirements

### SSH Key Setup

Ensure this machine can SSH to all target machines:

```bash
# Test connectivity
ssh root@sleeper-service "echo 'Connection OK'"
ssh root@grey-area "echo 'Connection OK'"
ssh root@reverse-proxy "echo 'Connection OK'"
```

### Lab Tool Configuration

Ensure lab.yaml includes all machines:

```yaml
machines:
  sleeper-service:
    host: sleeper-service.local
    user: root
  grey-area:
    host: grey-area.local
    user: root
  reverse-proxy:
    host: reverse-proxy.local
    user: root
```
## Deployment Steps

### 1. Create the Service Module

Add the Nix module file and import it.

### 2. Extend Lab Tool

Add the reboot command functionality.

### 3. Test Components

```bash
# Build the lab tool first
cd /home/geir/Home-lab
nix build .#lab-tool

# Test lab commands work
./result/bin/lab update
./result/bin/lab deploy-all
./result/bin/lab machines
./result/bin/lab reboot sleeper-service  # Test reboot (be careful!)
```

### 4. Enable Service

```bash
# Add to configuration and rebuild
nixos-rebuild switch

# Verify timer is active
systemctl status lab-orchestrator.timer
```

### 5. Monitor First Run

```bash
# Watch the logs during first execution
sudo journalctl -u lab-orchestrator.service -f
```

## Benefits

### Morning Routine

- Wake up to fully updated homelab
- All services running latest versions
- No manual intervention needed
- Predictable update schedule

### Reliability

- Uses existing, tested lab tool commands
- Proper error handling and logging
- Graceful degradation if individual reboots fail
- Easy to disable or modify timing

### Visibility

- Comprehensive logging of entire process
- Clear timestamps for each phase
- Easy troubleshooting if issues occur

This gives you the "wake up to fresh lab" experience with minimal complexity, leveraging your existing infrastructure!
@@ -325,6 +325,7 @@ Netdata provides a comprehensive REST API that makes it perfect for integrating

**Base URL**: `http://localhost:19999/api/v1/`

**Primary Endpoints**:

- `/api/v1/data` - Query time-series data
- `/api/v1/charts` - Get available charts
- `/api/v1/allmetrics` - Get all metrics in shell-friendly format
@@ -354,6 +355,7 @@ Netdata provides a comprehensive REST API that makes it perfect for integrating

### API Query Examples

#### Basic Data Query

```bash
# Get CPU system data for the last 60 seconds
curl "http://localhost:19999/api/v1/data?chart=system.cpu&after=-60&dimensions=system"
@@ -386,6 +388,7 @@ curl "http://localhost:19999/api/v1/data?chart=system.cpu&after=-60&dimensions=system"

```

#### Available Charts Discovery

```bash
# Get all available charts
curl "http://localhost:19999/api/v1/charts"
@@ -398,18 +401,21 @@ curl "http://localhost:19999/api/v1/charts"

```

#### Memory Usage Example

```bash
# Get memory usage data with specific grouping
curl "http://localhost:19999/api/v1/data?chart=system.ram&after=-300&points=60&group=average"
```

#### Network Interface Metrics

```bash
# Get network traffic for specific interface
curl "http://localhost:19999/api/v1/data?chart=net.eth0&after=-60&dimensions=received,sent"
```

#### All Metrics in Shell Format

```bash
# Perfect for scripting and automation
curl "http://localhost:19999/api/v1/allmetrics"
@@ -438,6 +444,7 @@ NETDATA_SYSTEM_RAM_USED=4096

### Web Dashboard Integration Strategies

#### 1. Direct AJAX Calls

```javascript
// Fetch CPU data for dashboard widget
fetch('http://localhost:19999/api/v1/data?chart=system.cpu&after=-60&points=60')
@@ -449,6 +456,7 @@ fetch('http://localhost:19999/api/v1/data?chart=system.cpu&after=-60&points=60')

```

#### 2. Server-Side Proxy

```javascript
// Proxy through your web server to avoid CORS issues
fetch('/api/netdata/system.cpu?after=-60')
@@ -457,6 +465,7 @@ fetch('/api/netdata/system.cpu?after=-60')

```

#### 3. Real-Time Updates

```javascript
// Poll for updates every second
setInterval(() => {
@@ -537,23 +546,27 @@ setInterval(() => {

### Integration Considerations

#### 1. **CORS Handling**

- Netdata allows cross-origin requests by default
- For production, consider proxying through your web server
- Use server-side API calls for sensitive environments

#### 2. **Performance Optimization**

- Cache frequently accessed chart definitions
- Use appropriate `points` parameter to limit data transfer
- Implement efficient polling strategies
- Consider WebSocket connections for real-time updates

#### 3. **Data Processing**

- Netdata returns timestamps and values as arrays
- Convert to your chart library's expected format
- Handle missing data points gracefully
- Implement data aggregation for longer time ranges

#### 4. **Error Handling**

```javascript
async function safeNetdataFetch(endpoint) {
  try {
@@ -589,9 +602,9 @@ class MultiNodeDashboard {

### API Documentation Resources

-- **Swagger Documentation**: https://learn.netdata.cloud/api
-- **OpenAPI Spec**: https://raw.githubusercontent.com/netdata/netdata/master/src/web/api/netdata-swagger.yaml
-- **Query Documentation**: https://learn.netdata.cloud/docs/developer-and-contributor-corner/rest-api/queries/
+- **Swagger Documentation**: <https://learn.netdata.cloud/api>
+- **OpenAPI Spec**: <https://raw.githubusercontent.com/netdata/netdata/master/src/web/api/netdata-swagger.yaml>
+- **Query Documentation**: <https://learn.netdata.cloud/docs/developer-and-contributor-corner/rest-api/queries/>

### Conclusion
382 research/nixos-native-approach.md Normal file

@@ -0,0 +1,382 @@
# Leveraging NixOS Configuration vs Custom Implementation

## Current Situation Analysis

We're at risk of reimplementing significant functionality that NixOS already provides:

### What NixOS Already Handles

- **Machine Configuration**: Complete system configuration as code
- **Service Management**: Declarative service definitions
- **Deployment**: `nixos-rebuild` with atomic updates
- **Validation**: Configuration validation at build time
- **Dependencies**: Service dependency management
- **Environments**: Multiple configurations per machine
- **Templates**: NixOS modules for reusable configuration
- **Type Safety**: Nix language type system
- **Inheritance**: Module imports and overrides

### What We're Duplicating

- Machine metadata and properties
- Service definitions and health checks
- Deployment strategies and validation
- Configuration inheritance and composition
- Environment-specific overrides

## Better Approach: NixOS-Native Strategy

### Core Principle

**Let NixOS handle configuration, let the lab tool handle orchestration.**

### Revised Architecture

#### 1. NixOS Handles Configuration
```nix
# hosts/sleeper-service/configuration.nix
{ config, lib, pkgs, ... }:
{
  # NixOS handles all the configuration
  services.nginx.enable = true;
  services.postgresql.enable = true;

  # Lab-specific metadata as NixOS options
  lab.machine = {
    role = "application-server";
    groups = [ "infrastructure" "database" ];
    rebootOrder = 1;
    dependencies = [ ];
    healthChecks = [
      { type = "http"; url = "http://localhost:80/health"; }
      { type = "tcp"; port = 5432; }
    ];
    orchestration = {
      deployStrategy = "rolling";
      rebootDelay = 0;
      criticalityLevel = "high";
    };
  };
}
```

#### 2. Lab Tool Handles Orchestration

```scheme
;; lab tool queries NixOS configuration, doesn't define it
(define (get-machine-metadata machine-name)
  "Extract lab metadata from NixOS configuration"
  (let ((config-path (format #f "hosts/~a/configuration.nix" machine-name)))
    (extract-lab-metadata-from-nix-config config-path)))

(define (get-reboot-sequence)
  "Get reboot sequence from NixOS configurations"
  (let ((machines (get-all-machines)))
    (sort machines
          (lambda (a b)
            (< (get-reboot-order a) (get-reboot-order b))))))
```
### Implementation Strategy

#### 1. Create NixOS Lab Module

```nix
# nix/modules/lab-machine.nix
{ config, lib, pkgs, ... }:

with lib;

let
  cfg = config.lab.machine;
in
{
  options.lab.machine = {
    role = mkOption {
      type = types.str;
      description = "Machine role in the lab";
      example = "web-server";
    };

    groups = mkOption {
      type = types.listOf types.str;
      default = [];
      description = "Groups this machine belongs to";
    };

    rebootOrder = mkOption {
      type = types.int;
      description = "Order in reboot sequence (lower = earlier)";
    };

    dependencies = mkOption {
      type = types.listOf types.str;
      default = [];
      description = "Machines this depends on";
    };

    healthChecks = mkOption {
      type = types.listOf (types.submodule {
        options = {
          type = mkOption {
            type = types.enum [ "http" "tcp" "command" ];
            description = "Type of health check";
          };
          url = mkOption {
            type = types.nullOr types.str;
            default = null;
            description = "URL for HTTP health checks";
          };
          port = mkOption {
            type = types.nullOr types.int;
            default = null;
            description = "Port for TCP health checks";
          };
          command = mkOption {
            type = types.nullOr types.str;
            default = null;
            description = "Command for command-based health checks";
          };
        };
      });
      default = [];
      description = "Health check configurations";
    };

    orchestration = mkOption {
      type = types.submodule {
        options = {
          deployStrategy = mkOption {
            type = types.enum [ "rolling" "blue-green" "recreate" ];
            default = "rolling";
            description = "Deployment strategy";
          };

          rebootDelay = mkOption {
            type = types.int;
            default = 600; # 10 minutes
            description = "Delay in seconds before this machine reboots";
          };

          criticalityLevel = mkOption {
            type = types.enum [ "low" "medium" "high" "critical" ];
            default = "medium";
            description = "Service criticality level";
          };
        };
      };
      default = {};
      description = "Orchestration configuration";
    };
  };

  config = {
    # Generate machine metadata file for lab tool consumption
    environment.etc."lab-machine-metadata.json".text = builtins.toJSON {
      inherit (cfg) role groups rebootOrder dependencies healthChecks orchestration;
      hostname = config.networking.hostName;
      services = builtins.attrNames (lib.filterAttrs (n: v: v.enable or false) config.services);
    };
  };
}
```
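
Because the module writes `/etc/lab-machine-metadata.json` on every machine, a deployed host can also be queried directly instead of evaluating its Nix configuration. A minimal sketch; the function name and SSH invocation are illustrative, not part of the modules below:

```scheme
(use-modules (ice-9 popen) (ice-9 textual-ports) (json))

;; Hypothetical: read the metadata file the lab-machine module generates
;; from a live machine over SSH.
(define (fetch-live-machine-metadata hostname)
  "Fetch and decode /etc/lab-machine-metadata.json from a running machine"
  (let* ((cmd (format #f "ssh root@~a cat /etc/lab-machine-metadata.json" hostname))
         (port (open-input-pipe cmd))
         (output (get-string-all port)))
    (close-pipe port)
    (if (string-null? output)
        #f
        (json-string->scm output))))
```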
#### 2. Simplified Lab Tool

```scheme
;; lab/nix-integration.scm - NixOS integration module
(define-module (lab nix-integration)
  #:use-module (ice-9 format)
  #:use-module (ice-9 popen)
  #:use-module (ice-9 textual-ports)  ; get-string-all
  #:use-module (json)
  #:export (get-machine-metadata-from-nix
            get-all-nix-machines
            get-reboot-sequence-from-nix
            build-nix-config
            evaluate-nix-expr))

(define (evaluate-nix-expr expr)
  "Evaluate a Nix expression and return the result"
  (let* ((cmd (format #f "nix eval --json --expr '~a'" expr))
         (port (open-input-pipe cmd))
         (output (get-string-all port)))
    (close-pipe port)
    (if (string-null? output)
        #f
        (json-string->scm output))))

(define (get-machine-metadata-from-nix machine-name)
  "Get machine metadata from NixOS configuration"
  (let* ((expr (format #f
                       "(import ./hosts/~a/configuration.nix {}).lab.machine // { hostname = \"~a\"; }"
                       machine-name machine-name))
         (metadata (evaluate-nix-expr expr)))
    metadata))

(define (get-all-nix-machines)
  "Get all machines by scanning hosts directory"
  (let* ((hosts-expr "(builtins.attrNames (builtins.readDir ./hosts))")
         (hosts (evaluate-nix-expr hosts-expr)))
    (if hosts hosts '())))

(define (get-reboot-sequence-from-nix)
  "Get reboot sequence from NixOS configurations"
  (let* ((machines (get-all-nix-machines))
         (machine-data (map (lambda (machine)
                              (cons machine (get-machine-metadata-from-nix machine)))
                            machines)))
    (sort machine-data
          (lambda (a b)
            (< (assoc-ref (cdr a) 'rebootOrder)
               (assoc-ref (cdr b) 'rebootOrder))))))
```
#### 3. Updated Machine Configurations

```nix
# hosts/sleeper-service/configuration.nix
{ config, lib, pkgs, ... }:
{
  imports = [
    ../../nix/modules/lab-machine.nix
    # ... other imports
  ];

  # Standard NixOS configuration
  services.nginx = {
    enable = true;
    # ... nginx config
  };

  services.postgresql = {
    enable = true;
    # ... postgresql config
  };

  # Lab orchestration metadata
  lab.machine = {
    role = "application-server";
    groups = [ "infrastructure" "backend" ];
    rebootOrder = 1;
    dependencies = [ ];
    healthChecks = [
      {
        type = "http";
        url = "http://localhost:80/health";
      }
      {
        type = "tcp";
        port = 5432;
      }
    ];
    orchestration = {
      deployStrategy = "rolling";
      rebootDelay = 0;
      criticalityLevel = "high";
    };
  };
}
```
#### 4. Lab Tool Integration

```scheme
;; Update main.scm to use NixOS integration
(use-modules ;; ...existing modules...
             (lab nix-integration))

(define (cmd-machines)
  "List all configured machines from NixOS"
  (log-info "Listing machines from NixOS configurations...")
  (let ((machines (get-all-nix-machines)))
    (format #t "Configured Machines (from NixOS):\n")
    (for-each (lambda (machine)
                (let ((metadata (get-machine-metadata-from-nix machine)))
                  (format #t "  ~a (~a) - ~a\n"
                          machine
                          (assoc-ref metadata 'role)
                          (string-join (assoc-ref metadata 'groups) ", "))))
              machines)))

(define (cmd-orchestrator-sequence)
  "Show the orchestrated reboot sequence"
  (log-info "Getting reboot sequence from NixOS configurations...")
  (let ((sequence (get-reboot-sequence-from-nix)))
    (format #t "Reboot Sequence:\n")
    (for-each (lambda (machine-data)
                (let ((machine (car machine-data))
                      (metadata (cdr machine-data)))
                  (format #t "  ~a. ~a (delay: ~a seconds)\n"
                          (assoc-ref metadata 'rebootOrder)
                          machine
                          ;; rebootDelay lives in the nested orchestration alist
                          (assoc-ref (assoc-ref metadata 'orchestration)
                                     'rebootDelay))))
              sequence)))
```
### Benefits of This Approach

#### 1. Leverage NixOS Strengths

- **Configuration Management**: NixOS handles all system configuration
- **Validation**: Nix language validates configuration at build time
- **Atomic Updates**: `nixos-rebuild` provides atomic system updates
- **Rollbacks**: Nix generations for automatic rollback
- **Reproducibility**: Identical configurations across environments

#### 2. Lab Tool Focus

- **Orchestration**: Coordinate updates across multiple machines
- **Sequencing**: Handle reboot ordering and dependencies
- **Monitoring**: Health checks and status reporting
- **Communication**: SSH coordination and logging

#### 3. Reduced Complexity

- **No Duplication**: Don't reimplement what NixOS provides
- **Native Integration**: Work with NixOS's natural patterns
- **Maintainability**: Less custom code to maintain
- **Ecosystem**: Leverage existing NixOS modules and community
### Migration Strategy

#### Phase 1: Add Lab Module to NixOS

1. Create `lab-machine.nix` module
2. Add to each machine configuration
3. Test metadata extraction

#### Phase 2: Update Lab Tool

1. Replace custom config with NixOS integration
2. Update commands to read from NixOS configs
3. Test orchestration with new metadata

#### Phase 3: Enhanced Features

1. Add more sophisticated orchestration
2. Integrate with NixOS deployment tools
3. Add monitoring and alerting
### Example: Simplified Orchestrator

```nix
# The orchestrator service becomes much simpler
systemd.services.lab-orchestrator = {
  script = ''
    # Update flake
    nix flake update

    # Get reboot sequence from NixOS configs
    SEQUENCE=$(lab get-reboot-sequence)

    # Deploy to all machines
    lab deploy-all

    # Execute reboot sequence
    for machine_delay in $SEQUENCE; do
      machine=$(echo $machine_delay | cut -d: -f1)
      delay=$(echo $machine_delay | cut -d: -f2)

      sleep $delay
      lab reboot $machine
    done
  '';
};
```
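
The script above assumes a `lab get-reboot-sequence` subcommand that prints one `machine:delay` pair per line; that printer doesn't exist in the tool yet. A minimal sketch built on `get-reboot-sequence-from-nix`:

```scheme
;; Hypothetical handler behind `lab get-reboot-sequence`: emits
;; "machine:delay" lines for the orchestrator's shell loop to consume.
(define (cmd-print-reboot-sequence)
  (for-each (lambda (machine-data)
              (let* ((metadata (cdr machine-data))
                     (orchestration (assoc-ref metadata 'orchestration))
                     (delay (if orchestration
                                (assoc-ref orchestration 'rebootDelay)
                                0)))
                (format #t "~a:~a\n" (car machine-data) delay)))
            (get-reboot-sequence-from-nix)))
```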
## Conclusion

By leveraging NixOS's existing configuration system instead of reinventing it, we get:

- **Less code to maintain**
- **Better integration with the Nix ecosystem**
- **Validation and type safety from Nix**
- **Standard NixOS patterns and practices**
- **Focus on actual orchestration needs**

The lab tool becomes a **coordination layer** rather than a configuration management system, which is exactly what you need for homelab orchestration.
354 research/simple-auto-update-plan.md Normal file

@@ -0,0 +1,354 @@
# Simple Lab Auto-Update Service Plan

## Overview

A simple automated update service for the homelab that runs nightly via a systemd timer, updates Nix flakes, rebuilds systems, and reboots machines. Designed for homelab environments where uptime is only critical during day hours.

## Current Lab Tool Analysis

Based on the existing lab tool structure, we need to integrate with:

- Command structure and CLI interface
- Machine inventory and management
- Configuration handling
- Logging and status reporting

## Simple Architecture

### Core Components

1. **Nix Service Module** - NixOS service definition for the auto-updater
2. **Lab Tool Integration** - New commands in the existing lab tool
3. **Timer Scheduling** - Simple nightly execution via systemd timer
4. **Update Script** - Core logic for update/reboot cycle

## Implementation Plan

### 1. Nix Service Module

Create a NixOS service that integrates with the lab tool:
```nix
# /home/geir/Home-lab/nix/modules/lab-auto-update.nix
{ config, lib, pkgs, ... }:

with lib;

let
  cfg = config.services.lab-auto-update;
  labTool = pkgs.writeShellScript "lab-auto-update" ''
    set -euo pipefail

    LOG_FILE="/var/log/lab-auto-update.log"

    echo "$(date): Starting auto-update" >> "$LOG_FILE"

    # Update flake and rebuild this machine
    lab update system --self 2>&1 | tee -a "$LOG_FILE"

    # Reboot if configured (boolToString renders the Nix bool as "true"/"false")
    if [[ "${boolToString cfg.autoReboot}" == "true" ]]; then
      echo "$(date): Rebooting system" >> "$LOG_FILE"
      systemctl reboot
    fi
  '';
in
{
  options.services.lab-auto-update = {
    enable = mkEnableOption "Lab auto-update service";

    schedule = mkOption {
      type = types.str;
      default = "02:00";
      description = "Time to run updates (HH:MM format)";
    };

    autoReboot = mkOption {
      type = types.bool;
      default = true;
      description = "Whether to automatically reboot after updates";
    };

    flakePath = mkOption {
      type = types.str;
      default = "/home/geir/Home-lab";
      description = "Path to the lab flake";
    };
  };

  config = mkIf cfg.enable {
    systemd.services.lab-auto-update = {
      description = "Lab Auto-Update Service";
      serviceConfig = {
        Type = "oneshot";
        User = "root";
        ExecStart = "${labTool}";
      };
    };

    systemd.timers.lab-auto-update = {
      description = "Lab Auto-Update Timer";
      timerConfig = {
        OnCalendar = "*-*-* ${cfg.schedule}:00";  # honor the schedule option
        Persistent = true;
        RandomizedDelaySec = "30m";
      };
      wantedBy = [ "timers.target" ];
    };

    # Ensure log directory exists
    systemd.tmpfiles.rules = [
      "d /var/log 0755 root root -"
    ];
  };
}
```

### 2. Lab Tool Commands

Add new commands to the existing lab tool:

```python
# lab/commands/update_system.py
import socket
import subprocess


class UpdateSystemCommand:
    def __init__(self, lab_config):
        self.lab_config = lab_config
        self.flake_path = lab_config.get('flake_path', '/home/geir/Home-lab')

    def update_self(self):
        """Update the current system from the lab flake."""
        try:
            # Update flake inputs
            self._run_command(['nix', 'flake', 'update'], cwd=self.flake_path)

            # Rebuild the system for this host
            hostname = self._get_hostname()
            self._run_command([
                'nixos-rebuild', 'switch',
                '--flake', f'{self.flake_path}#{hostname}'
            ])

            print("System updated successfully")
            return True

        except subprocess.CalledProcessError as e:
            print(f"Update failed: {e}")
            return False

    def schedule_reboot(self, delay_minutes=1):
        """Schedule a system reboot after the given delay."""
        self._run_command(['shutdown', '-r', f'+{delay_minutes}'])

    def _get_hostname(self):
        return socket.gethostname()

    def _run_command(self, cmd, cwd=None):
        result = subprocess.run(cmd, cwd=cwd, check=True,
                                capture_output=True, text=True)
        return result.stdout
```

### 3. CLI Integration

Extend the main lab tool CLI:

```python
# lab/cli.py (additions)
import click

from lab.commands.update_system import UpdateSystemCommand

# `cli` is the existing top-level click group and `config` the loaded
# lab configuration; both are already defined in lab/cli.py.


@cli.group()
def update():
    """System update commands"""
    pass


@update.command('system')
@click.option('--self', 'update_self', is_flag=True,
              help='Update the current system')
@click.option('--reboot', is_flag=True,
              help='Reboot after update')
def update_system(update_self, reboot):
    """Update the system using the Nix flake"""
    if update_self:
        updater = UpdateSystemCommand(config)
        success = updater.update_self()

        if success and reboot:
            updater.schedule_reboot()
```

### 4. Simple Configuration

Add update settings to lab configuration:

```yaml
# lab.yaml (additions)
auto_update:
  enabled: true
  schedule: "02:00"
  auto_reboot: true
  flake_path: "/home/geir/Home-lab"
  log_retention_days: 30
```

## Deployment Strategy

### Per-Machine Setup

Each machine gets the service enabled in its Nix configuration:

```nix
# hosts/<hostname>/configuration.nix
{
  imports = [
    ../../nix/modules/lab-auto-update.nix
  ];

  services.lab-auto-update = {
    enable = true;
    schedule = "02:00";
    autoReboot = true;
  };
}
```

### Staggered Scheduling

Different machines can have different update times to avoid all rebooting simultaneously:

```nix
# Example configurations
# db-server.nix
services.lab-auto-update.schedule = "02:00";

# web-servers.nix
services.lab-auto-update.schedule = "02:30";

# dev-machines.nix
services.lab-auto-update.schedule = "03:00";
```

## Implementation Steps

### Step 1: Create Nix Module

- Create the service module file
- Add to common imports
- Test on a single machine (see the verification sketch below)

### Step 2: Extend Lab Tool

- Add UpdateSystemCommand class
- Integrate CLI commands
- Test update functionality

### Step 3: Deploy Gradually

- Enable on non-critical machines first
- Monitor logs and behavior
- Roll out to all machines

### Step 4: Monitoring Setup

- Log rotation configuration
- Status reporting
- Alert on failures
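
A minimal sketch of how that single-machine test could be verified with standard systemd commands (the unit names come from the module above):

```bash
# Confirm the timer is registered and see when it fires next
systemctl list-timers lab-auto-update.timer

# Trigger one run manually instead of waiting for the schedule
sudo systemctl start lab-auto-update.service

# Follow the run and inspect the log the update script writes
journalctl -u lab-auto-update.service --no-pager
tail -n 50 /var/log/lab-auto-update.log
```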

## Safety Features

### Pre-Update Checks

```bash
# Basic health check before update
if ! systemctl is-system-running --quiet; then
    echo "System not healthy, skipping update"
    exit 1
fi

# Check disk space
if [[ $(df / | tail -1 | awk '{print $5}' | sed 's/%//') -gt 90 ]]; then
    echo "Low disk space, skipping update"
    exit 1
fi
```

### Rollback on Boot Failure

```nix
# Keep old generations in the boot menu so rollback is always possible
boot.loader.grub.configurationLimit = 10;

systemd.services."rollback-on-failure" = {
  description = "Rollback on boot failure";
  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
  };
  script = ''
    # This runs only if we boot successfully:
    # clear any failure flag left by a previous bad update
    rm -f /var/lib/update-failed
  '';
  wantedBy = [ "multi-user.target" ];
};
```
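
If a bad update does land despite these precautions, NixOS generations make manual recovery straightforward; these are standard `nixos-rebuild`/`nix-env` commands:

```bash
# Switch back to the previous system generation
sudo nixos-rebuild switch --rollback

# Or list generations first and judge how far back to go
sudo nix-env --list-generations --profile /nix/var/nix/profiles/system
```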

## Monitoring and Logging

### Log Management

```nix
# Add to service configuration
services.logrotate.settings.lab-auto-update = {
  files = "/var/log/lab-auto-update.log";
  rotate = 30;
  daily = true;
  compress = true;
  missingok = true;
  notifempty = true;
};
```

### Status Reporting

```python
# lab/commands/status.py additions
import os
import subprocess


def update_status():
    """Show auto-update status."""
    log_file = "/var/log/lab-auto-update.log"

    if os.path.exists(log_file):
        # Show the last few entries from the most recent update attempt
        with open(log_file, 'r') as f:
            lines = f.readlines()
        for line in lines[-10:]:
            print(line.strip())

    # Show timer status
    result = subprocess.run(['systemctl', 'status', 'lab-auto-update.timer'],
                            capture_output=True, text=True)
    print(result.stdout)
```

## Testing Plan

### Local Testing

1. Test lab tool commands manually
2. Test service creation and timer
3. Verify logging works
4. Test with dry-run options

### Gradual Rollout

1. Enable on development machine first
2. Monitor for one week
3. Enable on infrastructure machines
4. Finally enable on critical services

## Future Enhancements

### Simple Additions

- Email notifications on failure
- Webhook status reporting
- Update statistics tracking
- Configuration validation

### Advanced Features

- Update coordination between machines
- Dependency-aware scheduling
- Emergency update capabilities
- Integration with monitoring systems

## File Structure

```
/home/geir/Home-lab/
├── nix/modules/lab-auto-update.nix
├── lab/commands/update_system.py
├── lab/cli.py (modified)
└── scripts/
    ├── update-health-check.sh
    └── emergency-rollback.sh
```

This plan provides a simple, reliable auto-update system that leverages the existing lab tool infrastructure while keeping complexity minimal for a homelab environment.

research/staggered-update-system.md (new file, 402 lines):

# Staggered Machine Update and Reboot System Research

## Overview

Research into implementing an automated system for updating and rebooting all lab machines in a staggered fashion using Nix, cronjobs, and our existing lab tool infrastructure.

## Goals

- Minimize downtime by updating machines in waves
- Ensure system stability with gradual rollouts
- Leverage Nix's atomic updates and rollback capabilities
- Integrate with existing lab tool for orchestration
- Provide monitoring and failure recovery

## Architecture Components

### 1. Update Controller

- Central orchestrator running on management node
- Maintains machine groups and update schedules
- Coordinates staggered execution
- Monitors update progress and health

### 2. Machine Groups

```
Group 1: Non-critical services (dev environments, testing)
Group 2: Infrastructure services (monitoring, logging)
Group 3: Critical services (databases, core applications)
Group 4: Management nodes (controllers, orchestrators)
```

### 3. Nix Integration

- Use `nixos-rebuild switch` for atomic updates
- Leverage Nix generations for rollback capability
- Update channels/flakes before rebuilding
- Validate configuration before applying (see the sketch after this list)
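
A minimal sketch of that per-machine sequence using only standard Nix/NixOS commands; the flake path is illustrative:

```bash
#!/usr/bin/env bash
set -euo pipefail

FLAKE_DIR="/home/geir/Home-lab"  # illustrative flake location
HOST="$(hostname)"

# Update channels/flakes before rebuilding
cd "$FLAKE_DIR"
nix flake update

# Validate: build the new system closure without activating it
nixos-rebuild dry-build --flake ".#$HOST"

# Atomic update: switch to the new generation
nixos-rebuild switch --flake ".#$HOST"

# Rollback capability: the previous generation stays one command away
#   nixos-rebuild switch --rollback
```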

### 4. Lab Tool Integration

- Extend lab tool with update management commands
- Machine inventory and grouping
- Health check integration
- Status reporting and logging

## Implementation Strategy

### Phase 1: Basic Staggered Updates

```bash
# Example workflow per machine group
lab update prepare --group=dev
lab update execute --group=dev --wait-for-completion
lab update verify --group=dev
lab update prepare --group=infrastructure
# Continue with next group...
```

### Phase 2: Enhanced Orchestration

- Dependency-aware scheduling
- Health checks before proceeding to next group
- Automatic rollback on failures
- Notification system

### Phase 3: Advanced Features

- Blue-green deployments for critical services
- Canary releases
- Integration with monitoring systems
- Custom update policies per service

## Cronjob Design

### Master Cron Schedule

```cron
# Weekly full system update - Sundays at 2 AM
0 2 * * 0 /home/geir/Home-lab/scripts/staggered-update.sh

# Daily security updates for critical machines
0 3 * * * /home/geir/Home-lab/scripts/security-update.sh --group=critical

# Health check and cleanup
0 1 * * * /home/geir/Home-lab/scripts/update-health-check.sh
```

### Update Script Structure

```bash
#!/usr/bin/env bash
# staggered-update.sh

set -euo pipefail

# Configuration
GROUPS=("dev" "infrastructure" "critical" "management")
STAGGER_DELAY=30m
MAX_PARALLEL=3

# Log setup
LOG_DIR="/var/log/lab-updates"
LOG_FILE="$LOG_DIR/update-$(date +%Y%m%d-%H%M%S).log"

mkdir -p "$LOG_DIR"
exec > >(tee -a "$LOG_FILE") 2>&1

for group in "${GROUPS[@]}"; do
    echo "Starting update for group: $group"

    # Pre-update checks
    lab health-check --group="$group" || {
        echo "Health check failed for $group, skipping"
        continue
    }

    # Update Nix channels/flakes
    lab update prepare --group="$group"

    # Execute updates with parallelism control
    lab update execute --group="$group" --parallel="$MAX_PARALLEL"

    # Verify updates
    lab update verify --group="$group" || {
        echo "Verification failed for $group, initiating rollback"
        lab update rollback --group="$group"
        # Send alert
        lab notify --level=error --message="Update failed for $group, rolled back"
        exit 1
    }

    echo "Group $group updated successfully, waiting $STAGGER_DELAY"
    sleep "$STAGGER_DELAY"
done

echo "All groups updated successfully"
lab notify --level=info --message="Staggered update completed successfully"
```

## Nix Configuration Management

### Centralized Configuration

```nix
# /home/geir/Home-lab/nix/update-config.nix
{
  updateGroups = {
    dev = {
      machines = [ "dev-01" "dev-02" "test-env" ];
      updatePolicy = "aggressive";
      maintenanceWindow = "02:00-06:00";
      allowReboot = true;
    };

    infrastructure = {
      machines = [ "monitor-01" "log-server" "backup-01" ];
      updatePolicy = "conservative";
      maintenanceWindow = "03:00-05:00";
      allowReboot = true;
      dependencies = [ "dev" ];
    };

    critical = {
      machines = [ "db-primary" "web-01" "web-02" ];
      updatePolicy = "manual-approval";
      maintenanceWindow = "04:00-05:00";
      allowReboot = false; # Requires manual reboot
      dependencies = [ "infrastructure" ];
    };
  };

  updateSettings = {
    maxParallel = 3;
    healthCheckTimeout = 300;
    rollbackOnFailure = true;
    notificationChannels = [ "email" "discord" ];
  };
}
```

### Machine-Specific Update Configurations

```nix
# On each machine: /etc/nixos/update-config.nix
{
  services.lab-updater = {
    enable = true;
    group = "infrastructure";
    preUpdateScript = ''
      # Stop non-critical services
      systemctl stop some-service
    '';
    postUpdateScript = ''
      # Restart services and verify
      systemctl start some-service
      curl -f http://localhost:8080/health
    '';
    rollbackScript = ''
      # Custom rollback procedures
      systemctl stop some-service
      nixos-rebuild switch --rollback
      systemctl start some-service
    '';
  };
}
```

## Lab Tool Extensions

### New Commands

```bash
# Update management
lab update prepare [--group=GROUP] [--machine=MACHINE]
lab update execute [--group=GROUP] [--parallel=N] [--dry-run]
lab update verify [--group=GROUP]
lab update rollback [--group=GROUP] [--to-generation=N]
lab update status [--group=GROUP]

# Health and monitoring
lab health-check [--group=GROUP] [--timeout=SECONDS]
lab update-history [--group=GROUP] [--days=N]
lab notify [--level=LEVEL] [--message=MSG] [--channel=CHANNEL]

# Configuration
lab update-config show [--group=GROUP]
lab update-config set [--group=GROUP] [--key=KEY] [--value=VALUE]
```

### Integration Points

```python
# lab/commands/update.py
class UpdateCommand:
    def prepare(self, group=None, machine=None):
        """Prepare machines for updates"""
        # Update Nix channels/flakes
        # Pre-update health checks
        # Download packages

    def execute(self, group=None, parallel=1, dry_run=False):
        """Execute updates on machines"""
        # Run nixos-rebuild switch
        # Monitor progress
        # Handle failures

    def verify(self, group=None):
        """Verify updates completed successfully"""
        # Check system health
        # Verify services
        # Compare generations
```

## Monitoring and Alerting

### Health Checks

- Service availability checks
- Resource usage monitoring
- System log analysis
- Network connectivity tests (see the sketch after this list)
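
A hedged sketch of the checks a `lab health-check` run might perform on each machine; the HTTP health endpoint is a placeholder, not an existing service:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Service availability: fail if systemd reports a degraded system
systemctl is-system-running --quiet

# Resource usage: fail if the root filesystem is over 90% full
usage=$(df / | tail -1 | awk '{print $5}' | tr -d '%')
[[ "$usage" -le 90 ]]

# System log analysis: surface recent high-priority errors
journalctl -p err --since "-1h" --no-pager | tail -n 20

# Network connectivity: default gateway plus a placeholder health endpoint
ping -c 1 -W 2 "$(ip route | awk '/default/ {print $3; exit}')" > /dev/null
curl -fsS --max-time 5 http://localhost:8080/health > /dev/null  # placeholder URL
```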

### Alerting Triggers

- Update failures
- Health check failures
- Rollback events
- Long-running updates

### Notification Channels

- Email notifications
- Discord/Slack integration
- Dashboard updates
- Log aggregation

## Safety Mechanisms

### Pre-Update Validation

- Configuration syntax checking
- Dependency verification
- Resource availability checks
- Backup verification

### During Update

- Progress monitoring
- Timeout handling
- Partial failure recovery
- Emergency stop capability

### Post-Update

- Service verification
- Performance monitoring
- Automatic rollback triggers
- Success confirmation

## Rollback Strategy

### Automatic Rollback Triggers

- Health check failures
- Service startup failures
- Critical error detection
- Timeout exceeded

### Manual Rollback

```bash
# Quick rollback to previous generation
lab update rollback --group=critical --immediate

# Rollback to specific generation
lab update rollback --group=infrastructure --to-generation=150

# Selective rollback (specific machines)
lab update rollback --machine=db-primary,web-01
```

## Testing Strategy

### Development Environment

- Test updates in isolated environment
- Validate scripts and configurations
- Performance testing
- Failure scenario testing

### Staging Rollout

- Deploy to staging group first
- Automated testing suite
- Manual verification
- Production deployment

## Security Considerations

- Secure communication channels
- Authentication for update commands (see the sketch after this list)
- Audit logging
- Access control for update scripts
- Encrypted configuration storage
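
One way to cover the first two items together is an OpenSSH forced command, so the orchestrator's key on each machine can run only a single update entry point. A sketch; the wrapper path and key are placeholders:

```bash
# authorized_keys entry on each managed machine (one line).
# "restrict" disables forwarding/pty; "command=" pins the key to the wrapper.
command="/etc/lab/run-update.sh",restrict ssh-ed25519 AAAA...placeholder... lab-orchestrator
```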

## Future Enhancements

### Advanced Scheduling

- Maintenance window management
- Business hour awareness
- Holiday scheduling
- Emergency update capabilities

### Intelligence Features

- Machine learning for optimal timing
- Predictive failure detection
- Automatic dependency discovery
- Performance impact analysis

### Integration Expansions

- CI/CD pipeline integration
- Cloud provider APIs
- Container orchestration
- Configuration management systems

## Implementation Roadmap

### Phase 1 (2 weeks)

- Basic staggered update script
- Simple group management
- Nix integration
- Basic health checks

### Phase 2 (4 weeks)

- Lab tool integration
- Advanced scheduling
- Monitoring and alerting
- Rollback mechanisms

### Phase 3 (6 weeks)

- Advanced features
- Performance optimization
- Extended integrations
- Documentation and training

## Resources and References

- NixOS Manual: System Administration
- Cron Best Practices
- Blue-Green Deployment Patterns
- Infrastructure as Code Principles
- Monitoring and Observability Patterns

## Conclusion

A well-designed staggered update system will significantly improve lab maintenance efficiency while reducing risk. The combination of Nix's atomic updates, careful orchestration, and comprehensive monitoring provides a robust foundation for automated infrastructure management.