Home Lab Migration Plan

Current Assessment

CongenitalOptimist Machine

  • Current NixOS Version: 25.05
  • Hardware: AMD CPU/GPU, ZFS storage (zpool + stuffpool), NFS mounts
  • Desktop Environments: GNOME, Cosmic, Sway
  • Virtualization: libvirt, Incus, Podman
  • Configuration Style: Traditional NixOS (non-flakes)
  • Dotfiles Approach: Prefer Emacs org-mode with literate programming (no Home Manager)

Current Structure

Home-lab/
├── Machines/
│   ├── CongenitalOptimist/ (existing - AMD workstation)
│   │   ├── configuration.nix
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   └── Modules/ (existing modular structure)
│       ├── common/
│       │   ├── base.nix (modern CLI tools & aliases)
│       │   └── tty.nix (console styling)
│       └── virtualization/
│           ├── podman.nix
│           ├── libvirt.nix
│           └── incus.nix
└── Users/
    └── geir/
        └── user.nix (has typo: progtams → programs)

Target Structure (Post-Migration)

Home-lab/
├── flake.nix
├── flake.lock
├── machines/
│   ├── congenital-optimist/ (AMD workstation)
│   │   ├── default.nix
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   ├── sleeper-service/ (Intel Xeon E3-1230 V2 file server)
│   │   ├── default.nix
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   ├── reverse-proxy/ (edge/gateway server)
│   │   ├── default.nix
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   └── grey-area/ (application server)
│       ├── default.nix
│       ├── hardware-configuration.nix
│       └── About.org
├── modules/
│   ├── common/
│   ├── desktop/
│   ├── development/
│   ├── virtualization/
│   ├── services/
│   │   ├── nfs.nix
│   │   ├── samba.nix
│   │   ├── backup.nix
│   │   └── monitoring.nix
│   └── users/
│       └── common.nix (shared user configurations)
├── users/
│   └── geir/
│       ├── dotfiles/
│       │   ├── README.org (geir's literate config)
│       │   ├── emacs/
│       │   ├── shell/
│       │   └── editors/
│       └── user.nix (geir's system config)
├── overlays/
├── packages/
└── secrets/ (for future secrets management)

Deployment Status & Accomplishments

sleeper-service Deployment (COMPLETED)

Date: Recently completed
Status: Fully operational
Machine: Intel Xeon E3-1230 V2, 16GB RAM (formerly files.home)

Key Achievements:

  • Flake Migration: Successfully deployed NixOS flake configuration on remote machine
  • ZFS Stability: Resolved ZFS mounting conflicts causing boot failures
  • Data Preservation: All 903GB of media data intact and accessible
  • Network Integration: Added Pi-hole DNS (10.0.0.14) for package resolution
  • SSH Infrastructure: Implemented centralized SSH key management
  • Boot Performance: Clean boot in ~1 minute with ZFS auto-mounting enabled
  • Remote Deployment: Established rsync + SSH deployment workflow
  • NFS Server: Configured NFS exports for both local (10.0.0.0/24) and Tailscale (100.64.0.0/10) networks
  • Network Configuration: Updated to use Tailscale IPs for reliable mesh connectivity

Technical Solutions:

  • ZFS Native Mounting: Migrated from legacy mountpoints to ZFS native paths
  • Hardware Configuration: Removed conflicting ZFS filesystem entries
  • Graphics Compatibility: Added nomodeset kernel parameter, disabled NVIDIA drivers
  • DNS Configuration: Multi-tier DNS with Pi-hole primary, router and Google fallback
  • Deployment Method: Remote deployment via rsync + SSH instead of direct nixos-rebuild
  • NFS Exports: Resolved dataset conflicts by commenting out conflicting tmpfiles rules
  • Network Access: Added Tailscale interface (tailscale0) as trusted interface in firewall
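
A minimal sketch of what the export configuration described above could look like on sleeper-service (export options and firewall details are illustrative assumptions, not the deployed file):

# modules/services/nfs.nix (sketch)
{
  services.nfs.server = {
    enable = true;
    exports = ''
      /mnt/storage        10.0.0.0/24(rw,no_subtree_check) 100.64.0.0/10(rw,no_subtree_check)
      /mnt/storage/media  10.0.0.0/24(rw,no_subtree_check) 100.64.0.0/10(rw,no_subtree_check)
    '';
  };

  networking.firewall = {
    allowedTCPPorts = [ 111 2049 ];
    trustedInterfaces = [ "tailscale0" ];
  };
}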

Data Verified:

  • Storage Pool: 903GB used, 896GB available
  • Media Content: Films (184GB), Series (612GB), Audiobooks (94GB), Music (9.1GB), Books (3.5GB)
  • Mount Points: /mnt/storage and /mnt/storage/media with proper ZFS auto-mounting
  • NFS Access: Both datasets exported with proper permissions for network access

grey-area Deployment (COMPLETED) NEW

Date: June 2025
Status: Fully operational
Machine: Intel Xeon E5-2670 v3 (24 cores) @ 3.10 GHz, 31.24 GiB RAM

Key Achievements:

  • Flake Configuration: Successfully deployed NixOS flake-based configuration
  • NFS Client: Configured reliable NFS mount to sleeper-service media storage via Tailscale
  • Service Stack: Deployed comprehensive application server with multiple services
  • Network Integration: Integrated with centralized extraHosts module using Tailscale IPs
  • User Management: Resolved UID conflicts and implemented consistent user configuration
  • Firewall Configuration: Properly configured ports for all services

Services Deployed:

  • Jellyfin: Media server with access to NFS-mounted content from sleeper-service
  • Calibre-web: E-book management and reading interface
  • Forgejo: Git hosting server (git.geokkjer.eu) with reverse proxy integration
  • Audiobook Server: Audiobook streaming and management

Technical Implementation:

  • NFS Mount: /mnt/remote/media successfully mounting sleeper-service:/mnt/storage/media
  • Network Path: Using Tailscale mesh (100.x.x.x) for reliable connectivity
  • Mount Options: Configured with automount, soft mount, and appropriate timeouts
  • Firewall Ports: 22 (SSH), 3000 (Forgejo), 23231 (other services)
  • User Configuration: Fixed UID consistency with centralized sma user module
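
A sketch of the corresponding client side on grey-area (exact mount options are assumptions based on the bullets above):

# grey-area NFS client mount (sketch)
{
  fileSystems."/mnt/remote/media" = {
    device = "sleeper-service:/mnt/storage/media";
    fsType = "nfs";
    options = [
      "soft"                  # fail instead of hanging if the server is unreachable
      "x-systemd.automount"   # mount lazily on first access
      "x-systemd.idle-timeout=600"
    ];
  };

  # Service ports listed above: SSH, Forgejo, other services
  networking.firewall.allowedTCPPorts = [ 22 3000 23231 ];
}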

Data Access Verified:

  • Movies: 38 films accessible via NFS
  • TV Series: 29 series collections
  • Music: 9 music directories
  • Audiobooks: 79 audiobook collections
  • Books: E-book collection
  • Media Services: All content accessible through Jellyfin and other services

reverse-proxy Integration (COMPLETED) NEW

Date: June 2025
Status: Fully operational
Machine: External VPS (46.226.104.98)

Key Achievements:

  • Nginx Configuration: Successfully configured reverse proxy for Forgejo
  • Hostname Resolution: Fixed hostname mapping from incorrect "apps" to correct "grey-area"
  • SSL/TLS: Configured ACME Let's Encrypt certificate for git.geokkjer.eu
  • SSH Forwarding: Configured SSH proxy on port 1337 for Git operations
  • Network Security: Implemented DMZ-style security with Tailscale-only SSH access

Technical Configuration:

  • HTTP Proxy: git.geokkjer.eu → http://grey-area:3000 (Forgejo)
  • SSH Proxy: Port 1337 → grey-area:22 for Git SSH operations
  • Network Path: External traffic → reverse-proxy → Tailscale → grey-area
  • Security: SSH restricted to Tailscale network, fail2ban protection
  • DNS: Proper hostname resolution via extraHosts module
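
A condensed sketch of this reverse-proxy setup in NixOS terms (the ACME contact address is a placeholder assumption):

# reverse-proxy nginx configuration (sketch)
{
  security.acme = {
    acceptTerms = true;
    defaults.email = "admin@example.com";  # placeholder, not the real contact
  };

  services.nginx = {
    enable = true;
    virtualHosts."git.geokkjer.eu" = {
      enableACME = true;
      forceSSL = true;
      locations."/" = {
        proxyPass = "http://grey-area:3000";
        proxyWebsockets = true;
      };
    };

    # SSH forwarding for Git operations on port 1337
    streamConfig = ''
      server {
        listen 1337;
        proxy_pass grey-area:22;
      }
    '';
  };

  networking.firewall.allowedTCPPorts = [ 80 443 1337 ];
}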

Centralized Network Configuration (COMPLETED) NEW

Date: June 2025
Status: Fully operational

Key Achievements:

  • extraHosts Module: Created centralized hostname resolution using Tailscale IPs
  • Network Consistency: All machines use same IP mappings for reliable mesh connectivity
  • SSH Configuration: Updated IP addresses in ssh-keys.nix module
  • User Management: Resolved user configuration conflicts between modules

Network Topology:

  • Tailscale Mesh IPs:
    • 100.109.28.53 - congenital-optimist (workstation)
    • 100.81.15.84 - sleeper-service (NFS file server)
    • 100.119.86.92 - grey-area (application server)
    • 100.96.189.104 - reverse-proxy (external VPS)
    • 100.103.143.108 - pihole (DNS server)
    • 100.126.202.40 - wordpresserver (legacy)
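
The centralized resolution amounts to a small module along these lines (module path assumed; IPs taken from the list above):

# modules/network/extra-hosts.nix (sketch)
{
  networking.extraHosts = ''
    100.109.28.53   congenital-optimist
    100.81.15.84    sleeper-service
    100.119.86.92   grey-area
    100.96.189.104  reverse-proxy
    100.103.143.108 pihole
    100.126.202.40  wordpresserver
  '';
}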

Module Integration:

  • extraHosts: Added to all machine configurations for consistent hostname resolution
  • SSH Keys: Updated IP addresses (grey-area: 10.0.0.12, reverse-proxy: 46.226.104.98)
  • User Modules: Fixed conflicts between sma user definitions in different modules

Home Lab Deployment Tool (COMPLETED) NEW

Date: Recently completed
Status: Fully operational
Tool: lab command - Custom deployment management system

Key Achievements:

  • Custom Package Creation: Developed home-lab-tools.nix package with comprehensive deployment functionality
  • System Integration: Added lab command to system packages via modules/system/applications.nix
  • Conflict Resolution: Resolved shell alias conflict by renaming "lab" alias to "home-lab"
  • Multi-Machine Support: Deployment capabilities for sleeper-service, grey-area, and reverse-proxy
  • Status Monitoring: Infrastructure connectivity checking with color-coded output
  • Deployment Modes: Support for boot, test, and switch deployment modes

Technical Implementation:

  • Package Structure: Custom Nix package using writeShellScriptBin with proper dependencies
  • Color-Coded Logging: Blue info, green success, yellow warnings, red errors for clear output
  • SSH Infrastructure: Leverages existing SSH key management for secure remote deployment
  • Rsync Deployment: Efficient configuration syncing to target machines
  • Error Handling: Comprehensive error checking and validation throughout deployment process
  • Service Detection: Proper Tailscale service monitoring with tailscaled detection
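
A heavily simplified sketch of the writeShellScriptBin pattern behind the lab command (the real package includes more validation and logging; paths and exact flags here are assumptions):

# packages/home-lab-tools.nix (sketch)
{ pkgs }:

pkgs.writeShellScriptBin "lab" ''
  set -euo pipefail
  machines="sleeper-service grey-area reverse-proxy"

  case "''${1:-}" in
    status)
      for m in $machines; do
        if ${pkgs.openssh}/bin/ssh -o ConnectTimeout=5 "$m" true 2>/dev/null; then
          echo "$m: reachable"
        else
          echo "$m: unreachable"
        fi
      done
      ;;
    deploy)
      machine="$2"; mode="''${3:-boot}"
      ${pkgs.rsync}/bin/rsync -az ~/Home-lab/ "$machine":/tmp/home-lab/
      ${pkgs.openssh}/bin/ssh "$machine" \
        "sudo nixos-rebuild $mode --flake /tmp/home-lab#$machine"
      ;;
    *)
      echo "usage: lab status | lab deploy <machine> [boot|test|switch]"
      ;;
  esac
''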

Available Commands:

  • lab status: Check connectivity to all infrastructure machines
  • lab deploy <machine> [mode]: Deploy configuration to specific machine
    • Machines: sleeper-service, grey-area, reverse-proxy
    • Modes: boot (default), test (temporary), switch (permanent)
  • Help System: Built-in usage documentation and examples

Deployment Workflow:

  1. Configuration Sync: Uses rsync to transfer entire Home-lab directory to target machine
  2. Remote Execution: SSH into target machine and execute nixos-rebuild with flake
  3. Validation: Checks deployment success and provides clear feedback
  4. Status Verification: Can verify deployment results with status command

Infrastructure Status Integration:

  • Local Machine: Checks Tailscale service status on congenital-optimist
  • Remote Machines: SSH connectivity testing with timeout handling
  • Network Topology: Integrates with existing Tailscale mesh network
  • Service Monitoring: Foundation for future comprehensive monitoring system

Usage Examples:

lab status                          # Check all machine connectivity
lab deploy sleeper-service boot     # Deploy and set for next boot
lab deploy grey-area switch         # Deploy and activate immediately
lab deploy reverse-proxy test       # Deploy temporarily for testing

Technical Benefits:

  1. Centralized Deployment: Single command interface for all home lab machines
  2. Consistent Process: Standardized deployment workflow across infrastructure
  3. Error Prevention: Validation and safety checks prevent deployment failures
  4. Operational Visibility: Clear status reporting for infrastructure state
  5. Extensibility: Modular design allows easy addition of new machines and features
  6. Integration: Seamless integration with existing SSH and Tailscale infrastructure

Phase 1: Flakes Migration (Priority: High)

1.1 Create Flake Foundation

  • Create flake.nix at repository root
  • Define nixpkgs input pinned to NixOS 25.05
  • Add nixpkgs-unstable for bleeding edge packages
  • Structure outputs for multiple machines (no Home Manager)
  • Fix inconsistent naming convention (machine directories to lowercase)
  • Update flake outputs to use correct lowercase paths
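
A skeleton of what the resulting flake.nix might look like (module lists abbreviated; only two machines shown):

# flake.nix (sketch)
{
  description = "Home lab infrastructure";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.05";
    nixpkgs-unstable.url = "github:NixOS/nixpkgs/nixos-unstable";
  };

  outputs = { self, nixpkgs, nixpkgs-unstable, ... }: {
    nixosConfigurations = {
      congenital-optimist = nixpkgs.lib.nixosSystem {
        system = "x86_64-linux";
        modules = [ ./machines/congenital-optimist/default.nix ];
      };
      sleeper-service = nixpkgs.lib.nixosSystem {
        system = "x86_64-linux";
        modules = [ ./machines/sleeper-service/default.nix ];
      };
      # grey-area and reverse-proxy follow the same pattern
    };
  };
}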

1.2 Restructure Configuration

  • Convert configuration.nix to flake-compatible format
  • Keep system.stateVersion as "23.11" (maintains data compatibility)
  • Update existing module imports for flake structure
  • Integrate existing user configuration properly
  • Fix nerd-fonts syntax for 25.05 compatibility
  • Fix hostname typo (congenial-optimist → congenital-optimist)

1.3 Consolidate User Configuration

  • Fix typo in users/geir/user.nix (progtams → programs) - Already correct
  • Merge duplicate user packages between main config and user module
  • Decide on package location strategy (system vs user level)
  • Ensure all existing functionality is preserved

1.4 Configuration Testing & Validation

  • Validate flake syntax with nix flake check
  • Test build without switching: nixos-rebuild build --flake
  • Test configuration: nixos-rebuild test --flake
  • Successfully tested modularized configuration with virtualization

1.5 Desktop Environment Modularization NEW

  • Split monolithic environments.nix into modular components:
    • common.nix - Shared desktop configuration (XDG portal, dbus)
    • gnome.nix - GNOME desktop environment with extensions
    • cosmic.nix - System76 Cosmic desktop environment
    • sway.nix - Sway window manager with Wayland tools
  • Update main configuration to use modular desktop imports
  • Test modular desktop configuration successfully

1.6 Virtualization Stack NEW

  • Add comprehensive virtualization support:
    • Incus - Modern container and VM management (replaces LXD)
    • Libvirt/QEMU - Full KVM virtualization with virt-manager
    • Podman - Rootless containers with Docker compatibility
  • Configure proper user groups (incus-admin, libvirt, podman)
  • Enable UEFI/OVMF support for modern VM guests
  • Test all virtualization services running successfully
  • Create rollback plan and ZFS snapshots
  • Switch to flake configuration permanently

1.7 GitOps Foundation & CI/CD Setup NEW

  • Initialize git repository for infrastructure as code
  • Create comprehensive .gitignore for NixOS/Nix projects
  • Set up initial commit with current modular configuration
  • Plan CI/CD pipeline for configuration validation
  • Design branch strategy for infrastructure changes
  • Create templates for pull request workflows
  • Plan automated testing for configuration changes
  • Set up secrets management strategy for CI/CD
  • Document GitOps workflow for multi-machine deployments

1.8 Additional Migration Tasks

  • Update all documentation files to use consistent naming
  • Update flake descriptions and comments for clarity
  • Verify all module imports work correctly in new structure
  • Modularize congenital-optimist configuration into logical modules
  • Clean up any remaining references to old PascalCase paths
  • Test that existing aliases and CLI tools still work
  • Verify desktop environments (GNOME, Cosmic, Sway) all function
  • Test virtualization stack (podman, libvirt, incus) functionality
  • Validate ZFS and storage configuration compatibility
  • Generate and commit flake.lock file
  • Create backup of current working configuration before final switch

Phase 2: Configuration Cleanup & Organization

2.1 Optimize Current Modular Structure

  • Review and optimize existing common/base.nix tools
  • Enhance common/tty.nix console configuration
  • Validate virtualization modules are complete
  • Create desktop environment modules (separate GNOME, Cosmic, Sway)
  • Separate development tools into dedicated module

2.2 Target Directory Structure

Home-lab/
├── flake.nix
├── flake.lock
├── machines/
│   ├── congenital-optimist/ (AMD workstation)
│   │   ├── default.nix (main machine config)
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   ├── sleeper-service/ (Intel Xeon file server)
│   │   ├── default.nix (file server config)
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   ├── reverse-proxy/ (edge/gateway server)
│   │   ├── default.nix
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   └── grey-area/ (application server)
│       ├── default.nix
│       ├── hardware-configuration.nix
│       └── About.org
├── modules/
│   ├── common/
│   │   ├── base.nix (existing modern CLI tools)
│   │   ├── tty.nix (existing console config)
│   │   └── nix.nix (flakes + experimental features)
│   ├── desktop/
│   │   ├── gnome.nix
│   │   ├── cosmic.nix
│   │   └── sway.nix
│   ├── development/
│   │   ├── editors.nix (emacs, neovim, vscode, etc.)
│   │   ├── languages.nix (rust, python, LSPs)
│   │   └── tools.nix
│   ├── virtualization/ (existing)
│   │   ├── podman.nix
│   │   ├── libvirt.nix
│   │   └── incus.nix
│   ├── services/ (for SleeperService + grey-area)
│   │   ├── nfs.nix (network file sharing)
│   │   ├── samba.nix (windows compatibility)
│   │   ├── backup.nix (automated backups)
│   │   ├── monitoring.nix (system monitoring)
│   │   ├── storage.nix (ZFS/RAID management)
│   │   ├── reverse-proxy.nix (nginx/traefik configuration)
│   │   ├── forgejo.nix (git hosting and CI/CD)
│   │   ├── media.nix (jellyfin configuration)
│   │   └── applications.nix (containerized services)
│   └── users/
│       └── common.nix (shared user configurations)
├── users/
│   └── geir/
│       ├── dotfiles/
│       │   ├── README.org (main literate config)
│       │   ├── emacs/
│       │   ├── shell/
│       │   └── editors/
│       └── user.nix (consolidated user config)
├── overlays/
├── packages/
└── secrets/ (for future secrets management)

2.3 Network Infrastructure Updates

  • Network topology discovery: Used nmap to map actual network layout
    • Network Range: 10.0.0.0/24 (not 192.168.1.x as initially assumed)
    • Gateway: 10.0.0.138 (lan.home - router/firewall)
    • DNS Server: 10.0.0.14 (pi.hole - Pi-hole ad-blocker)
    • Current File Server: 10.0.0.8 (files.home - will be renamed to sleeper-service)
    • Machine Migration: sleeper-service is the existing files.home machine, not a new deployment
  • sleeper-service systemd-networkd migration: COMPLETED and DEPLOYED
    • Hostname transition: Successfully renamed from files.home to sleeper-service
    • Static IP preserved: Maintained 10.0.0.8/24 with gateway 10.0.0.138
    • DNS integration: Pi-hole primary (10.0.0.14), router fallback (10.0.0.138), Google DNS (8.8.8.8)
    • Network stack: networking.useNetworkd = true with networking.useDHCP = false
    • Interface configuration: enp0s25 configured with declarative static IPv4
    • Service ports: File server ports configured (NFS: 111,2049; SMB: 139,445; NetBIOS: 137,138)
    • Production validation: Network configuration tested and operational
  • Network standardization: Plan consistent systemd-networkd configuration across all server-role machines; the workstation and laptop can continue to use NetworkManager
  • IP address allocation: Document static IP assignments for each service
    • Local Network (10.0.0.0/24):
      • 10.0.0.2: arlaptop.home (existing laptop)
      • 10.0.0.3: congenital-optimist (AMD workstation - current machine)
      • 10.0.0.8: sleeper-service (Intel Xeon file server - rename from files.home)
      • 10.0.0.11: grey-area (planned application server)
      • 10.0.0.12: reverse-proxy (planned edge server)
      • 10.0.0.14: pi.hole (Pi-hole DNS/ad-blocker); consider migrating to NixOS
      • 10.0.0.90: wordpresserver.home (existing WordPress server, Incus container); to be deleted
      • 10.0.0.117: webdev.home (existing web development server, Incus container); to be deleted
      • 10.0.0.138: lan.home (router/gateway/DHCP)
    • Tailscale Network (100.x.x.x/10):
      • 100.109.28.53: congenital-optimist (current machine)
      • 100.119.86.92: apps (active server; rename to grey-area)
      • 100.114.185.71: arlaptop (laptop; Arch Linux, planned migration to NixOS)
      • 100.81.15.84: files (file server; rename to sleeper-service)
      • 100.103.143.108: pihole (DNS server)
      • 100.96.189.104: vps1 (external VPS; rename to reverse-proxy)
      • 100.126.202.40: wordpresserver (WordPress; to be deleted)
      • Reminder: update these hostnames in the Tailscale admin console, or find a way to script this via the CLI
  • VLAN planning: Consider network segmentation for different service types
  • DNS configuration: Plan local DNS resolution for internal services
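
For reference, the sleeper-service static addressing described in this section reduces to roughly the following (a sketch, not the deployed file):

# sleeper-service network configuration (sketch)
{
  networking = {
    hostName = "sleeper-service";
    useDHCP = false;
    useNetworkd = true;

    interfaces.enp0s25.ipv4.addresses = [
      { address = "10.0.0.8"; prefixLength = 24; }
    ];
    defaultGateway = "10.0.0.138";
    nameservers = [ "10.0.0.14" "10.0.0.138" "8.8.8.8" ];
  };
}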

Phase 3: System Upgrade & Validation

3.1 Pre-upgrade Preparation

  • Backup current system configuration
  • Document current package versions
  • Create ZFS snapshots of all datasets
  • Test flake build without switching
  • Verify all existing modules work in flake context

3.2 Upgrade Execution

  • Switch to flake-based configuration
  • Upgrade to NixOS 25.05
  • Validate all services start correctly
  • Test desktop environments functionality
  • Verify virtualization stack
  • Check user environment and packages

3.3 Post-upgrade Validation

  • Verify all applications launch
  • Test development tools (editors, LSPs, compilers)
  • Validate container and VM functionality
  • Check ZFS and NFS mount operations
  • Verify shell environment and modern CLI tools work
  • Test console theming and TTY setup

Phase 4: Dotfiles & Configuration Management

4.1 GNU Stow Infrastructure for Regular Dotfiles DECIDED

Approach: Use GNU Stow for traditional dotfiles, literate programming for Emacs only

GNU Stow Setup

  • Create ~/dotfiles/ directory structure with package-based organization
  • Set up core packages: zsh/, git/, tmux/, starship/, etc.
  • Configure selective deployment per machine (workstation vs servers)
  • Create stow deployment scripts for different machine profiles
  • Document stow workflow and package management

Package Structure

~/dotfiles/              # Stow directory (target: $HOME)
├── zsh/                # Shell configuration
│   ├── .zshrc
│   ├── .zshenv
│   └── .config/zsh/
├── git/                # Git configuration
│   ├── .gitconfig
│   └── .config/git/
├── starship/           # Prompt configuration
│   └── .config/starship.toml
├── tmux/               # Terminal multiplexer
│   └── .tmux.conf
├── emacs/              # Basic Emacs bootstrap (points to literate config)
│   └── .emacs.d/early-init.el
└── machine-specific/   # Per-machine configurations
    ├── workstation/
    └── server/

4.2 Literate Programming for Emacs Configuration DECIDED

Approach: Comprehensive org-mode literate configuration for Emacs only

Emacs Literate Setup

  • Create ~/dotfiles/emacs/.emacs.d/configuration.org as master config
  • Set up automatic tangling on save (org-babel-tangle-on-save)
  • Modular org sections: packages, themes, keybindings, workflows
  • Bootstrap early-init.el to load tangled configuration
  • Create machine-specific customizations within org structure

Literate Configuration Structure

~/dotfiles/emacs/.emacs.d/
├── early-init.el       # Bootstrap (deployed via Stow)
├── configuration.org   # Master literate config
├── init.el            # Tangled from configuration.org
├── modules/           # Tangled module files
│   ├── base.el
│   ├── development.el
│   ├── org-mode.el
│   └── ui.el
└── machine-config/    # Machine-specific overrides
    ├── workstation.el
    └── server.el

4.3 Integration Strategy

  • System-level: NixOS modules provide system packages and environment
  • User-level: GNU Stow manages dotfiles and application configurations
  • Emacs-specific: Org-mode literate programming for comprehensive Emacs setup
  • Per-machine: Selective stow packages + machine-specific customizations
  • Version control: Git repository for dotfiles with separate org documentation

4.4 Deployment Workflow

  • Create deployment scripts for different machine types:
    • Workstation: Full package deployment (zsh, git, tmux, starship, emacs)
    • Server: Minimal package deployment (zsh, git, basic emacs)
    • Development: Additional packages (language-specific tools, IDE configs)
  • Integration with existing lab deployment tool
  • Documentation for new user onboarding across machines
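
One way to tie these profiles into the existing Nix tooling would be a thin wrapper around stow, sketched below (the command name and per-profile package lists are assumptions):

# dotfiles deployment wrapper (sketch)
{ pkgs }:

pkgs.writeShellScriptBin "dotfiles-deploy" ''
  set -euo pipefail
  profile="''${1:-workstation}"   # workstation | server | development
  cd ~/dotfiles

  case "$profile" in
    workstation) packages="zsh git tmux starship emacs" ;;
    server)      packages="zsh git emacs" ;;
    development) packages="zsh git tmux starship emacs machine-specific" ;;
    *) echo "unknown profile: $profile" >&2; exit 1 ;;
  esac

  for pkg in $packages; do
    ${pkgs.stow}/bin/stow --restow --target="$HOME" "$pkg"
  done
''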

Phase 5: Home Lab Expansion Planning

5.1 Infrastructure Additions

Naming Convention

  • Machine Names: Culture ship names in PascalCase (e.g., CongenitalOptimist, SleeperService)

  • Folder Names: lowercase-with-hyphens (e.g., congenital-optimist/, sleeper-service/)

  • Flake Outputs: lowercase-with-hyphens (e.g., nixosConfigurations.congenital-optimist)

  • Hostnames: lowercase-with-hyphens (e.g., congenital-optimist, sleeper-service)

  • User Names: Culture character names in lowercase (e.g., sma, geir)

  • SleeperService file server (Intel Xeon E3-1230 V2, 16GB RAM): COMPLETED

    • NFS server for network storage (903GB ZFS pool operational)
    • ZFS storage with native mounting configuration
    • Flake-based NixOS deployment successful
    • SSH key management implemented
    • Network configuration with Pi-hole DNS integration
    • System boots cleanly in ~1 minute with ZFS auto-mounting
    • Data preservation verified (Films: 184GB, Series: 612GB, etc.)
    • NFS exports configured for both local and Tailscale networks
    • Resolved dataset conflicts and tmpfiles rule conflicts
    • Automated backup services (future enhancement)
    • System monitoring and alerting (future enhancement)
  • reverse-proxy edge server: COMPLETED

    • Nginx reverse proxy with proper hostname mapping (grey-area vs apps)
    • SSL/TLS termination with Let's Encrypt for git.geokkjer.eu
    • External access gateway with DMZ security configuration
    • SSH forwarding on port 1337 for Git operations
    • Fail2ban protection and Tailscale-only SSH access
    • Minimal attack surface, headless operation
  • grey-area application server (Culture GCU - versatile, multi-purpose): COMPLETED

    • Primary: Forgejo Git hosting (git.geokkjer.eu) with reverse proxy integration
    • Secondary: Jellyfin media server with NFS-mounted content
    • Additional: Calibre-web e-book server and audiobook streaming
    • Infrastructure: Container-focused (Podman), NFS client for media storage
    • Integration: Central Git hosting accessible externally via reverse proxy
    • Network: Integrated with Tailscale mesh and centralized hostname resolution
    • User Management: Resolved UID conflicts with centralized sma user configuration
    • Monitoring: TBD (future enhancement)
    • PostgreSQL: Plan database services for applications requiring persistent storage
  • Plan for additional users across machines:

    • geir - Primary user (development, desktop, daily use)
    • sma - Admin user (Diziet Sma, system administration, security oversight)
    • Service accounts for automation (forgejo-admin, backup-agent)
    • Guest accounts for temporary access
    • Culture character naming convention established
  • Network infrastructure planning: Started with sleeper-service systemd-networkd migration

  • Consider hardware requirements for future expansion

5.2 Services Architecture

  • Centralized configuration management
  • Per-user secrets management (agenix/sops-nix)
  • User-specific service configurations
  • Monitoring and logging (Prometheus, Grafana)
  • Backup strategy across machines and users
  • Container orchestration planning

5.3 Security & Networking

  • systemd-networkd migration: Completed for sleeper-service with static IP configuration
  • SSH key management centralization: IMPLEMENTED and DEPLOYED
    • Admin key (geir@geokkjer.eu-admin): For sma user, server administration access
    • Development key (geir@geokkjer.eu-dev): For geir user, git services, daily development
    • NixOS module: modules/security/ssh-keys.nix centralizes key management
    • SSH client config: Updated with role-based host patterns and key selection
    • Production deployment: Successfully deployed on sleeper-service
    • Security benefits: Principle of least privilege, limited blast radius if compromised
    • Usage examples:
      • ssh geir@sleeper-service.home - Uses dev key automatically
      • ssh admin-sleeper - Uses admin key for sma user access
      • git clone git@github.com:user/repo - Uses dev key for git operations
  • VPN configuration (Tailscale expansion)
  • Firewall rules standardization across machines
  • Certificate management (Let's Encrypt)
  • Network segmentation planning (VLANs for services vs. user devices)
  • DNS infrastructure (local DNS server for service discovery)
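
A sketch of how the role-based key selection above can be expressed as system-wide SSH client configuration (key file names and host patterns are assumptions):

# excerpt in the spirit of modules/security/ssh-keys.nix (sketch)
{
  programs.ssh.extraConfig = ''
    # Daily development access as geir
    Host sleeper-service grey-area *.home
      User geir
      IdentityFile ~/.ssh/id_ed25519_dev

    # Administrative access as sma
    Host admin-sleeper
      HostName sleeper-service
    Host admin-*
      User sma
      IdentityFile ~/.ssh/id_ed25519_admin
  '';
}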

Phase 6: Advanced Features

6.1 Development Workflow

  • Devshells for different projects
  • Cachix setup for faster builds
  • CI/CD integration
  • Literate dotfiles with org-mode tangling automation

6.2 Automation & Maintenance

  • AI integration: development of an MCP server for the cluster
  • Automated system updates
  • Configuration validation tests
  • Deployment automation
  • Monitoring and alerting

6.3 Advanced Deployment Strategies RESEARCH COMPLETED

Deploy-rs Migration (Priority: High) 📋 RESEARCHED

  • Research deploy-rs capabilities COMPLETED

    • Rust-based deployment tool specifically designed for NixOS flakes
    • Features: parallel deployment, automatic rollback, health checks, SSH-based
    • Advanced capabilities: atomic deployments, magic rollback on failure
    • Profile management: system, user, and custom profiles support
    • Integration potential: Works with existing SSH keys and Tailscale network
  • Migration Planning: Transition from custom lab script to deploy-rs

    • Create deploy-rs configuration in flake.nix for all 4 machines
    • Configure nodes: sleeper-service, grey-area, reverse-proxy, congenital-optimist
    • Set up health checks for critical services (NFS, Forgejo, Jellyfin, nginx)
    • Test parallel deployment capabilities across infrastructure
    • Implement automatic rollback for failed deployments
    • Document migration benefits and new deployment workflow

Deploy-rs Configuration Structure

# flake.nix additions
deploy.nodes = {
  sleeper-service = {
    hostname = "100.81.15.84";  # Tailscale IP
    profiles.system.path = deploy-rs.lib.x86_64-linux.activate.nixos 
      self.nixosConfigurations.sleeper-service;
    profiles.system.user = "root";
  };
  grey-area = {
    hostname = "100.119.86.92";
    profiles.system.path = deploy-rs.lib.x86_64-linux.activate.nixos
      self.nixosConfigurations.grey-area;
    # Health checks for Forgejo, Jellyfin services
  };
  reverse-proxy = {
    hostname = "100.96.189.104";
    profiles.system.path = deploy-rs.lib.x86_64-linux.activate.nixos
      self.nixosConfigurations.reverse-proxy;
    # Health checks for nginx, SSL certificates
  };
};
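
Alongside the node definitions, a deploy-rs setup normally also declares the flake input and the standard deployment checks; a sketch:

# flake.nix additions (continued, sketch)
inputs.deploy-rs.url = "github:serokell/deploy-rs";

# Expose deploy-rs sanity checks as flake checks
checks = builtins.mapAttrs
  (system: deployLib: deployLib.deployChecks self.deploy)
  deploy-rs.lib;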

Migration Benefits

  • Atomic deployments: Complete success or automatic rollback
  • Parallel deployment: Deploy to multiple machines simultaneously
  • Health checks: Validate services after deployment
  • Connection resilience: Better handling of SSH/network issues
  • Flake-native: Designed specifically for NixOS flake workflows
  • Safety: Magic rollback prevents broken deployments

Alternative: Guile Scheme Exploration (Priority: Low)

  • Research Guile Scheme for system administration
    • Evaluate functional deployment scripting patterns
    • Compare with current shell script and deploy-rs approaches
    • Consider integration with GNU Guix deployment patterns
    • Assess learning curve vs. practical benefits for home lab use case

6.4 Writeup

  • Take all the knowledge we have amassed and make a blog post or a series of blog posts

Phase 7: Going Pro

  • A plan to generalise this project so it is usable by other people
  • A plan to build a dashboard and web interface for the project

Timeline Estimates

  • Phase 1: 1-2 weeks (flakes migration)
  • Phase 2: 1 week (cleanup and organization)
  • Phase 3: 2-3 days (upgrade and validation)
  • Phase 4: 1 week (literate dotfiles setup)
  • Phase 5: 2-4 weeks (expansion planning and implementation)
  • Phase 6: Ongoing (advanced features as needed)

Risk Mitigation

Critical Risks

  1. Boot failure after upgrade: ZFS snapshots for quick rollback
  2. Desktop environment issues: Keep multiple DEs as fallback
  3. Virtualization breakage: Document current VM configurations
  4. Data loss: Multiple backup layers (ZFS, external)
  5. User environment regression: Backup existing dotfiles

Rollback Strategy

  • ZFS snapshot rollback capability
  • Keep old configuration.nix as reference
  • Maintain emergency boot media
  • Document manual recovery procedures
  • Preserve current user configuration during migration

Current Status Overview (Updated December 2024)

Infrastructure Deployment Status MAJOR MILESTONE ACHIEVED

PHASE 1: Flakes Migration - COMPLETED
PHASE 2: Configuration Cleanup - COMPLETED
PHASE 3: System Upgrade & Validation - COMPLETED
PHASE 5: Home Lab Expansion - 4/4 MACHINES FULLY OPERATIONAL 🎉

Machine Status

  • congenital-optimist: Development workstation (fully operational)
  • sleeper-service: NFS file server with 903GB media library (fully operational)
  • grey-area: Application server with Forgejo, Jellyfin, Calibre-web, audiobook server (fully operational)
  • reverse-proxy: External gateway with nginx, SSL termination, SSH forwarding (fully operational)

Network Architecture Status

  • Tailscale Mesh: All machines connected via secure mesh network (100.x.x.x addresses)
  • Hostname Resolution: Centralized extraHosts module deployed across all machines
  • NFS Storage: Reliable media storage access via Tailscale network (sleeper-service → grey-area)
  • External Access: Public services accessible via git.geokkjer.eu with SSL
  • SSH Infrastructure: Centralized key management with role-based access patterns
  • Firewall Configuration: Service ports properly configured across all machines

Services Status - FULLY OPERATIONAL STACK 🚀

  • Git Hosting: Forgejo operational at git.geokkjer.eu with SSH access on port 1337
  • Media Streaming: Jellyfin with NFS-mounted content library (38 movies, 29 TV series)
  • E-book Management: Calibre-web for book collections
  • Audiobook Streaming: Audiobook server with 79 audiobook collections
  • File Storage: NFS server with 903GB media library accessible across network
  • Web Gateway: Nginx reverse proxy with Let's Encrypt SSL and proper hostname mapping
  • User Management: Consistent UID/GID configuration across machines (sma user: 1001/992)

Infrastructure Achievements - COMPREHENSIVE DEPLOYMENT

  • NFS Mount Resolution: Fixed grey-area NFS mount path (/mnt/storage → /mnt/storage/media) for dataset access
  • Network Exports: Updated sleeper-service NFS exports for Tailscale network (100.64.0.0/10)
  • Service Discovery: Corrected reverse-proxy hostname mapping from "apps" to "grey-area"
  • Firewall Management: Added port 3000 for Forgejo service accessibility
  • SSH Forwarding: Configured SSH proxy on port 1337 for Git operations
  • SSL Termination: Let's Encrypt certificates working for git.geokkjer.eu
  • Data Verification: All media content accessible (movies, TV, music, audiobooks, books)
  • Deployment Tools: Custom lab command operational for infrastructure management

Current Operational Status

🟢 ALL CORE INFRASTRUCTURE DEPLOYED AND OPERATIONAL

  • 4/4 machines deployed with full service stack
  • External access verified: curl -I https://git.geokkjer.eu returns HTTP/2 200
  • NFS connectivity confirmed: Media files accessible across network via Tailscale
  • Service integration complete: Forgejo, Jellyfin, Calibre-web, audiobook server running
  • Network mesh stable: All machines connected via Tailscale with centralized hostname resolution

Next Phase Priorities

  • PHASE 4: GNU Stow + Literate Emacs Setup
    • Set up GNU Stow infrastructure for regular dotfiles (zsh, git, tmux, starship)
    • Create comprehensive Emacs literate configuration with org-mode
    • Implement selective deployment per machine type (workstation vs server)
    • Integration with existing NixOS system-level configuration
  • PHASE 6: Advanced Features & Deploy-rs Migration
    • Migrate from custom lab script to deploy-rs for improved deployment
    • Implement system monitoring and alerting infrastructure
    • Set up automated backup services for critical data
    • Create health checks and deployment validation
  • Documentation & Knowledge Sharing
    • Comprehensive blog post series documenting the full home lab journey
    • User guides for GNU Stow + literate Emacs configuration workflow
    • Deploy-rs migration guide and lessons learned
  • Future Enhancements
    • User ID consistency cleanup (sma user UID alignment across machines)
    • CI/CD integration with Forgejo for automated testing and deployment

Success Criteria

Core Infrastructure FULLY ACHIEVED 🎉

  • System boots reliably with flake configuration
  • All current functionality preserved
  • NixOS 25.05 running stable across all machines
  • Configuration is modular and maintainable
  • User environment fully functional with all packages
  • Modern CLI tools and aliases working
  • Console theming preserved
  • Virtualization stack operational
  • Multi-machine expansion completed (4/4 machines deployed)
  • Development workflow improved with Git hosting

Service Architecture FULLY ACHIEVED 🚀

  • NFS file server operational with reliable network access via Tailscale
  • Git hosting with external access via reverse proxy (git.geokkjer.eu)
  • Media services with shared storage backend (Jellyfin + 903GB library)
  • E-book and audiobook management services operational
  • Secure external access with SSL termination and SSH forwarding
  • Network mesh connectivity with centralized hostname resolution
  • All services verified operational and accessible externally

Network Integration FULLY ACHIEVED 🌐

  • Tailscale mesh network connecting all infrastructure machines
  • Centralized hostname resolution via extraHosts module
  • NFS file sharing working reliably over network
  • SSH key management with role-based access patterns
  • Firewall configuration properly securing all services
  • External domain (git.geokkjer.eu) with SSL certificates working

Outstanding Enhancement Goals 🔄

  • Literate dotfiles workflow established with org-mode
  • Documentation complete for future reference and blog writeup
  • System monitoring and alerting infrastructure (Prometheus/Grafana)
  • Automated deployment and maintenance improvements
  • Automated backup services for critical data
  • User ID consistency cleanup across machines

Infrastructure Notes

CongenitalOptimist (AMD Workstation)

  • Already has excellent modular structure
  • Modern CLI tools (eza, bat, ripgrep, etc.) already configured in base.nix
  • Console theming with Joker palette already implemented
  • User configuration needs cleanup (fix typo, consolidate packages)
  • ZFS configuration is solid and shouldn't need changes
  • Keep Tailscale configuration as network foundation
  • The AMD GPU setup should carry over cleanly to 25.05
  • Consider renaming hostname from "work" to "congenital-optimist" for consistency

SleeperService (Intel Xeon File Server)

  • Intel Xeon E3-1230 V2 @ 3.70GHz (4 cores, 8 threads)
  • 16GB RAM - adequate for file server operations
  • Perfect for reliable, background file serving tasks
  • Culture name fits: "massive GSV with reputation for taking unusual tasks"
  • Will handle NFS mounts currently served by external "files" server
  • Plan for ZFS or software RAID for data redundancy
  • Headless operation - no desktop environments needed
  • SSH-only access with robust monitoring

reverse-proxy (Edge Server)

  • Lightweight hardware requirements (can be modest specs)
  • Primary role: SSL/TLS termination and traffic routing
  • External-facing server with minimal attack surface
  • Nginx or Traefik for reverse proxy functionality
  • Let's Encrypt integration for automated certificate management
  • Fail2ban and security hardening
  • Routes traffic to internal services (grey-area, sleeper-service)

grey-area (Application Server - Culture GCU)

  • Hardware: Intel Xeon E5-2670 v3 (24 cores) @ 3.10 GHz, 31.24 GiB RAM
  • Primary Mission: Forgejo Git hosting and project management
  • Performance: Excellent specs for heavy containerized workloads and CI/CD
  • Container-focused architecture using Podman
  • PostgreSQL database for Forgejo
  • Concurrent multi-service deployment capability
  • Secondary services: Jellyfin (with transcoding), Nextcloud, Grafana
  • Integration hub for all home lab development projects
  • Culture name fits: "versatile ship handling varied, ambiguous tasks"
  • Central point for CI/CD pipelines and automation

Home Lab Philosophy

  • Emacs org-mode literate programming approach provides better control than Home Manager
  • Culture ship names create memorable, characterful infrastructure
  • Modular NixOS configuration allows easy machine additions
  • Per-user dotfiles structure scales across multiple machines
  • Tailscale provides secure network foundation for multi-machine setup

Recent Critical Issue Resolution (December 2024) 🔧

NFS Mount and Service Integration Issues - RESOLVED

  1. NFS Dataset Structure Resolution:

    • Problem: grey-area couldn't access media files via NFS mount
    • Root Cause: ZFS dataset structure confusion - mounting /mnt/storage vs /mnt/storage/media
    • Solution: Updated grey-area NFS mount from sleeper-service:/mnt/storage to sleeper-service:/mnt/storage/media
    • Result: All media content now accessible (38 movies, 29 TV series, 9 music albums, 79 audiobooks)
  2. NFS Network Export Configuration:

    • Problem: NFS exports only configured for local network (10.0.0.0/24)
    • Root Cause: Missing Tailscale network access in NFS exports
    • Solution: Updated sleeper-service NFS exports to include Tailscale network (100.64.0.0/10)
    • Result: Reliable NFS connectivity over Tailscale mesh network
  3. Conflicting tmpfiles Rules:

    • Problem: systemd tmpfiles creating conflicting directory structures for NFS exports
    • Root Cause: tmpfiles.d rules interfering with ZFS dataset mounting
    • Solution: Commented out conflicting tmpfiles rules in sleeper-service configuration
    • Result: Clean NFS export structure without mounting conflicts
  4. Forgejo Service Accessibility:

    • Problem: git.geokkjer.eu returning connection refused errors
    • Root Cause: Multiple issues - firewall ports, hostname mapping, SSH forwarding
    • Solutions Applied:
      • Added port 3000 to grey-area firewall configuration
      • Fixed reverse-proxy nginx configuration: http://apps:3000 → http://grey-area:3000
      • Updated SSH forwarding: apps:22 → grey-area:22 for port 1337
    • Result: External access verified - curl -I https://git.geokkjer.eu returns HTTP/2 200
  5. Hostname Resolution Consistency:

    • Problem: Inconsistent hostname references across configurations ("apps" vs "grey-area")
    • Root Cause: Legacy hostname references in reverse-proxy configuration
    • Solution: Updated all configurations to use consistent "grey-area" hostname
    • Result: Proper service discovery and reverse proxy routing
  6. User ID Consistency Challenge:

    • Current State: sma user has UID 1003 on grey-area vs 1001 on sleeper-service
    • Workaround: NFS access working via group permissions (users group: GID 100)
    • Future Fix: Implement centralized UID management across all machines

Recent Troubleshooting & Solutions (June 2025):

  1. NFS Dataset Structure: Proper understanding of ZFS dataset hierarchy crucial for NFS exports
    • /mnt/storage vs /mnt/storage/media dataset mounting differences
    • NFS exports must match actual ZFS dataset structure, not subdirectories
    • Client mount paths must align with server export paths for data access
  2. Network Transition Management: Tailscale vs local network connectivity during deployment
    • NFS exports need both local (10.0.0.0/24) and Tailscale (100.64.0.0/10) network access
    • extraHosts module provides consistent hostname resolution across network changes
    • Firewall configuration must accommodate service ports for external access
  3. Reverse Proxy Configuration: Hostname consistency critical for proxy functionality
    • nginx upstream configuration must use correct hostnames (grey-area not apps)
    • Service discovery relies on centralized hostname resolution modules
    • SSL certificate management works seamlessly with proper nginx configuration
  4. Service Integration: Multi-machine service architecture requires coordinated configuration
    • Forgejo deployment spans grey-area (service) + reverse-proxy (gateway) + DNS (domain)
    • NFS client/server coordination requires matching export/mount configurations
    • User ID consistency across machines essential for NFS file access permissions
  5. Firewall Management: Service-specific port configuration essential for functionality
    • Application servers need service ports opened (3000 for Forgejo, etc.)
    • Reverse proxy needs external ports (80, 443, 1337) and internal connectivity
    • SSH access coordination between local and Tailscale networks for security