# Home Lab Migration Plan

## Current Assessment

### CongenitalOptimist Machine

- **Current NixOS Version**: 25.05
- **Hardware**: AMD CPU/GPU, ZFS storage (zpool + stuffpool), NFS mounts
- **Desktop Environments**: GNOME, Cosmic, Sway
- **Virtualization**: libvirt, Incus, Podman
- **Configuration Style**: Traditional NixOS (non-flakes)
- **Dotfiles Approach**: Prefer Emacs org-mode with literate programming (no Home Manager)

### Current Structure

```
Home-lab/
├── Machines/
│   ├── CongenitalOptimist/ (existing - AMD workstation)
│   │   ├── configuration.nix
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   └── Modules/ (existing modular structure)
│       ├── common/
│       │   ├── base.nix (modern CLI tools & aliases)
│       │   └── tty.nix (console styling)
│       └── virtualization/
│           ├── podman.nix
│           ├── libvirt.nix
│           └── incus.nix
└── Users/
    └── geir/
        └── user.nix (has typo: progtams → programs)
```

### Target Structure (Post-Migration)

```
Home-lab/
├── flake.nix
├── flake.lock
├── machines/
│   ├── congenital-optimist/ (AMD workstation)
│   │   ├── default.nix
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   ├── sleeper-service/ (Intel Xeon E3-1230 V2 file server)
│   │   ├── default.nix
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   ├── reverse-proxy/ (edge/gateway server)
│   │   ├── default.nix
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   └── grey-area/ (application server)
│       ├── default.nix
│       ├── hardware-configuration.nix
│       └── About.org
├── modules/
│   ├── common/
│   ├── desktop/
│   ├── development/
│   ├── virtualization/
│   ├── services/
│   │   ├── nfs.nix
│   │   ├── samba.nix
│   │   ├── backup.nix
│   │   └── monitoring.nix
│   └── users/
│       └── common.nix (shared user configurations)
├── users/
│   └── geir/
│       ├── dotfiles/
│       │   ├── README.org (geir's literate config)
│       │   ├── emacs/
│       │   ├── shell/
│       │   └── editors/
│       └── user.nix (geir's system config)
├── overlays/
├── packages/
└── secrets/ (for future secrets management)
```

## Deployment Status & Accomplishments ✅

### sleeper-service Deployment (COMPLETED)

**Date**: Recently completed
**Status**: ✅ Fully operational
**Machine**: Intel Xeon E3-1230 V2, 16GB RAM (formerly files.home)

#### Key Achievements:

- **Flake Migration**: Successfully deployed NixOS flake configuration on remote machine
- **ZFS Stability**: Resolved ZFS mounting conflicts causing boot failures
- **Data Preservation**: All 903GB of media data intact and accessible
- **Network Integration**: Added Pi-hole DNS (10.0.0.14) for package resolution
- **SSH Infrastructure**: Implemented centralized SSH key management
- **Boot Performance**: Clean boot in ~1 minute with ZFS auto-mounting enabled
- **Remote Deployment**: Established rsync + SSH deployment workflow
- **NFS Server**: Configured NFS exports for both local (10.0.0.0/24) and Tailscale (100.64.0.0/10) networks (see the sketch below)
- **Network Configuration**: Updated to use Tailscale IPs for reliable mesh connectivity
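A minimal sketch of the dual-network export described above, using the stock NixOS `services.nfs.server` options; the export flags are illustrative assumptions, not copied from the deployed module:

```nix
# Sketch in the spirit of modules/services/nfs.nix (flags are assumptions)
{
  services.nfs.server = {
    enable = true;
    # Export the media dataset to both the local LAN and the Tailscale mesh
    exports = ''
      /mnt/storage/media 10.0.0.0/24(rw,sync,no_subtree_check) 100.64.0.0/10(rw,sync,no_subtree_check)
    '';
  };
  # NFS service ports (rpcbind + nfsd), as listed later in section 2.3
  networking.firewall.allowedTCPPorts = [ 111 2049 ];
}
```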
#### Technical Solutions:

- **ZFS Native Mounting**: Migrated from legacy mountpoints to ZFS native paths
- **Hardware Configuration**: Removed conflicting ZFS filesystem entries
- **Graphics Compatibility**: Added `nomodeset` kernel parameter, disabled NVIDIA drivers
- **DNS Configuration**: Multi-tier DNS with Pi-hole primary, router and Google fallback
- **Deployment Method**: Remote deployment via rsync + SSH instead of direct nixos-rebuild
- **NFS Exports**: Resolved dataset conflicts by commenting out conflicting tmpfiles rules
- **Network Access**: Added Tailscale interface (tailscale0) as trusted interface in firewall

#### Data Verified:

- **Storage Pool**: 903GB used, 896GB available
- **Media Content**: Films (184GB), Series (612GB), Audiobooks (94GB), Music (9.1GB), Books (3.5GB)
- **Mount Points**: `/mnt/storage` and `/mnt/storage/media` with proper ZFS auto-mounting
- **NFS Access**: Both datasets exported with proper permissions for network access

### grey-area Deployment (COMPLETED) ✅ NEW

**Date**: June 2025
**Status**: ✅ Fully operational
**Machine**: Intel Xeon E5-2670 v3 (24 cores) @ 3.10 GHz, 31.24 GiB RAM

#### Key Achievements:

- **Flake Configuration**: Successfully deployed NixOS flake-based configuration
- **NFS Client**: Configured reliable NFS mount to sleeper-service media storage via Tailscale
- **Service Stack**: Deployed comprehensive application server with multiple services
- **Network Integration**: Integrated with centralized extraHosts module using Tailscale IPs
- **User Management**: Resolved UID conflicts and implemented consistent user configuration
- **Firewall Configuration**: Properly configured ports for all services

#### Services Deployed:

- **Jellyfin**: ✅ Media server with access to NFS-mounted content from sleeper-service
- **Calibre-web**: ✅ E-book management and reading interface
- **Forgejo**: ✅ Git hosting server (git.geokkjer.eu) with reverse proxy integration
- **Audiobook Server**: ✅ Audiobook streaming and management

#### Technical Implementation:

- **NFS Mount**: `/mnt/remote/media` successfully mounting `sleeper-service:/mnt/storage/media` (see the sketch below)
- **Network Path**: Using Tailscale mesh (100.x.x.x) for reliable connectivity
- **Mount Options**: Configured with automount, soft mount, and appropriate timeouts
- **Firewall Ports**: 22 (SSH), 3000 (Forgejo), 23231 (other services)
- **User Configuration**: Fixed UID consistency with centralized sma user module

#### Data Access Verified:

- **Movies**: 38 films accessible via NFS
- **TV Series**: 29 series collections
- **Music**: 9 music directories
- **Audiobooks**: 79 audiobook collections
- **Books**: E-book collection
- **Media Services**: All content accessible through Jellyfin and other services
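A minimal sketch of that client mount, following the standard NixOS `fileSystems` NFS idiom; the timeout values are placeholders rather than the deployed settings:

```nix
# Hypothetical excerpt from the grey-area configuration
{
  fileSystems."/mnt/remote/media" = {
    device = "sleeper-service:/mnt/storage/media";
    fsType = "nfs";
    options = [
      "x-systemd.automount" # mount lazily on first access
      "noauto"              # do not block boot on the network
      "soft"                # fail I/O instead of hanging if the server drops
      "timeo=60"            # illustrative timeout/retry values
      "retrans=2"
    ];
  };
}
```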
### reverse-proxy Integration (COMPLETED) ✅ NEW

**Date**: June 2025
**Status**: ✅ Fully operational
**Machine**: External VPS (46.226.104.98)

#### Key Achievements:

- **Nginx Configuration**: Successfully configured reverse proxy for Forgejo
- **Hostname Resolution**: Fixed hostname mapping from incorrect "apps" to correct "grey-area"
- **SSL/TLS**: Configured ACME Let's Encrypt certificate for git.geokkjer.eu
- **SSH Forwarding**: Configured SSH proxy on port 1337 for Git operations
- **Network Security**: Implemented DMZ-style security with Tailscale-only SSH access

#### Technical Configuration:

- **HTTP Proxy**: `git.geokkjer.eu` → `http://grey-area:3000` (Forgejo)
- **SSH Proxy**: Port 1337 → `grey-area:22` for Git SSH operations
- **Network Path**: External traffic → reverse-proxy → Tailscale → grey-area
- **Security**: SSH restricted to Tailscale network, fail2ban protection
- **DNS**: Proper hostname resolution via extraHosts module
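The pieces above map onto standard NixOS nginx and ACME options roughly as follows; a sketch, not the deployed file, and the ACME contact address is a placeholder:

```nix
# Hypothetical excerpt from the reverse-proxy configuration
{
  security.acme = {
    acceptTerms = true;
    defaults.email = "admin@geokkjer.eu"; # placeholder contact address
  };
  services.nginx = {
    enable = true;
    virtualHosts."git.geokkjer.eu" = {
      enableACME = true;
      forceSSL = true;
      locations."/".proxyPass = "http://grey-area:3000"; # Forgejo
    };
    # Forward Git SSH traffic on port 1337 to grey-area's sshd
    streamConfig = ''
      server {
        listen 1337;
        proxy_pass grey-area:22;
      }
    '';
  };
  networking.firewall.allowedTCPPorts = [ 80 443 1337 ];
}
```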
### Centralized Network Configuration (COMPLETED) ✅ NEW

**Date**: June 2025
**Status**: ✅ Fully operational

#### Key Achievements:

- **extraHosts Module**: Created centralized hostname resolution using Tailscale IPs
- **Network Consistency**: All machines use the same IP mappings for reliable mesh connectivity
- **SSH Configuration**: Updated IP addresses in ssh-keys.nix module
- **User Management**: Resolved user configuration conflicts between modules

#### Network Topology:

- **Tailscale Mesh IPs**:
  - `100.109.28.53` - congenital-optimist (workstation)
  - `100.81.15.84` - sleeper-service (NFS file server)
  - `100.119.86.92` - grey-area (application server)
  - `100.96.189.104` - reverse-proxy (external VPS)
  - `100.103.143.108` - pihole (DNS server)
  - `100.126.202.40` - wordpresserver (legacy)

#### Module Integration:

- **extraHosts**: Added to all machine configurations for consistent hostname resolution (see the sketch below)
- **SSH Keys**: Updated IP addresses (grey-area: 10.0.0.12, reverse-proxy: 46.226.104.98)
- **User Modules**: Fixed conflicts between sma user definitions in different modules
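A sketch of what the extraHosts module boils down to, using the mesh IPs listed above; the module path is illustrative and the real module may carry more entries:

```nix
# e.g. modules/network/extra-hosts.nix (path assumed)
{
  networking.extraHosts = ''
    100.109.28.53   congenital-optimist
    100.81.15.84    sleeper-service
    100.119.86.92   grey-area
    100.96.189.104  reverse-proxy
    100.103.143.108 pihole
  '';
}
```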
### Home Lab Deployment Tool (COMPLETED) ✅ NEW

**Date**: Recently completed
**Status**: ✅ Fully operational
**Tool**: `lab` command - custom deployment management system

#### Key Achievements:

- **Custom Package Creation**: Developed `home-lab-tools.nix` package with comprehensive deployment functionality
- **System Integration**: Added lab command to system packages via `modules/system/applications.nix`
- **Conflict Resolution**: Resolved shell alias conflict by renaming "lab" alias to "home-lab"
- **Multi-Machine Support**: Deployment capabilities for sleeper-service, grey-area, and reverse-proxy
- **Status Monitoring**: Infrastructure connectivity checking with color-coded output
- **Deployment Modes**: Support for boot, test, and switch deployment modes

#### Technical Implementation:

- **Package Structure**: Custom Nix package using `writeShellScriptBin` with proper dependencies (sketched below)
- **Color-Coded Logging**: Blue info, green success, yellow warnings, red errors for clear output
- **SSH Infrastructure**: Leverages existing SSH key management for secure remote deployment
- **Rsync Deployment**: Efficient configuration syncing to target machines
- **Error Handling**: Comprehensive error checking and validation throughout the deployment process
- **Service Detection**: Proper Tailscale service monitoring with `tailscaled` detection

#### Available Commands:

- **`lab status`**: Check connectivity to all infrastructure machines
- **`lab deploy [mode]`**: Deploy configuration to a specific machine
  - **Machines**: sleeper-service, grey-area, reverse-proxy
  - **Modes**: boot (default), test (temporary), switch (permanent)
- **Help System**: Built-in usage documentation and examples

#### Deployment Workflow:

1. **Configuration Sync**: Uses rsync to transfer the entire Home-lab directory to the target machine
2. **Remote Execution**: SSH into the target machine and execute `nixos-rebuild` with the flake
3. **Validation**: Checks deployment success and provides clear feedback
4. **Status Verification**: Can verify deployment results with the status command

#### Infrastructure Status Integration:

- **Local Machine**: Checks Tailscale service status on congenital-optimist
- **Remote Machines**: SSH connectivity testing with timeout handling
- **Network Topology**: Integrates with existing Tailscale mesh network
- **Service Monitoring**: Foundation for future comprehensive monitoring system

#### Usage Examples:

```bash
lab status                        # Check all machine connectivity
lab deploy sleeper-service boot   # Deploy and set for next boot
lab deploy grey-area switch       # Deploy and activate immediately
lab deploy reverse-proxy test     # Deploy temporarily for testing
```

#### Technical Benefits:

1. **Centralized Deployment**: Single command interface for all home lab machines
2. **Consistent Process**: Standardized deployment workflow across infrastructure
3. **Error Prevention**: Validation and safety checks prevent deployment failures
4. **Operational Visibility**: Clear status reporting for infrastructure state
5. **Extensibility**: Modular design allows easy addition of new machines and features
6. **Integration**: Seamless integration with existing SSH and Tailscale infrastructure
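A condensed sketch of the `writeShellScriptBin` approach; the command names match the list above, while the directory paths and SSH flow are assumptions rather than the shipped tool:

```nix
# Stripped-down sketch of home-lab-tools.nix (paths are assumptions;
# the real tool adds color-coded logging and error handling)
{ pkgs }:
pkgs.writeShellScriptBin "lab" ''
  case "$1" in
    status)
      for host in sleeper-service grey-area reverse-proxy; do
        if ${pkgs.openssh}/bin/ssh -o ConnectTimeout=5 "$host" true 2>/dev/null; then
          echo "$host: reachable"
        else
          echo "$host: unreachable"
        fi
      done
      ;;
    deploy)
      machine="$2"; mode="''${3:-boot}"
      ${pkgs.rsync}/bin/rsync -a ~/Home-lab/ "$machine:/tmp/home-lab/"
      ${pkgs.openssh}/bin/ssh "$machine" \
        "sudo nixos-rebuild $mode --flake /tmp/home-lab#$machine"
      ;;
    *)
      echo "usage: lab {status|deploy <machine> [boot|test|switch]}"
      ;;
  esac
''
```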
---

## Phase 1: Flakes Migration (Priority: High)

### 1.1 Create Flake Foundation

- [x] Create `flake.nix` at repository root
- [x] Define nixpkgs input pinned to NixOS 25.05
- [x] Add nixpkgs-unstable for bleeding edge packages
- [x] Structure outputs for multiple machines (no Home Manager)
- [x] Fix inconsistent naming convention (machine directories to lowercase)
- [x] Update flake outputs to use correct lowercase paths

### 1.2 Restructure Configuration

- [x] Convert `configuration.nix` to flake-compatible format
- [x] **Keep `system.stateVersion` as "23.11"** (maintains data compatibility)
- [x] Update existing module imports for flake structure
- [x] Integrate existing user configuration properly
- [x] Fix nerd-fonts syntax for 25.05 compatibility
- [x] Fix hostname typo (congenial-optimist → congenital-optimist)

### 1.3 Consolidate User Configuration

- [x] Fix typo in `users/geir/user.nix` (progtams → programs) - already correct
- [x] Merge duplicate user packages between main config and user module
- [x] Decide on package location strategy (system vs user level)
- [x] Ensure all existing functionality is preserved

### 1.4 Configuration Testing & Validation

- [x] Validate flake syntax with `nix flake check`
- [x] Test build without switching: `nixos-rebuild build --flake`
- [x] Test configuration: `nixos-rebuild test --flake`
- [x] **Successfully tested modularized configuration with virtualization**

### 1.5 Desktop Environment Modularization ✅ NEW

- [x] Split monolithic `environments.nix` into modular components:
  - [x] `common.nix` - shared desktop configuration (XDG portal, dbus)
  - [x] `gnome.nix` - GNOME desktop environment with extensions
  - [x] `cosmic.nix` - System76 Cosmic desktop environment
  - [x] `sway.nix` - Sway window manager with Wayland tools
- [x] Update main configuration to use modular desktop imports
- [x] Test modular desktop configuration successfully

### 1.6 Virtualization Stack ✅ NEW

- [x] Add comprehensive virtualization support:
  - [x] **Incus** - modern container and VM management (replaces LXD)
  - [x] **Libvirt/QEMU** - full KVM virtualization with virt-manager
  - [x] **Podman** - rootless containers with Docker compatibility
- [x] Configure proper user groups (incus-admin, libvirt, podman)
- [x] Enable UEFI/OVMF support for modern VM guests
- [x] Test all virtualization services running successfully
- [ ] Create rollback plan and ZFS snapshots
- [ ] Switch to flake configuration permanently

### 1.7 GitOps Foundation & CI/CD Setup ✅ NEW

- [x] Initialize git repository for infrastructure as code
- [x] Create comprehensive `.gitignore` for NixOS/Nix projects
- [x] Set up initial commit with current modular configuration
- [x] Plan CI/CD pipeline for configuration validation
- [x] Design branch strategy for infrastructure changes
- [x] Create templates for pull request workflows
- [x] Plan automated testing for configuration changes
- [x] Set up secrets management strategy for CI/CD
- [x] Document GitOps workflow for multi-machine deployments

### 1.8 Additional Migration Tasks

- [x] Update all documentation files to use consistent naming
- [x] Update flake descriptions and comments for clarity
- [x] Verify all module imports work correctly in new structure
- [x] Modularize congenital-optimist configuration into logical modules
- [ ] Clean up any remaining references to old PascalCase paths
- [ ] Test that existing aliases and CLI tools still work
- [ ] Verify desktop environments (GNOME, Cosmic, Sway) all function
- [ ] Test virtualization stack (podman, libvirt, incus) functionality
- [ ] Validate ZFS and storage configuration compatibility
- [x] Generate and commit flake.lock file
- [ ] Create backup of current working configuration before final switch

## Phase 2: Configuration Cleanup & Organization

### 2.1 Optimize Current Modular Structure

- [ ] Review and optimize existing `common/base.nix` tools
- [ ] Enhance `common/tty.nix` console configuration
- [ ] Validate virtualization modules are complete
- [ ] Create desktop environment modules (separate GNOME, Cosmic, Sway)
- [ ] Separate development tools into dedicated module

### 2.2 Target Directory Structure

```
Home-lab/
├── flake.nix
├── flake.lock
├── machines/
│   ├── congenital-optimist/ (AMD workstation)
│   │   ├── default.nix (main machine config)
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   ├── sleeper-service/ (Intel Xeon file server)
│   │   ├── default.nix (file server config)
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   ├── reverse-proxy/ (edge/gateway server)
│   │   ├── default.nix
│   │   ├── hardware-configuration.nix
│   │   └── About.org
│   └── grey-area/ (application server)
│       ├── default.nix
│       ├── hardware-configuration.nix
│       └── About.org
├── modules/
│   ├── common/
│   │   ├── base.nix (existing modern CLI tools)
│   │   ├── tty.nix (existing console config)
│   │   └── nix.nix (flakes + experimental features)
│   ├── desktop/
│   │   ├── gnome.nix
│   │   ├── cosmic.nix
│   │   └── sway.nix
│   ├── development/
│   │   ├── editors.nix (emacs, neovim, vscode, etc.)
│   │   ├── languages.nix (rust, python, LSPs)
│   │   └── tools.nix
│   ├── virtualization/ (existing)
│   │   ├── podman.nix
│   │   ├── libvirt.nix
│   │   └── incus.nix
│   ├── services/ (for SleeperService + grey-area)
│   │   ├── nfs.nix (network file sharing)
│   │   ├── samba.nix (windows compatibility)
│   │   ├── backup.nix (automated backups)
│   │   ├── monitoring.nix (system monitoring)
│   │   ├── storage.nix (ZFS/RAID management)
│   │   ├── reverse-proxy.nix (nginx/traefik configuration)
│   │   ├── forgejo.nix (git hosting and CI/CD)
│   │   ├── media.nix (jellyfin configuration)
│   │   └── applications.nix (containerized services)
│   └── users/
│       └── common.nix (shared user configurations)
├── users/
│   └── geir/
│       ├── dotfiles/
│       │   ├── README.org (main literate config)
│       │   ├── emacs/
│       │   ├── shell/
│       │   └── editors/
│       └── user.nix (consolidated user config)
├── overlays/
├── packages/
└── secrets/ (for future secrets management)
```

### 2.3 Network Infrastructure Updates

- [x] **Network topology discovery**: Used nmap to map the actual network layout
  - **Network Range**: `10.0.0.0/24` (not 192.168.1.x as initially assumed)
  - **Gateway**: `10.0.0.138` (lan.home - router/firewall)
  - **DNS Server**: `10.0.0.14` (pi.hole - Pi-hole ad-blocker)
  - **Current File Server**: `10.0.0.8` (files.home - will be renamed to sleeper-service)
  - **Machine Migration**: sleeper-service is the existing files.home machine, not a new deployment
- [x] **sleeper-service systemd-networkd migration**: ✅ **COMPLETED and DEPLOYED**
  - [x] **Hostname transition**: Successfully renamed from files.home to sleeper-service
  - [x] **Static IP preserved**: Maintained 10.0.0.8/24 with gateway 10.0.0.138
  - [x] **DNS integration**: Pi-hole primary (10.0.0.14), router fallback (10.0.0.138), Google DNS (8.8.8.8)
  - [x] **Network stack**: `networking.useNetworkd = true` with `networking.useDHCP = false` (see the sketch after this section)
  - [x] **Interface configuration**: `enp0s25` configured with declarative static IPv4
  - [x] **Service ports**: File server ports configured (NFS: 111, 2049; SMB: 139, 445; NetBIOS: 137, 138)
  - [x] **Production validation**: Network configuration tested and operational
- [ ] **Network standardization**: Plan consistent networkd configuration across all server-role machines; the workstation and laptop can keep using NetworkManager
- [x] **IP address allocation**: Document static IP assignments for each service
  - **Local Network (10.0.0.0/24)**:
    - **10.0.0.2**: arlaptop.home (existing laptop)
    - **10.0.0.3**: congenital-optimist (AMD workstation - current machine)
    - **10.0.0.8**: sleeper-service (Intel Xeon file server - rename from files.home)
    - **10.0.0.11**: grey-area (planned application server)
    - **10.0.0.12**: reverse-proxy (planned edge server)
    - **10.0.0.14**: pi.hole (Pi-hole DNS/ad-blocker) - maybe move to NixOS
    - **10.0.0.90**: wordpresserver.home (existing WordPress server) - to be deleted, Incus container
    - **10.0.0.117**: webdev.home (existing web development server) - to be deleted, Incus container
    - **10.0.0.138**: lan.home (router/gateway/DHCP)
  - **Tailscale Network (100.x.x.x/10)**:
    - **100.109.28.53**: congenital-optimist (current machine)
    - **100.119.86.92**: apps (active server) - rename to grey-area
    - **100.114.185.71**: arlaptop (laptop, Arch Linux with plans to migrate to NixOS)
    - **100.81.15.84**: files (file server) - rename to sleeper-service
    - **100.103.143.108**: pihole (DNS server)
    - **100.96.189.104**: vps1 (external VPS) - rename to reverse-proxy
    - **100.126.202.40**: wordpresserver (WordPress) - to be deleted
  - Reminder: update the machine names in Tailscale, or find a way to do this via the CLI
- [ ] **VLAN planning**: Consider network segmentation for different service types
- [ ] **DNS configuration**: Plan local DNS resolution for internal services
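The deployed sleeper-service network stack above translates roughly to the following sketch, using the documented addresses; the standard `networking.*` options are rendered through systemd-networkd when `useNetworkd` is on:

```nix
# Sketch of the sleeper-service static network configuration
{
  networking = {
    hostName = "sleeper-service";
    useDHCP = false;
    useNetworkd = true; # render the config via systemd-networkd
    interfaces.enp0s25.ipv4.addresses = [
      { address = "10.0.0.8"; prefixLength = 24; }
    ];
    defaultGateway = "10.0.0.138";
    # Pi-hole primary, router fallback, Google DNS last
    nameservers = [ "10.0.0.14" "10.0.0.138" "8.8.8.8" ];
  };
}
```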
## Phase 3: System Upgrade & Validation

### 3.1 Pre-upgrade Preparation

- [ ] Backup current system configuration
- [ ] Document current package versions
- [ ] Create ZFS snapshots of all datasets
- [ ] Test flake build without switching
- [ ] Verify all existing modules work in flake context

### 3.2 Upgrade Execution

- [ ] Switch to flake-based configuration
- [ ] Upgrade to NixOS 25.05
- [ ] Validate all services start correctly
- [ ] Test desktop environments functionality
- [ ] Verify virtualization stack
- [ ] Check user environment and packages

### 3.3 Post-upgrade Validation

- [ ] Verify all applications launch
- [ ] Test development tools (editors, LSPs, compilers)
- [ ] Validate container and VM functionality
- [ ] Check ZFS and NFS mount operations
- [ ] Verify shell environment and modern CLI tools work
- [ ] Test console theming and TTY setup

## Phase 4: Dotfiles & Configuration Management

### 4.1 GNU Stow Infrastructure for Regular Dotfiles ✅ DECIDED

**Approach**: Use GNU Stow for traditional dotfiles, literate programming for Emacs only

#### GNU Stow Setup

- [ ] Create `~/dotfiles/` directory structure with package-based organization
- [ ] Set up core packages: `zsh/`, `git/`, `tmux/`, `starship/`, etc.
- [ ] Configure selective deployment per machine (workstation vs servers)
- [ ] Create stow deployment scripts for different machine profiles
- [ ] Document stow workflow and package management

#### Package Structure

```
~/dotfiles/                # Stow directory (target: $HOME)
├── zsh/                   # Shell configuration
│   ├── .zshrc
│   ├── .zshenv
│   └── .config/zsh/
├── git/                   # Git configuration
│   ├── .gitconfig
│   └── .config/git/
├── starship/              # Prompt configuration
│   └── .config/starship.toml
├── tmux/                  # Terminal multiplexer
│   └── .tmux.conf
├── emacs/                 # Basic Emacs bootstrap (points to literate config)
│   └── .emacs.d/early-init.el
└── machine-specific/      # Per-machine configurations
    ├── workstation/
    └── server/
```

### 4.2 Literate Programming for Emacs Configuration ✅ DECIDED

**Approach**: Comprehensive org-mode literate configuration for Emacs only

#### Emacs Literate Setup

- [ ] Create `~/dotfiles/emacs/.emacs.d/configuration.org` as master config
- [ ] Set up automatic tangling on save (org-babel-tangle-on-save)
- [ ] Modular org sections: packages, themes, keybindings, workflows
- [ ] Bootstrap early-init.el to load tangled configuration
- [ ] Create machine-specific customizations within org structure

#### Literate Configuration Structure

```
~/dotfiles/emacs/.emacs.d/
├── early-init.el        # Bootstrap (generated by Stow)
├── configuration.org    # Master literate config
├── init.el              # Tangled from configuration.org
├── modules/             # Tangled module files
│   ├── base.el
│   ├── development.el
│   ├── org-mode.el
│   └── ui.el
└── machine-config/      # Machine-specific overrides
    ├── workstation.el
    └── server.el
```

### 4.3 Integration Strategy

- [ ] **System-level**: NixOS modules provide system packages and environment
- [ ] **User-level**: GNU Stow manages dotfiles and application configurations
- [ ] **Emacs-specific**: Org-mode literate programming for comprehensive Emacs setup
- [ ] **Per-machine**: Selective stow packages + machine-specific customizations
- [ ] **Version control**: Git repository for dotfiles with separate org documentation

### 4.4 Deployment Workflow

- [ ] Create deployment scripts for different machine types:
  - **Workstation**: Full package deployment (zsh, git, tmux, starship, emacs)
  - **Server**: Minimal package deployment (zsh, git, basic emacs)
  - **Development**: Additional packages (language-specific tools, IDE configs)
- [ ] Integration with existing `lab` deployment tool
- [ ] Documentation for new user onboarding across machines
## Phase 5: Home Lab Expansion Planning

### 5.1 Infrastructure Additions

#### Naming Convention

- **Machine Names**: Culture ship names in PascalCase (e.g., `CongenitalOptimist`, `SleeperService`)
- **Folder Names**: lowercase-with-hyphens (e.g., `congenital-optimist/`, `sleeper-service/`)
- **Flake Outputs**: lowercase-with-hyphens (e.g., `nixosConfigurations.congenital-optimist`)
- **Hostnames**: lowercase-with-hyphens (e.g., `congenital-optimist`, `sleeper-service`)
- **User Names**: Culture character names in lowercase (e.g., `sma`, `geir`)

- [x] **SleeperService** file server (Intel Xeon E3-1230 V2, 16GB RAM): ✅ **COMPLETED**
  - [x] NFS server for network storage (903GB ZFS pool operational)
  - [x] ZFS storage with native mounting configuration
  - [x] Flake-based NixOS deployment successful
  - [x] SSH key management implemented
  - [x] Network configuration with Pi-hole DNS integration
  - [x] System boots cleanly in ~1 minute with ZFS auto-mounting
  - [x] Data preservation verified (Films: 184GB, Series: 612GB, etc.)
  - [x] NFS exports configured for both local and Tailscale networks
  - [x] Resolved dataset conflicts and tmpfiles rule conflicts
  - [ ] Automated backup services (future enhancement)
  - [ ] System monitoring and alerting (future enhancement)
- [x] **reverse-proxy** edge server: ✅ **COMPLETED**
  - [x] Nginx reverse proxy with proper hostname mapping (grey-area vs apps)
  - [x] SSL/TLS termination with Let's Encrypt for git.geokkjer.eu
  - [x] External access gateway with DMZ security configuration
  - [x] SSH forwarding on port 1337 for Git operations
  - [x] Fail2ban protection and Tailscale-only SSH access
  - [x] Minimal attack surface, headless operation
- [x] **grey-area** application server (Culture GCU - versatile, multi-purpose): ✅ **COMPLETED**
  - [x] **Primary**: Forgejo Git hosting (git.geokkjer.eu) with reverse proxy integration
  - [x] **Secondary**: Jellyfin media server with NFS-mounted content
  - [x] **Additional**: Calibre-web e-book server and audiobook streaming
  - [x] **Infrastructure**: Container-focused (Podman), NFS client for media storage
  - [x] **Integration**: Central Git hosting accessible externally via reverse proxy
  - [x] **Network**: Integrated with Tailscale mesh and centralized hostname resolution
  - [x] **User Management**: Resolved UID conflicts with centralized sma user configuration
  - [ ] **Monitoring**: TBD (future enhancement)
- [ ] **PostgreSQL**: Plan database services for applications requiring persistent storage
- [ ] Plan for additional users across machines:
  - [x] **geir** - primary user (development, desktop, daily use)
  - [x] **sma** - admin user (Diziet Sma; system administration, security oversight)
  - [ ] Service accounts for automation (forgejo-admin, backup-agent)
  - [ ] Guest accounts for temporary access
- [x] Culture character naming convention established
- [x] **Network infrastructure planning**: Started with sleeper-service systemd-networkd migration
- [ ] Consider hardware requirements for future expansion

### 5.2 Services Architecture

- [ ] Centralized configuration management
- [ ] Per-user secrets management (agenix/sops-nix)
- [ ] User-specific service configurations
- [ ] Monitoring and logging (Prometheus, Grafana)
- [ ] Backup strategy across machines and users
- [ ] Container orchestration planning

### 5.3 Security & Networking

- [x] **systemd-networkd migration**: Completed for sleeper-service with static IP configuration
- [x] **SSH key management centralization**: ✅ **IMPLEMENTED and DEPLOYED**
  - [x] **Admin key** (`geir@geokkjer.eu-admin`): For sma user, server administration access
  - [x] **Development key** (`geir@geokkjer.eu-dev`): For geir user, git services, daily development
  - [x] **NixOS module**: `modules/security/ssh-keys.nix` centralizes key management (sketched after this section)
  - [x] **SSH client config**: Updated with role-based host patterns and key selection
  - [x] **Production deployment**: Successfully deployed on sleeper-service
  - [x] **Security benefits**: Principle of least privilege, limited blast radius if a key is compromised
  - [x] **Usage examples**:
    - `ssh geir@sleeper-service.home` - uses the dev key automatically
    - `ssh admin-sleeper` - uses the admin key for sma user access
    - `git clone git@github.com:user/repo` - uses the dev key for git operations
- [ ] VPN configuration (Tailscale expansion)
- [ ] Firewall rules standardization across machines
- [ ] Certificate management (Let's Encrypt)
- [ ] Network segmentation planning (VLANs for services vs. user devices)
- [ ] DNS infrastructure (local DNS server for service discovery)
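A simplified sketch of what `modules/security/ssh-keys.nix` centralizes, following the two-key split described above; the key material is elided:

```nix
# Sketch of modules/security/ssh-keys.nix (key bodies elided)
{
  # Admin key: sma user, server administration
  users.users.sma.openssh.authorizedKeys.keys = [
    "ssh-ed25519 AAAA... geir@geokkjer.eu-admin"
  ];
  # Development key: geir user, git services and daily development
  users.users.geir.openssh.authorizedKeys.keys = [
    "ssh-ed25519 AAAA... geir@geokkjer.eu-dev"
  ];
}
```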
## Phase 6: Advanced Features

### 6.1 Development Workflow

- [ ] Devshells for different projects
- [ ] Cachix setup for faster builds
- [ ] CI/CD integration
- [ ] Literate dotfiles with org-mode tangling automation

### 6.2 Automation & Maintenance

- [ ] AI integration - development of an MCP server for the cluster
- [ ] Automated system updates
- [ ] Configuration validation tests
- [ ] Deployment automation
- [ ] Monitoring and alerting

### 6.3 Advanced Deployment Strategies ✅ RESEARCH COMPLETED

#### Deploy-rs Migration (Priority: High) 📋 RESEARCHED

- [x] **Research deploy-rs capabilities** ✅ COMPLETED
  - [x] Rust-based deployment tool specifically designed for NixOS flakes
  - [x] Features: parallel deployment, automatic rollback, health checks, SSH-based
  - [x] Advanced capabilities: atomic deployments, magic rollback on failure
  - [x] Profile management: system, user, and custom profiles supported
  - [x] Integration potential: works with existing SSH keys and Tailscale network
- [ ] **Migration Planning**: Transition from custom `lab` script to deploy-rs
  - [ ] Create deploy-rs configuration in flake.nix for all 4 machines
  - [ ] Configure nodes: sleeper-service, grey-area, reverse-proxy, congenital-optimist
  - [ ] Set up health checks for critical services (NFS, Forgejo, Jellyfin, nginx)
  - [ ] Test parallel deployment capabilities across infrastructure
  - [ ] Implement automatic rollback for failed deployments
  - [ ] Document migration benefits and new deployment workflow

#### Deploy-rs Configuration Structure

```nix
# flake.nix additions
deploy.nodes = {
  sleeper-service = {
    hostname = "100.81.15.84"; # Tailscale IP
    profiles.system.path = deploy-rs.lib.x86_64-linux.activate.nixos
      self.nixosConfigurations.sleeper-service;
    profiles.system.user = "root";
  };
  grey-area = {
    hostname = "100.119.86.92";
    profiles.system.path = deploy-rs.lib.x86_64-linux.activate.nixos
      self.nixosConfigurations.grey-area;
    # Health checks for Forgejo, Jellyfin services
  };
  reverse-proxy = {
    hostname = "100.96.189.104";
    profiles.system.path = deploy-rs.lib.x86_64-linux.activate.nixos
      self.nixosConfigurations.reverse-proxy;
    # Health checks for nginx, SSL certificates
  };
};
```
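To complement the node definitions, deploy-rs also expects the flake input and checks to be wired up; a sketch following the upstream README pattern:

```nix
# flake.nix wiring for deploy-rs (sketch; follows the upstream README)
{
  inputs.deploy-rs.url = "github:serokell/deploy-rs";

  outputs = { self, nixpkgs, deploy-rs, ... }: {
    # deploy.nodes as sketched above, plus checks so that
    # `nix flake check` validates the deployment configuration:
    checks = builtins.mapAttrs
      (system: deployLib: deployLib.deployChecks self.deploy)
      deploy-rs.lib;
  };
}
```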
#### Migration Benefits

- **Atomic deployments**: Complete success or automatic rollback
- **Parallel deployment**: Deploy to multiple machines simultaneously
- **Health checks**: Validate services after deployment
- **Connection resilience**: Better handling of SSH/network issues
- **Flake-native**: Designed specifically for NixOS flake workflows
- **Safety**: Magic rollback prevents broken deployments

#### Alternative: Guile Scheme Exploration (Priority: Low)

- [ ] **Research Guile Scheme for system administration**
  - [ ] Evaluate functional deployment scripting patterns
  - [ ] Compare with current shell script and deploy-rs approaches
  - [ ] Consider integration with GNU Guix deployment patterns
  - [ ] Assess learning curve vs. practical benefits for home lab use case

### 6.4 Writeup

- [ ] Take all the knowledge we have amassed and turn it into a blog post or a series of blog posts

## Phase 7: Going Pro

- [ ] A plan to generalise this project so it is usable by other people
- [ ] A plan for a dashboard and web interface for the project

## Timeline Estimates

- **Phase 1**: 1-2 weeks (flakes migration)
- **Phase 2**: 1 week (cleanup and organization)
- **Phase 3**: 2-3 days (upgrade and validation)
- **Phase 4**: 1 week (literate dotfiles setup)
- **Phase 5**: 2-4 weeks (expansion planning and implementation)
- **Phase 6**: Ongoing (advanced features as needed)

## Risk Mitigation

### Critical Risks

1. **Boot failure after upgrade**: ZFS snapshots for quick rollback
2. **Desktop environment issues**: Keep multiple DEs as fallback
3. **Virtualization breakage**: Document current VM configurations
4. **Data loss**: Multiple backup layers (ZFS, external)
5. **User environment regression**: Backup existing dotfiles

### Rollback Strategy

- ZFS snapshot rollback capability (see the sketch below)
- Keep old configuration.nix as reference
- Maintain emergency boot media
- Document manual recovery procedures
- Preserve current user configuration during migration
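For the ZFS layer of this strategy, NixOS can rotate snapshots automatically; a sketch with placeholder retention counts:

```nix
# Automatic ZFS snapshots as one rollback layer (retention counts are
# placeholders; applies to datasets with com.sun:auto-snapshot=true)
{
  services.zfs.autoSnapshot = {
    enable = true;
    frequent = 4;  # number of 15-minute snapshots to keep
    hourly = 24;
    daily = 7;
    weekly = 4;
    monthly = 12;
  };
}
```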
## Current Status Overview (Updated December 2024)

### Infrastructure Deployment Status ✅ MAJOR MILESTONE ACHIEVED

- ✅ **PHASE 1**: Flakes Migration - **COMPLETED**
- ✅ **PHASE 2**: Configuration Cleanup - **COMPLETED**
- ✅ **PHASE 3**: System Upgrade & Validation - **COMPLETED**
- ✅ **PHASE 5**: Home Lab Expansion - **4/4 MACHINES FULLY OPERATIONAL** 🎉

### Machine Status

- ✅ **congenital-optimist**: Development workstation (fully operational)
- ✅ **sleeper-service**: NFS file server with 903GB media library (fully operational)
- ✅ **grey-area**: Application server with Forgejo, Jellyfin, Calibre-web, audiobook server (fully operational)
- ✅ **reverse-proxy**: External gateway with nginx, SSL termination, SSH forwarding (fully operational)

### Network Architecture Status

- ✅ **Tailscale Mesh**: All machines connected via secure mesh network (100.x.x.x addresses)
- ✅ **Hostname Resolution**: Centralized extraHosts module deployed across all machines
- ✅ **NFS Storage**: Reliable media storage access via Tailscale network (sleeper-service → grey-area)
- ✅ **External Access**: Public services accessible via git.geokkjer.eu with SSL
- ✅ **SSH Infrastructure**: Centralized key management with role-based access patterns
- ✅ **Firewall Configuration**: Service ports properly configured across all machines

### Services Status - FULLY OPERATIONAL STACK 🚀

- ✅ **Git Hosting**: Forgejo operational at git.geokkjer.eu with SSH access on port 1337
- ✅ **Media Streaming**: Jellyfin with NFS-mounted content library (38 movies, 29 TV series)
- ✅ **E-book Management**: Calibre-web for book collections
- ✅ **Audiobook Streaming**: Audiobook server with 79 audiobook collections
- ✅ **File Storage**: NFS server with 903GB media library accessible across the network
- ✅ **Web Gateway**: Nginx reverse proxy with Let's Encrypt SSL and proper hostname mapping
- ✅ **User Management**: Consistent UID/GID configuration across machines (sma user: 1001/992)

### Infrastructure Achievements - COMPREHENSIVE DEPLOYMENT ✅

- ✅ **NFS Mount Resolution**: Fixed grey-area `/mnt/storage` → `/mnt/storage/media` dataset access
- ✅ **Network Exports**: Updated sleeper-service NFS exports for Tailscale network (100.64.0.0/10)
- ✅ **Service Discovery**: Corrected reverse-proxy hostname mapping from "apps" to "grey-area"
- ✅ **Firewall Management**: Added port 3000 for Forgejo service accessibility
- ✅ **SSH Forwarding**: Configured SSH proxy on port 1337 for Git operations
- ✅ **SSL Termination**: Let's Encrypt certificates working for git.geokkjer.eu
- ✅ **Data Verification**: All media content accessible (movies, TV, music, audiobooks, books)
- ✅ **Deployment Tools**: Custom `lab` command operational for infrastructure management

### Current Operational Status

**🟢 ALL CORE INFRASTRUCTURE DEPLOYED AND OPERATIONAL**

- **4/4 machines deployed** with full service stack
- **External access verified**: `curl -I https://git.geokkjer.eu` returns HTTP/2 200
- **NFS connectivity confirmed**: Media files accessible across the network via Tailscale
- **Service integration complete**: Forgejo, Jellyfin, Calibre-web, audiobook server running
- **Network mesh stable**: All machines connected via Tailscale with centralized hostname resolution

### Next Phase Priorities

- [ ] **PHASE 4**: GNU Stow + Literate Emacs Setup
  - [ ] Set up GNU Stow infrastructure for regular dotfiles (zsh, git, tmux, starship)
  - [ ] Create comprehensive Emacs literate configuration with org-mode
  - [ ] Implement selective deployment per machine type (workstation vs server)
  - [ ] Integration with existing NixOS system-level configuration
- [ ] **PHASE 6**: Advanced Features & Deploy-rs Migration
  - [ ] Migrate from custom `lab` script to deploy-rs for improved deployment
  - [ ] Implement system monitoring and alerting infrastructure
  - [ ] Set up automated backup services for critical data
  - [ ] Create health checks and deployment validation
- [ ] **Documentation & Knowledge Sharing**
  - [ ] Comprehensive blog post series documenting the full home lab journey
  - [ ] User guides for GNU Stow + literate Emacs configuration workflow
  - [ ] Deploy-rs migration guide and lessons learned
- [ ] **SSH & Network Infrastructure Improvements**: Combined priority for related infrastructure upgrades
  - [ ] SSH connection testing with the original ed25519 key (already approved in Forgejo)
  - [ ] Consider testing a direct connection to forgejo@grey-area first to bypass the proxy
  - [ ] SSH debugging and key management refinement
  - [ ] Migration from nginx streams to HAProxy for better SSH forwarding and load balancing
  - [ ] Gradual re-hardening of SSH security (Tailscale-only access) after Git verification
  - [ ] Deploy-rs migration for improved deployment automation and health checks
- [ ] **Future Enhancements**
  - [ ] User ID consistency cleanup (sma user UID alignment across machines)
  - [ ] CI/CD integration with Forgejo for automated testing and deployment
---

## Success Criteria

### Core Infrastructure ✅ FULLY ACHIEVED 🎉

- [x] System boots reliably with flake configuration
- [x] All current functionality preserved
- [x] NixOS 25.05 running stable across all machines
- [x] Configuration is modular and maintainable
- [x] User environment fully functional with all packages
- [x] Modern CLI tools and aliases working
- [x] Console theming preserved
- [x] Virtualization stack operational
- [x] **Multi-machine expansion completed (4/4 machines deployed)**
- [x] Development workflow improved with Git hosting

### Service Architecture ✅ FULLY ACHIEVED 🚀

- [x] NFS file server operational with reliable network access via Tailscale
- [x] Git hosting with external access via reverse proxy (git.geokkjer.eu)
- [x] Media services with shared storage backend (Jellyfin + 903GB library)
- [x] E-book and audiobook management services operational
- [x] Secure external access with SSL termination and SSH forwarding
- [x] Network mesh connectivity with centralized hostname resolution
- [x] **All services verified operational and accessible externally**

### Network Integration ✅ FULLY ACHIEVED 🌐

- [x] Tailscale mesh network connecting all infrastructure machines
- [x] Centralized hostname resolution via extraHosts module
- [x] NFS file sharing working reliably over the network
- [x] SSH key management with role-based access patterns
- [x] Firewall configuration properly securing all services
- [x] **External domain (git.geokkjer.eu) with SSL certificates working**

### Outstanding Enhancement Goals 🔄

- [ ] Literate dotfiles workflow established with org-mode
- [ ] Documentation complete for future reference and blog writeup
- [ ] System monitoring and alerting infrastructure (Prometheus/Grafana)
- [ ] Automated deployment and maintenance improvements
- [ ] Automated backup services for critical data
- [ ] User ID consistency cleanup across machines

## Infrastructure Notes

### CongenitalOptimist (AMD Workstation)

- Already has an excellent modular structure
- Modern CLI tools (eza, bat, ripgrep, etc.) already configured in base.nix
- Console theming with Joker palette already implemented
- User configuration needs cleanup (fix typo, consolidate packages)
- ZFS configuration is solid and shouldn't need changes
- Keep Tailscale configuration as network foundation
- The AMD GPU setup should carry over cleanly to 25.05
- Consider renaming hostname from "work" to "congenital-optimist" for consistency

### SleeperService (Intel Xeon File Server)

- Intel Xeon E3-1230 V2 @ 3.70GHz (4 cores, 8 threads)
- 16GB RAM - adequate for file server operations
- Perfect for reliable, background file-serving tasks
- Culture name fits: "massive GSV with a reputation for taking on unusual tasks"
- Will handle NFS mounts currently served by the external "files" server
- Plan for ZFS or software RAID for data redundancy
- Headless operation - no desktop environments needed
- SSH-only access with robust monitoring

### reverse-proxy (Edge Server)

- Lightweight hardware requirements (modest specs suffice)
- Primary role: SSL/TLS termination and traffic routing
- External-facing server with minimal attack surface
- Nginx or Traefik for reverse proxy functionality
- Let's Encrypt integration for automated certificate management
- Fail2ban and security hardening
- Routes traffic to internal services (grey-area, sleeper-service)

### grey-area (Application Server - Culture GCU)

- **Hardware**: Intel Xeon E5-2670 v3 (24 cores) @ 3.10 GHz, 31.24 GiB RAM
- **Primary Mission**: Forgejo Git hosting and project management
- **Performance**: Excellent specs for heavy containerized workloads and CI/CD
- **Container-focused architecture** using Podman
- **PostgreSQL database** for Forgejo
- **Concurrent multi-service deployment capability**
- **Secondary services**: Jellyfin (with transcoding), Nextcloud, Grafana
- Integration hub for all home lab development projects
- Culture name fits: "versatile ship handling varied, ambiguous tasks"
- Central point for CI/CD pipelines and automation
### Home Lab Philosophy

- The Emacs org-mode literate programming approach provides better control than Home Manager
- Culture ship names create memorable, characterful infrastructure
- Modular NixOS configuration allows easy machine additions
- Per-user dotfiles structure scales across multiple machines
- Tailscale provides a secure network foundation for the multi-machine setup

#### Recent Critical Issue Resolution (December 2024) 🔧

**NFS Mount and Service Integration Issues - RESOLVED**

1. **NFS Dataset Structure Resolution**:
   - **Problem**: grey-area couldn't access media files via NFS mount
   - **Root Cause**: ZFS dataset structure confusion - mounting `/mnt/storage` vs `/mnt/storage/media`
   - **Solution**: Updated grey-area NFS mount from `sleeper-service:/mnt/storage` to `sleeper-service:/mnt/storage/media`
   - **Result**: All media content now accessible (38 movies, 29 TV series, 9 music albums, 79 audiobooks)

2. **NFS Network Export Configuration**:
   - **Problem**: NFS exports only configured for the local network (10.0.0.0/24)
   - **Root Cause**: Missing Tailscale network access in NFS exports
   - **Solution**: Updated sleeper-service NFS exports to include the Tailscale network (100.64.0.0/10)
   - **Result**: Reliable NFS connectivity over the Tailscale mesh network

3. **Conflicting tmpfiles Rules**:
   - **Problem**: systemd tmpfiles creating conflicting directory structures for NFS exports
   - **Root Cause**: tmpfiles.d rules interfering with ZFS dataset mounting
   - **Solution**: Commented out conflicting tmpfiles rules in the sleeper-service configuration
   - **Result**: Clean NFS export structure without mounting conflicts

4. **Forgejo Service Accessibility**:
   - **Problem**: git.geokkjer.eu returning connection refused errors
   - **Root Cause**: Multiple issues - firewall ports, hostname mapping, SSH forwarding
   - **Solutions Applied**:
     - Added port 3000 to grey-area firewall configuration
     - Fixed reverse-proxy nginx configuration: `http://apps:3000` → `http://grey-area:3000`
     - Updated SSH forwarding: `apps:22` → `grey-area:22` for port 1337
   - **Result**: External access verified - `curl -I https://git.geokkjer.eu` returns HTTP/2 200

5. **Hostname Resolution Consistency**:
   - **Problem**: Inconsistent hostname references across configurations ("apps" vs "grey-area")
   - **Root Cause**: Legacy hostname references in the reverse-proxy configuration
   - **Solution**: Updated all configurations to use the consistent "grey-area" hostname
   - **Result**: Proper service discovery and reverse proxy routing

6. **User ID Consistency Challenge**:
   - **Current State**: The sma user has UID 1003 on grey-area vs 1001 on sleeper-service
   - **Workaround**: NFS access working via group permissions (users group: GID 100)
   - **Future Fix**: Implement centralized UID management across all machines (see the sketch below)
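That future fix amounts to pinning the UID in one shared module instead of letting each machine allocate it; a sketch using the sleeper-service values noted above:

```nix
# Sketch: pin sma's UID once, import the module on every machine
{
  users.users.sma = {
    isNormalUser = true;
    uid = 1001;      # explicit UID prevents per-machine drift
    group = "users"; # GID 100 - the current NFS permission workaround
  };
}
```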
#### Recent Troubleshooting & Solutions (June 2025)

1. **NFS Dataset Structure**: Proper understanding of the ZFS dataset hierarchy is crucial for NFS exports
   - `/mnt/storage` vs `/mnt/storage/media` dataset mounting differences
   - NFS exports must match the actual ZFS dataset structure, not subdirectories
   - Client mount paths must align with server export paths for data access

2. **Network Transition Management**: Tailscale vs local network connectivity during deployment
   - NFS exports need both local (10.0.0.0/24) and Tailscale (100.64.0.0/10) network access
   - The extraHosts module provides consistent hostname resolution across network changes
   - Firewall configuration must accommodate service ports for external access

3. **Reverse Proxy Configuration**: Hostname consistency is critical for proxy functionality
   - nginx upstream configuration must use correct hostnames (grey-area, not apps)
   - Service discovery relies on centralized hostname resolution modules
   - SSL certificate management works seamlessly with proper nginx configuration

4. **Service Integration**: Multi-machine service architecture requires coordinated configuration
   - The Forgejo deployment spans grey-area (service) + reverse-proxy (gateway) + DNS (domain)
   - NFS client/server coordination requires matching export/mount configurations
   - User ID consistency across machines is essential for NFS file access permissions

5. **Firewall Management**: Service-specific port configuration is essential for functionality (see the sketch below)
   - Application servers need service ports opened (3000 for Forgejo, etc.)
   - The reverse proxy needs external ports (80, 443, 1337) and internal connectivity
   - SSH access coordination between local and Tailscale networks for security
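As a closing illustration, the grey-area side of that port coordination is small in NixOS terms; a sketch based on the ports and trusted interface named in this plan, not the deployed file:

```nix
# Sketch of the grey-area firewall settings implied above
{
  networking.firewall = {
    enable = true;
    allowedTCPPorts = [ 22 3000 ];        # SSH, Forgejo
    trustedInterfaces = [ "tailscale0" ]; # allow all traffic from the mesh
  };
}
```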