Production-grade Kubernetes homelab: Proxmox VE • Talos Linux • Flux GitOps • Cilium • Rook-Ceph • GPU-accelerated AI.
Overview
Prox-Ops is the complete infrastructure-as-code definition for a 15-node Kubernetes cluster running on a four-host Proxmox VE cluster. It runs media services, home automation, observability tooling, and a self-hosted AI/RAG stack on consumer GPUs, all reconciled from this repository by Flux.
The design goals are:
Reproducibility: every node, network, and workload is declared in Git. The cluster can be rebuilt from a clean Proxmox install.
Cattle, not pets: Talos nodes are immutable and replaced rather than patched. Image Factory schematics are the single source of truth for kernel args and drivers.
GitOps end-to-end: Flux v2 reconciles every workload from main. Direct cluster mutation is reserved for emergency rollback.
Encrypted by default: secrets committed to Git are SOPS+age ciphertext, and runtime secrets are pulled from 1Password via External Secrets Operator + 1Password Connect.
Features
Talos Linux on every Kubernetes node: immutable, API-driven, no SSH
Cilium as the only CNI: kube-proxy replacement, L2 LoadBalancer, network policy
Multus for VLAN-attached pods: DMZ workloads and home-automation pods get their own L2 segments via macvlan
Rook-Ceph consuming an external Proxmox-managed Ceph cluster for both block (RBD) and shared (CephFS) storage
Envoy Gateway as the in-cluster ingress, fronted by Cloudflare Tunnel for WAN access (zero open inbound ports)
Renovate keeps Helm charts, container images, and Talos versions current; versions ship to the cluster via PRs, not direct talosctl upgrade
NVIDIA RTX A5000 + A2000 for AI inference; GPU drivers are baked into a dedicated Image Factory schematic so non-GPU nodes stay lean
CodeRabbit + Gitleaks + GitGuardian + Flux local validation on every PR
Self-hosted GitHub Actions runners (gha-runner-scale-set) for CI
Hardware
Proxmox Hosts
| Host | Model | CPU | Threads | RAM | Role |
| --- | --- | --- | --- | --- | --- |
| baldar | Dell PowerEdge R730xd | 2× Xeon E5-2697 v3 @ 2.60 GHz | 56 | 128 GB | Compute |
| heimdall | Dell PowerEdge R730xd | 2× Xeon E5-2697 v3 @ 2.60 GHz | 56 | 256 GB | Compute + GPU passthrough |
| odin | Dell PowerEdge R740xd | 2× Xeon Gold 6148 @ 2.40 GHz | 80 | 128 GB | Compute |
| thor | Dell PowerEdge R740xd | 2× Xeon Gold 6148 @ 2.40 GHz | 80 | 256 GB | Compute + GPU passthrough |
Kubernetes VM Layout
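Each Proxmox host runs a slice of the cluster. Every host carries three workers, and the three control-plane nodes sit on three different hosts (thor carries none), so any single host can be evacuated for maintenance without losing etcd quorum: at least two of the three control-plane nodes stay up.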
| Host | Control plane | Workers | GPU |
| --- | --- | --- | --- |
| baldar | k8s-ctrl-1 | k8s-work-1, k8s-work-2, k8s-work-3 | - |
| heimdall | k8s-ctrl-2 | k8s-work-4, k8s-work-5, k8s-work-6 | NVIDIA RTX A2000 → k8s-work-4 |
| odin | k8s-ctrl-3 | k8s-work-7, k8s-work-8, k8s-work-9 | - |
| thor | - | k8s-work-10, k8s-work-11, k8s-work-12 | NVIDIA RTX A5000 → k8s-work-10 |
GPU drivers and the NVIDIA Container Toolkit are baked into a dedicated Image Factory schematic; only GPU workers use it. All other nodes run the lean base schematic.
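The quorum claim for the layout above can be checked with a little arithmetic. The sketch below assumes etcd's usual majority rule (more than half of the control-plane members must stay up) and hard-codes the placement from the VM layout table. The function names `ctrl_nodes_on` and `quorum_after_evacuating` are hypothetical and not part of this repository.

```shell
# Number of control-plane VMs on each Proxmox host, copied from the table.
ctrl_nodes_on() {
  case "$1" in
    baldar|heimdall|odin) echo 1 ;;
    thor) echo 0 ;;
    *) return 1 ;;   # unknown host
  esac
}

# Prints "ok" if a majority of control-plane nodes survives evacuating
# the given host, "lost" otherwise.
quorum_after_evacuating() {
  total=3                                  # k8s-ctrl-1..3
  lost=$(ctrl_nodes_on "$1") || return 1
  remaining=$((total - lost))
  needed=$((total / 2 + 1))                # majority of 3 is 2
  if [ "$remaining" -ge "$needed" ]; then
    echo ok
  else
    echo lost
  fi
}
```

Calling `quorum_after_evacuating heimdall` leaves two of three control-plane nodes, which meets the majority of two. Evacuating two control-plane hosts at once would not.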
Networking
CNI: Cilium replaces kube-proxy and handles L2 LoadBalancer announcement onto the home LAN.
Secondary CNI: Multus attaches pods to additional VLANs via macvlan when a workload needs an externally-routable address (DMZ services, IoT integrations).
Internal DNS: k8s-gateway answers cluster service hostnames on the LAN; UniFi forwards selected zones to it.
External DNS: external-dns reconciles a Cloudflare zone for public records.
Ingress: Envoy Gateway (Gateway API) provides both internal (LAN) and external (WAN-via-Cloudflare-Tunnel) listeners.
WAN access: every public service is reached through a Cloudflare Tunnel; the homelab has no inbound ports open.
Service mesh / observability: Cilium Hubble for flow visibility; Prometheus + Loki for metrics and logs.
# Cluster status
kubectl get nodes
kubectl get pods -A
cilium status
flux check
# Force a Flux reconciliation
task reconcile
# Talos
talosctl --nodes <node> dashboard
talosctl --nodes <node> dmesg --follow
# Logs
flux logs --follow
kubectl logs -n <namespace> <pod>
# List all available Tasks
task --list
Security Posture
All secrets committed to Git are encrypted with SOPS using age. Plaintext secret values never reach main.
Runtime secrets are fetched from 1Password via the External Secrets Operator + 1Password Connect β application manifests reference existingSecret, never literals.
WAN ingress is exclusively via Cloudflare Tunnel; the homelab has no inbound ports open to the internet.
Authentik provides SSO for internal services. cert-manager issues TLS certificates for every service via Let's Encrypt.
Tetragon enforces eBPF-based runtime policies on sensitive namespaces.
The repository is public; pushes are gated by Gitleaks, GitGuardian, Flux local validation, and a mandatory pre-push security review.
Bootstrapping a New Cluster
The bootstrap flow follows the onedr0p/cluster-template pattern this layout was originally derived from. At a high level:
1. Provision Proxmox with a Ceph cluster reachable from your Talos VMs.
2. Generate Talos VMs via terraform/ (creates Image Factory templates and clones VMs from them).
3. Bootstrap apps: task bootstrap:apps installs Flux and seeds the cluster from this repository.
4. Verify: kubectl get nodes, cilium status, flux check.
Per-node specifics (MACs, install disks, schematic IDs, addresses) are kept in nodes.yaml, which is intentionally not committed β see .gitignore for the local-only files you'll need to populate.
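The verification step in the list above can be wrapped in a small function so that the first failing check is named. This is a minimal sketch, not a script shipped in this repository. `verify_cluster` is a hypothetical name, and it assumes `kubectl`, `cilium`, and `flux` are on the PATH and already pointed at the new cluster.

```shell
# Runs the three verification commands from the bootstrap steps in order.
# Stops at the first command that exits non-zero and names it on stderr.
verify_cluster() {
  kubectl get nodes || { echo "check failed: kubectl get nodes" >&2; return 1; }
  cilium status     || { echo "check failed: cilium status" >&2; return 1; }
  flux check        || { echo "check failed: flux check" >&2; return 1; }
  echo "all checks passed"
}
```

The function only looks at exit codes, so a node that is listed but NotReady still passes the first check. Read the command output as well before trusting the result.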
GitOps Workflow
Every change to a workload is a pull request:
1. Edit YAML under kubernetes/.
2. Open a PR: CI runs Flux local validation, secret scanning, and CodeRabbit review.
3. Merge to main: Flux reconciles within ~30 seconds.
Renovate keeps Helm charts, container images, and Talos versions current via automated PRs.
Direct kubectl against the cluster is limited to read-only inspection (get, describe, logs); the only state-mutating operation that bypasses Git is an emergency rollback, and only with explicit approval.
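One way to make the read-only rule harder to break by accident is a shell wrapper that forwards only the three verbs named above. This is a sketch under assumptions: `is_read_only_verb` and `kubectl_ro` are hypothetical helpers that do not exist in this repository, and the wrapper is a convenience for operators, not an access control (RBAC would be the enforcing layer).

```shell
# Succeeds only for the read-only verbs: get, describe, logs.
is_read_only_verb() {
  case "$1" in
    get|describe|logs) return 0 ;;
    *) return 1 ;;
  esac
}

# Forwards read-only calls to kubectl and refuses everything else.
# The verb must be the first argument.
kubectl_ro() {
  if is_read_only_verb "$1"; then
    kubectl "$@"
  else
    echo "refusing 'kubectl $1': make this change through a pull request" >&2
    return 1
  fi
}
```

With this in place, `kubectl_ro get pods -A` passes through while `kubectl_ro delete pod <pod>` is refused. Because the check looks only at the first argument, a call that puts flags before the verb (for example `kubectl_ro -n <namespace> get pods`) is also refused.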
Credits
Cluster layout, Renovate configuration, and Taskfile patterns are adapted from onedr0p/cluster-template.