On-Premise Kubernetes Platform

TL;DR: Kubernetes is the industry-standard system for running and orchestrating software across a fleet of servers—most organizations rent that fleet from a cloud provider like AWS or Azure. I built this platform on dedicated hardware instead, which means predictable costs (no surprise bills based on usage), no vendor lock-in, full control over the infrastructure, and the ability to run in air-gapped environments with no internet. The platform is designed to tolerate hardware failures without data loss, and everything is defined in code so the entire cluster can be rebuilt from scratch in hours.

Technical Deep Dive — Physical server specs, VM distribution, worker pool configs, and storage architecture.

Problem

I needed to host our software economically and reliably, with a simple path to scaling. Cloud pricing works great until you’re running always-on workloads with significant storage; then the bills become large and unpredictable. On-prem gives us cost control, but it also makes us responsible for the resilience and scalability that cloud providers would otherwise handle.

The goals: infrastructure that tolerates hardware failures without data loss, scales predictably by adding nodes, and is simple enough to rebuild from scratch without it being a multi-day project.

Constraints

  • Commodity hardware. No enterprise SAN budget—resilience had to come from software, not expensive hardware.
  • Mixed storage needs. Some workloads need fast storage (databases), others need cheap capacity (backups, archives).
  • Simplified scaling. Adding capacity should be as straightforward as adding a node, not re-architecting.
  • Strong recovery guarantees. Hardware failures had to be survivable with tested recovery paths.

Decisions and Tradeoffs

Hyperconverged Proxmox with Ceph

Compute and storage run on the same physical nodes. Ceph provides distributed storage with replication across nodes, serving both the virtualization layer (Proxmox) and Kubernetes (via Rook CSI). I chose this over separate compute and storage clusters because it simplifies scaling: adding a node expands both compute and storage proportionally, and there is one cluster to monitor and one network topology to understand.

The tradeoff is Ceph’s complexity. It has a steep learning curve, it’s unforgiving of misconfiguration, and operational issues can be subtle. I’ve accepted that investment because the scaling and resilience payoff is worth it.
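To make the Rook side concrete, here is a minimal sketch of what a replicated SSD-backed pool and its Kubernetes storage class could look like. The names are illustrative, and the CSI secret parameters a real storage class needs are omitted for brevity:

```yaml
# Sketch only: pool/class names are placeholders, not this cluster's actual config.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-ssd
  namespace: rook-ceph
spec:
  failureDomain: host   # place each replica on a different physical node
  deviceClass: ssd      # restrict this pool to the SSD tier
  replicated:
    size: 3             # survive the loss of any single node's disks
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-ssd
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicated-ssd
  csi.storage.k8s.io/fstype: ext4
  # CSI provisioner/node secret parameters omitted for brevity
reclaimPolicy: Delete
allowVolumeExpansion: true
```

With `failureDomain: host` and three replicas, Ceph keeps each copy of a block on a different physical node, which is what lets the cluster lose a disk or a whole node without losing data.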

Talos Linux and IaC Provisioning

I chose Talos because it’s immutable and API-driven—no SSH, no shell, no package manager. The entire node OS is configured through a declarative API, and the cluster definition lives in git as Terraform. Rebuilding is a terraform apply away.
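A minimal sketch of what the Terraform side could look like, assuming the siderolabs/talos provider (the cluster name, endpoint, and node IP below are placeholders, not this cluster's real values):

```hcl
# Illustrative fragment; provider setup and config patches omitted.
resource "talos_machine_secrets" "this" {}

data "talos_machine_configuration" "worker" {
  cluster_name     = "onprem"                  # placeholder
  machine_type     = "worker"
  cluster_endpoint = "https://10.0.0.10:6443"  # placeholder VIP
  machine_secrets  = talos_machine_secrets.this.machine_secrets
}

# Push the declarative machine config to a node over the Talos API --
# no SSH involved at any point.
resource "talos_machine_configuration_apply" "worker_01" {
  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.worker.machine_configuration
  node                        = "10.0.0.21"    # placeholder node IP
}
```

Because the machine config is data in state and git rather than accumulated shell history, "rebuild the node" and "build the node" are the same operation.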

The tradeoff is that you can’t shell into a node to debug, which forced heavier investment in observability upfront—Prometheus, Grafana, and Loki compensate for the lack of shell access. I run separate worker pools (general-purpose, high-performance, GPU-capable) so I can right-size resources per workload profile.
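Steering a workload onto one of those pools is then just label-based scheduling. A sketch of a pod template fragment follows; the `node-pool` label key and the workload are my own illustrative conventions, not Kubernetes or Talos defaults:

```yaml
# Fragment of a Deployment pod template (surrounding fields omitted).
spec:
  template:
    spec:
      nodeSelector:
        node-pool: high-performance   # hypothetical label applied to that pool's nodes
      containers:
        - name: db
          image: postgres:16          # example of a latency-sensitive workload
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
```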

Architecture at a Glance

  • Physical layer: Multi-node Proxmox cluster with live migration. Dedicated networks for Ceph traffic, VM traffic, and management.
  • Distributed storage: Ceph with tiered SSD/HDD pools, consumed by Proxmox (RBD) and Kubernetes (Rook CSI). Separate storage classes for block (RBD) and shared filesystem (CephFS).
  • Local storage: ZFS pools for latency-sensitive workloads; TrueNAS for bulk storage and backup targets.
  • Control plane: 3 dedicated Kubernetes control-plane nodes running etcd and the API server—not application workloads.
  • Worker pools: 18 workers across general-purpose, high-performance, and GPU-capable pools.
  • Networking: Cilium (eBPF-based) for CNI, network policies, and traffic observability.
  • Delivery: ArgoCD for GitOps deployment. Changes go through git, not kubectl apply.
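As an illustration of the delivery layer, a minimal ArgoCD Application could look like the following (the repo URL, path, and names are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring           # placeholder app name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/cluster.git  # placeholder repo
    targetRevision: main
    path: apps/monitoring
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true      # delete resources removed from git
      selfHeal: true   # revert drift from manual changes
```

With `selfHeal` enabled, anything changed out-of-band gets reverted to what git declares, which is what makes git—not kubectl—the source of truth.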

Scale

  • 5 physical servers — 896 CPU threads, 3.4 TB RAM
  • 62 TiB Ceph storage — 20 OSDs across tiered SSD/HDD pools
  • 21 Kubernetes nodes — 3 control plane + 18 workers across 3 pools
  • 456 vCPUs and 2.67 TB RAM available for workloads
  • 1x NVIDIA A100 40GB for ML inference and rendering
  • Sub-minute node replacement when hardware fails

Outcome

  • Production-proven. This cluster hosts a multi-tenant SIEM platform accepting logs from over 1,500 remote agents across WAN boundaries, sustaining 6,000–10,000 logs per second around the clock. The cluster has maintained 100% uptime over the past 90 days (as of this writing).
  • Resilient. Designed to tolerate disk and node failures without data loss. Ceph self-heals by re-replicating data across remaining nodes automatically.
  • Rebuildable. I’ve rebuilt the Kubernetes cluster from scratch twice—once for a major Talos upgrade, once to change node topology. Both times it was a day, not a week.
  • Predictable costs. Monthly spend is power and hardware amortization, not usage-based cloud billing.
  • Defensible security posture. No SSH means no shell access to cluster nodes—full stop.
  • Operational clarity. Immutable OS + declarative config means I can always compare actual state to desired state in git.

Technologies

Proxmox VE, Ceph (RBD, CephFS), ZFS, Proxmox Backup Server, TrueNAS, Talos Linux, Kubernetes, Terraform, Rook Ceph, Cilium, ArgoCD, Prometheus, Grafana, Loki.
