Technical Deep Dive

This page unpacks the technical specifications of the On-Premise Kubernetes Platform. If you’re looking for the design rationale and tradeoffs, start there.

Production context: This cluster runs a multi-tenant SIEM platform that accepts logs from over 1,500 remote agents across WAN boundaries, sustaining 6,000–10,000 logs per second around the clock. The cluster has maintained 100% uptime over the past 90 days (as of this writing).

Physical Infrastructure: Proxmox Hyperconverged Cluster

The Kubernetes cluster runs on virtual machines provisioned on a 5-node Proxmox hyperconverged cluster. Understanding the physical layer matters because it sets the constraints on what the Kubernetes cluster can do—and explains why I distribute virtual nodes the way I do.

The hyperconverged design means adding a Proxmox node automatically expands both compute capacity and Ceph storage. This is intentional—I wanted infrastructure that scales linearly without separate storage procurement.

Physical Servers

High-Performance Tier (2 servers):

| Model | CPU | RAM | Role |
|---|---|---|---|
| Dell PowerEdge R7525 | 2x AMD EPYC 7763 64-Core (256 threads) | 1,007 GB | GPU workloads, Pool C workers |
| Supermicro Super Server | 2x AMD EPYC 7742 64-Core (256 threads) | 1,007 GB | Pool B workers |

General-Purpose Tier (3 servers):

| Model | CPU | RAM | Role |
|---|---|---|---|
| Supermicro AS-1024US-TRT | 2x AMD EPYC 7532 32-Core (128 threads) | 503 GB | Pool A workers, Control plane, Ceph MON/MGR |

Aggregate Physical Resources

| Resource | Total |
|---|---|
| CPU Threads | 896 |
| Total RAM | 3,523 GB (~3.4 TB) |
| GPU | 1x NVIDIA A100 40GB |

Ceph Distributed Storage

| Component | Details |
|---|---|
| Monitors | 4 (distributed across hosts) |
| Managers | 5 (one active, rest standby) |
| MDS | 2 active, 2 standby |
| OSDs | 20 (all up and in) |
| Total Capacity | 62 TiB |
| Objects | 3.82M |

Virtualization Strategy: Blast Radius Containment

With 5 physical servers and 21 Kubernetes nodes, I had choices about how to distribute virtual machines. The distribution is designed around blast radius containment.

Kubernetes provides resilience regardless of whether a virtual node or a physical host goes down—pods get rescheduled to surviving nodes either way. But splitting each physical server into multiple virtual nodes reduces the blast radius when a single Kubernetes node fails. If I ran one massive VM per physical host, losing that VM (kernel panic, misconfiguration, failed upgrade) would take a large chunk of cluster capacity offline. By running multiple smaller VMs per host, a single VM failure only loses a fraction of that host’s resources.

The tradeoff is overhead: more VMs means more operating system instances, more memory reserved for each VM’s kernel, and more coordination. I sized the VMs to balance blast radius against that overhead—large enough to run meaningful workloads efficiently, small enough that losing one doesn’t cascade.

  • Control plane nodes are distributed one-per-host across the three general-purpose servers. A single host failure loses one of three control plane nodes, leaving quorum intact.
  • Pool A workers are spread 3-per-host across the same three servers. A host failure loses 3 of 9 workers; a single VM failure loses 1 of 9.
  • Pool B and Pool C run on dedicated high-performance hosts. These pools run workloads with application-level redundancy (database replicas, distributed caches) that can tolerate node-level failures.
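Application-level redundancy only pays off if replicas actually land on different nodes. A minimal sketch of how that spread can be enforced, assuming a hypothetical workload (the name, replica count, and image are illustrative, not from this cluster):

```yaml
# Illustrative Deployment fragment: spread replicas across nodes so a
# single VM failure takes out at most one copy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api        # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname   # at most one extra replica per node
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: example-api
      containers:
        - name: api
          image: example/api:latest             # placeholder image
```

Spreading on the hostname key handles VM-level failures; spreading across physical hosts would additionally require a host-identifying node label, which isn't described in the spec above.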

Kubernetes Cluster Specifications

| Property | Value |
|---|---|
| Kubernetes Version | v1.32.0 |
| Talos Version | v1.11.6 |
| Total Nodes | 21 |
| Control Plane Nodes | 3 |
| Worker Nodes | 18 |

Aggregate Resources

| Resource | Total |
|---|---|
| vCPUs | 456 |
| RAM | 2,688 GB (~2.7 TB) |
| GPU | 1x NVIDIA A100 40GB |

Control Plane Design

I keep the control plane minimal and dedicated—these nodes run etcd and the Kubernetes API server, not workloads. The sizing is intentionally modest because the control plane isn’t where compute-intensive work happens.

| Property | Value |
|---|---|
| Nodes | 3 |
| Resources per node | 8 vCPUs, 32 GB RAM |
| Total | 24 vCPUs, 96 GB RAM |

The three-node control plane provides quorum for etcd: one node can fail while the remaining two still form a majority, so the cluster keeps making scheduling decisions. I distribute them across different Proxmox hosts to ensure a single host failure doesn’t take out the control plane.

Worker Pool Design

I run three worker pools with distinct resource profiles and scheduling characteristics. The reasoning: not all workloads have the same shape, and trying to run everything on a homogeneous pool means either over-provisioning everywhere or starving some workloads.

Pool A – General Purpose (9 nodes)

| Property | Value |
|---|---|
| Distribution | 3 workers per host, spread across 3 hosts |
| Resources per node | 16 vCPUs, 96 GB RAM |
| Labels | node.kubernetes.io/pool=pool-a, node.kubernetes.io/workload=general-purpose |
| Total | 144 vCPUs, 864 GB RAM |

This pool handles the bulk of workloads—web services, background jobs, platform services. The nodes are sized to run multiple medium-sized pods without contention, and spreading across three hosts provides resilience against single-host failures.
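Workloads target a pool through its node labels. A minimal sketch using the Pool A label from the table above (the pod name, image, and resource figures are illustrative):

```yaml
# Illustrative pod fragment: pin a general-purpose workload to Pool A
# via the pool label.
apiVersion: v1
kind: Pod
metadata:
  name: example-worker             # hypothetical workload
spec:
  nodeSelector:
    node.kubernetes.io/pool: pool-a
  containers:
    - name: worker
      image: example/worker:latest # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
```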

Pool B – High Performance (4 nodes)

| Property | Value |
|---|---|
| Hardware | AMD EPYC 7742 host |
| Resources per node | 32 vCPUs, 192 GB RAM |
| Labels | node.kubernetes.io/pool=pool-b, node.kubernetes.io/workload=high-performance |
| Features | NUMA enabled |
| Total | 128 vCPUs, 768 GB RAM |

This pool runs latency-sensitive and memory-intensive workloads—databases, caches, search indices. NUMA awareness helps here because these workloads benefit from memory locality. The larger per-node sizing means fewer pods per node, which reduces noisy-neighbor effects.
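For NUMA alignment to actually kick in, pods generally need the Guaranteed QoS class (requests equal to limits with integer CPU counts), which is what lets the kubelet's static CPU manager pin exclusive, locality-aligned cores. A sketch assuming that policy is enabled on Pool B nodes (the kubelet configuration isn't stated in the spec above, and the workload details are illustrative):

```yaml
# Illustrative fragment: Guaranteed QoS (requests == limits) on Pool B.
# With the kubelet's static CPU manager policy, the integer CPU request
# gets exclusive cores that the topology manager can NUMA-align.
apiVersion: v1
kind: Pod
metadata:
  name: example-cache              # hypothetical workload
spec:
  nodeSelector:
    node.kubernetes.io/pool: pool-b
  containers:
    - name: cache
      image: example/cache:latest  # placeholder image
      resources:
        requests:
          cpu: "4"
          memory: 16Gi
        limits:
          cpu: "4"
          memory: 16Gi
```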

Pool C – High Performance + GPU (5 nodes)

| Property | Value |
|---|---|
| Hardware | AMD EPYC 7763 host |
| Resources per node | 32 vCPUs, 192 GB RAM |
| GPU node | 1x NVIDIA A100 40GB (PCIe passthrough) |
| Labels | node.kubernetes.io/pool=pool-c, node.kubernetes.io/workload=high-performance |
| GPU Labels | nvidia.com/gpu.present=true |
| Taints (GPU) | nvidia.com/gpu=present:NoSchedule |
| Total | 160 vCPUs, 960 GB RAM, 1x A100 GPU |

The GPU taint ensures only workloads that explicitly request GPU resources get scheduled to the GPU node. Without this, the scheduler might place general workloads there and starve GPU workloads of the node’s CPU and memory.
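In practice that means a GPU workload needs both a toleration for the taint and an explicit GPU resource request. A sketch using the taint and label values from the table above (the pod name and image are illustrative):

```yaml
# Illustrative fragment: tolerate the GPU taint, select the GPU node,
# and request the device so the NVIDIA device plugin allocates it.
apiVersion: v1
kind: Pod
metadata:
  name: example-training           # hypothetical workload
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
    - name: train
      image: example/train:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```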

Storage Classes

I use multiple storage backends because different workloads have different storage requirements. A database wants block storage with strong consistency; a shared config directory wants a filesystem that multiple pods can mount simultaneously.

| Storage Class | Type | Default | Use Case |
|---|---|---|---|
| ceph-rbd | Block (RBD) | Yes | General workloads requiring persistent block storage |
| cephfs | Filesystem | No | Shared storage (RWX) for distributed workloads |
| truenas-iscsi | iSCSI | No | TrueNAS-backed storage for specific performance profiles |

Ceph Integration (Rook)

Ceph provides the primary storage tier, integrated via Rook’s CSI driver. The underlying Ceph cluster runs on the same Proxmox hosts, which means storage performance scales with compute—adding a node improves both.

  • RBD (Block): ReadWriteOnce volumes for databases, stateful workloads
  • CephFS (File): ReadWriteMany volumes for shared data across pods
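The two access patterns map directly onto the storage classes from the table above. A minimal sketch (claim names and sizes are illustrative):

```yaml
# Illustrative PVCs: block storage for a single-writer database,
# shared filesystem storage for multi-pod access.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data                    # hypothetical database volume
spec:
  storageClassName: ceph-rbd
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-config              # hypothetical shared volume
spec:
  storageClassName: cephfs
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 5Gi
```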

TrueNAS Integration (Democratic CSI)

TrueNAS provides an alternative storage tier via the Democratic CSI driver. I use this for workloads that benefit from ZFS features (snapshots, clones) or need a different performance profile than Ceph provides.
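Those ZFS snapshots are exposed through the CSI snapshot API. A sketch of taking one, assuming Democratic CSI is configured with a snapshot class; the class name and PVC name here are assumptions, not values from this cluster:

```yaml
# Illustrative snapshot of a TrueNAS-backed volume via the CSI snapshot API.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: app-data-snap
spec:
  volumeSnapshotClassName: truenas-iscsi-snapshot  # assumed class name
  source:
    persistentVolumeClaimName: app-data            # hypothetical PVC
```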