Distributed Wazuh SIEM Platform
TL;DR: A security monitoring system that watches for threats across multiple clients’ networks from a single platform. The challenge: keeping each client’s data completely separate while avoiding the operational headache of running a separate system for each one. Think of it like a multi-tenant apartment building with soundproof walls—shared infrastructure, private data.
This runs on the same Talos Kubernetes platform I built for other workloads.
Problem
I needed a multi-tenant SIEM that could scale across WAN boundaries without duplicating the full stack for each tenant. The naive approach—one SIEM instance per tenant—works for a handful, but operational overhead scales linearly: every tenant adds another cluster to monitor, another set of upgrades to coordinate, another failure domain to understand. I needed a shared platform where tenants get isolation guarantees without me operating N separate stacks.
The constraints that shaped the design: agents needed to enroll securely from different network segments (some over public internet), tenant data had to be strictly isolated without the index sprawl that comes from per-tenant indices, and the whole thing had to run on infrastructure I control with upgrade paths I can test before production.
How It Works
The data flow is: agents enroll over TLS and stream telemetry to Wazuh managers, which parse and enrich it and forward it onward via Filebeat; Logstash pipelines route and tag the data; and OpenSearch stores and serves it.
Tenant isolation is the central design problem. Rather than giving each tenant their own indices—which created hundreds of indices, made cluster management painful, and made query performance unpredictable—I put all tenants’ data into shared OpenSearch datastreams and use Document-Level Security (DLS) to enforce isolation. Tenant metadata is attached at ingestion and carried through to storage; DLS filters ensure users can only query their own data. The tradeoff is that DLS configuration is subtle—get the filter wrong and you either leak data or return empty results—so this required careful testing and a clear model of how tenant identity flows end-to-end.
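To make the DLS model concrete, here is a minimal sketch of how a per-tenant role might be generated. The field name `tenant.id`, the index pattern, and the role shape follow the OpenSearch Security role format, but the specific names are illustrative assumptions, not the production configuration.

```python
import json

def tenant_role(tenant_id: str) -> dict:
    """Build an OpenSearch Security role that limits a tenant's users to
    their own documents in the shared datastream via a DLS filter.
    `tenant.id` and the index pattern are illustrative names."""
    return {
        "index_permissions": [
            {
                "index_patterns": ["wazuh-alerts-*"],
                # DLS is a query-DSL filter serialized as a JSON string;
                # only documents matching it are visible to this role.
                "dls": json.dumps({"term": {"tenant.id": tenant_id}}),
                "allowed_actions": ["read"],
            }
        ]
    }

role = tenant_role("acme")
```

Generating the role from the tenant identifier is exactly the kind of thing worth automating: the failure mode described above (a hand-edited filter that silently leaks or blanks out data) mostly disappears when the filter is derived from one source of truth.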
Enrollment security is handled through per-tenant TLS endpoints with unique enrollment tokens, automated via cert-manager. The certificate and token combination establishes tenant identity from the first packet, so identity is enforced at the transport layer rather than relying on configuration that could be misconfigured.
OpenSearch runs as a distributed cluster with dedicated nodegroups—Client nodes for query coordination, Master for cluster state, Hot for recent data, Warm for aged data, Ingest for pipeline processing. This keeps queries from competing with indexing for resources and lets me put hot data on fast storage while warm data lives on cheaper disks. ISM lifecycle policies manage the hot → warm → snapshot → delete progression automatically.
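A sketch of what the ISM policy behind that progression might look like, expressed as the dict you would PUT to the ISM API. The ages, the `temp: warm` allocation attribute, and the repository name are assumptions for illustration; the state/action structure follows OpenSearch's ISM policy format.

```python
def ism_policy(hot_days: int = 7, warm_days: int = 30,
               repo: str = "siem-snapshots") -> dict:
    """ISM policy sketch: hot -> warm -> snapshot + delete.
    Ages, node attribute, and repository name are illustrative."""
    return {"policy": {
        "description": "Tiered retention for shared SIEM datastreams",
        "default_state": "hot",
        "states": [
            {"name": "hot",
             "actions": [],
             "transitions": [{"state_name": "warm",
                              "conditions": {"min_index_age": f"{hot_days}d"}}]},
            {"name": "warm",
             # Relocate shards to warm nodes via an allocation attribute
             "actions": [{"allocation": {"require": {"temp": "warm"}}}],
             "transitions": [{"state_name": "snapshot_delete",
                              "conditions": {"min_index_age": f"{warm_days}d"}}]},
            {"name": "snapshot_delete",
             # Snapshot to remote storage, then drop the index
             "actions": [{"snapshot": {"repository": repo,
                                       "snapshot": "siem-archive"}},
                         {"delete": {}}],
             "transitions": []},
        ]}}
```

The point of encoding retention this way is that the hot/warm split in the nodegroups and the lifecycle policy stay in lockstep: the allocation action is what actually moves shards off the fast disks.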
I also built replay-safe ingestion pipelines for handling historical data during onboarding or recovery. These restore original timestamps instead of using ingestion time, so a replay from last month doesn’t trigger current alerts or skew dashboards.
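The core of the replay logic fits in a few lines. This Python sketch mirrors what the Logstash filter does: index on the event's original timestamp and tag late arrivals so alerting rules can exclude them. The field names and the five-minute window are illustrative assumptions.

```python
from datetime import datetime, timezone

# Events older than this at ingest time are treated as replays (illustrative)
REPLAY_WINDOW_S = 300

def restore_timestamp(event: dict, now: datetime) -> dict:
    """Index on the event's original timestamp rather than ingestion time,
    and tag late arrivals so live alerting can skip them."""
    original = datetime.fromisoformat(event["timestamp"])
    out = dict(event)
    out["@timestamp"] = original.isoformat()  # dashboards bucket on this field
    if (now - original).total_seconds() > REPLAY_WINDOW_S:
        out.setdefault("tags", []).append("replayed")  # suppress current alerts
    return out

now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
old = restore_timestamp({"timestamp": "2024-05-01T00:00:00+00:00"}, now)
# old["@timestamp"] keeps the May date, and old["tags"] contains "replayed"
```

Without the tag, a month of replayed events would all land in the current alerting window; with it, dashboards backfill correctly and live detection stays quiet.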
Events flow through to IRIS for case management workflows, MISP for threat intelligence enrichment, and n8n for automation.
```mermaid
flowchart TB
    Agents["Agents (cross-WAN)<br/>TLS enrollment per tenant"]
    Managers["Wazuh Managers<br/>- Parse and enrich telemetry<br/>- Per-tenant TLS endpoints<br/>Filebeat (TLS)"]
    Logstash["Logstash Pipelines<br/>- Routing (alerts vs. archives)<br/>- Tenant metadata enrichment<br/>- Replay handling + timestamp restoration"]
    OpenSearch["OpenSearch (distributed)<br/>- Client → Master → Hot → Warm → Ingest nodegroups<br/>- Shared datastreams with DLS isolation<br/>- ISM lifecycle: hot → warm → snapshot → delete"]
    CaseMgmt["Case Management<br/>- IRIS for case workflows<br/>- MISP for threat intel enrichment<br/>- n8n for automation"]
    Obs["Observability<br/>- Prometheus/Grafana for metrics<br/>- Loki for pipeline logs<br/>- Dashboards for throughput, queue depth, indexing latency"]
    Agents --> Managers --> Logstash --> OpenSearch --> CaseMgmt --> Obs
```
Scale
The platform serves over 1,500 agents across tenants, sustaining 6,000–10,000 logs per second around the clock. A distributed OpenSearch cluster spans dedicated nodegroups, multiple Logstash instances handle steady event flow with burst tolerance, and ISM-managed lifecycle with remote snapshot storage provides long-term retention.
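For a sense of what that rate means for storage planning, a quick back-of-envelope calculation. The 1 KiB average event size is an assumption for illustration, not a measured figure from the platform.

```python
def daily_volume(eps: int, avg_event_bytes: int = 1024) -> tuple[int, float]:
    """Events/day and GiB/day for a sustained events-per-second rate.
    The 1 KiB average event size is an assumed figure."""
    events = eps * 86_400            # seconds per day
    gib = events * avg_event_bytes / 2**30
    return events, gib

lo = daily_volume(6_000)    # ~518M events/day, ~494 GiB/day
hi = daily_volume(10_000)   # ~864M events/day, ~824 GiB/day
```

At roughly half a billion to nearly a billion events per day, the ISM-driven hot/warm tiering and snapshot offload above stop being a nicety and become the only way retention stays affordable.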
Outcome
Adding a tenant no longer means adding a cluster—operational overhead stays flat as the tenant count grows. Analysts can query across their authorized data without federation complexity. Agents enroll securely from anywhere without VPN dependencies. And pipeline observability tells me when things are struggling before they affect analysts.
Technologies
Talos Kubernetes, Wazuh, Logstash, OpenSearch (distributed nodegroups), ISM lifecycle management, IRIS, MISP, n8n, Prometheus, Grafana, Loki, cert-manager.
Related
- On-Premise Kubernetes Platform — The underlying platform
- Observability Platform — The monitoring approach
- Data Pipeline Skills — Patterns I’ve developed for this kind of work