Decian — Principal Software Engineer

2020 – Present | Platform engineering, security operations, infrastructure, application development

I lead architecture and delivery of a multi-tenant security platform and the supporting on-prem infrastructure at Decian, a Managed Service Provider. In practical terms, this means I’m responsible for everything from the bare metal (Proxmox, Ceph) through the orchestration layer (Kubernetes) to the application layer (SIEM, case management, observability, client-facing tools).

I’m hands-on with architecture decisions and implementation, particularly for infrastructure and platform work. I also lead a team, which means balancing building things myself against enabling others to build things well.

What I’ve Built

Most of the portfolio represents work from this role. The highlights:

SIEM Product

  • Multi-tenant SIEM built on Wazuh with OpenSearch. I architected and launched this as Decian’s core product, enabling the MSP to onboard tenants from a single shared platform instead of duplicating infrastructure per client. The interesting challenge was supporting multiple tenants while guaranteeing isolation. I used Document-Level Security rather than per-tenant indices because index sprawl was making the cluster unmanageable. Cross-WAN enrollment and replay-safe ingestion pipelines allowed onboarding clients with historical data intact. Scaled to over 1,500 agents across WAN boundaries, sustaining 6,000–10,000 logs per second.
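
The DLS approach can be sketched as a role body pushed to the OpenSearch Security API. A minimal Python sketch, assuming documents are tagged with a `tenant.id` field at ingest (the field name and index pattern are illustrative, not the production values):

```python
import json

def tenant_dls_role(tenant_id):
    """Build an OpenSearch Security role body that restricts reads to
    one tenant's documents via Document-Level Security (DLS).

    The "tenant.id" field and the index pattern are illustrative; the
    real names depend on how the ingest pipeline tags each document.
    """
    dls_query = {"term": {"tenant.id": tenant_id}}
    return {
        "index_permissions": [
            {
                "index_patterns": ["wazuh-alerts-*"],
                # The security plugin expects DLS as a JSON-encoded query string.
                "dls": json.dumps(dls_query),
                "allowed_actions": ["read"],
            }
        ]
    }

role = tenant_dls_role("acme")
print(json.loads(role["index_permissions"][0]["dls"]))
# {'term': {'tenant.id': 'acme'}}
```

Because every tenant shares the same indices, adding a tenant is one role plus one ingest tag, not another set of indices to manage.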

  • Case management and enrichment workflows integrating IRIS for case tracking, MISP for threat intelligence, and n8n for automation. These tools connect security events to investigations.

Infrastructure Foundation

  • On-prem infrastructure platform from bare metal up: a hyperconverged Proxmox cluster with 62 TiB Ceph distributed storage (tiered SSD/HDD, survived disk and full node failures without data loss) running a 21-node Kubernetes platform on Talos Linux (456 vCPUs, 2.67 TB RAM, NVIDIA A100 GPU). This cluster hosts the SIEM platform described above, accepting logs from over 1,500 remote agents, and has maintained 100% uptime over the past 90 days (as of this writing). I built this because cloud costs were unpredictable for our workload profile, and I wanted infrastructure I could fully control and understand. Talos eliminated drift by removing SSH entirely—the whole stack is provisioned via Terraform and rebuildable from scratch in hours.

  • Edge routing with HA failover and SNI-based TCP routing. This eliminates cloud load balancer dependencies while maintaining automatic failover within 3–5 seconds via VRRP. HTTPS-only ingress with SNI-based routing keeps the attack surface minimal—two open ports serving all web-facing services.
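
The SNI routing can be sketched as an HAProxy TCP frontend that inspects the TLS ClientHello and routes without terminating TLS at the edge; hostnames and backend names here are illustrative, not the production config:

```
frontend tls_in
    bind :443
    mode tcp
    # Wait briefly for the TLS ClientHello so the SNI is readable.
    tcp-request inspect-delay 5s
    tcp-request content accept if { req.ssl_hello_type 1 }
    # Route on the requested hostname (SNI) without decrypting here.
    use_backend siem_ingress   if { req.ssl_sni -m end siem.example.com }
    use_backend portal_ingress if { req.ssl_sni -m end portal.example.com }
    default_backend portal_ingress
```

Keepalived's VRRP then floats the public IP between two such HAProxy nodes, which is what provides the 3–5 second failover.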

Observability

  • End-to-end visibility across ingestion, processing, and storage. I built this because generic infrastructure monitoring wasn’t answering the questions we asked during incidents. “CPU is high” doesn’t tell you what to fix; “Logstash queue is backing up because indexing latency spiked” does. Domain-specific dashboards that correlate metrics and logs across pipeline stages cut incident triage from hours to under 5 minutes.
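
The kind of correlation those dashboards encode can be sketched as a small classifier over two time series; the thresholds and decision rule here are illustrative, not the production alerting logic:

```python
def backpressure(queue_depths, index_latencies_ms, latency_slo_ms=200):
    """Classify a pipeline incident from two correlated series.

    Returns "indexing" when the ingest queue is growing *and* the
    storage tier is slow (fix the cluster), "ingest" when the queue
    grows while indexing stays healthy (fix the pipeline), else "ok".
    Thresholds are illustrative, not production values.
    """
    queue_growing = queue_depths[-1] > queue_depths[0]
    slow_indexing = max(index_latencies_ms) > latency_slo_ms
    if queue_growing and slow_indexing:
        return "indexing"
    if queue_growing:
        return "ingest"
    return "ok"

print(backpressure([100, 400, 900], [50, 250, 400]))  # indexing
print(backpressure([100, 400, 900], [50, 60, 55]))    # ingest
```

The point is that the dashboard answers "which stage do I fix?" directly, instead of leaving the correlation to whoever is on call.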

Data and Business Operations

  • Data pipeline accountability system that replaced manual spreadsheet-based billing reconciliation with automated same-day overuse detection for the MSP business. I built Go-based polling agents that fed data from disconnected CRM and backup systems into Kafka, with stream joins producing unified billing visibility and trend analysis.
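
The join logic can be sketched independently of Kafka: two streams keyed by client, with the CRM side providing the billed entitlement and the backup side providing measured usage. Field names and the billing contract here are assumptions for illustration:

```python
def detect_overuse(crm_events, backup_events):
    """Join two keyed streams (plain lists standing in for Kafka topics
    keyed by client id) and flag clients whose measured backup usage
    exceeds what the CRM says they are billed for.

    Field names ("billed_gb", "used_gb") are illustrative.
    """
    billed = {e["client"]: e["billed_gb"] for e in crm_events}
    overuse = {}
    for e in backup_events:
        limit = billed.get(e["client"])
        if limit is not None and e["used_gb"] > limit:
            overuse[e["client"]] = e["used_gb"] - limit
    return overuse

crm = [{"client": "acme", "billed_gb": 500}, {"client": "globex", "billed_gb": 1000}]
backups = [{"client": "acme", "used_gb": 640}, {"client": "globex", "used_gb": 900}]
print(detect_overuse(crm, backups))  # {'acme': 140}
```

In the real pipeline this runs continuously as a stream join, which is what turns a monthly spreadsheet audit into same-day detection.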

  • HubSpot-to-Airtable connector — a stateless Rust microservice (Axum) that listens for HubSpot CRM webhook events and mirrors data into Airtable via configurable YAML mappings. It supports field-level type coercion, cross-table link lookups, HubSpot association syncing, scheduled full syncs, and soft-delete handling. I built this to replace manual CRM-to-spreadsheet workflows with real-time, event-driven sync—giving operations teams a live, queryable view of CRM data without leaving Airtable.
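
The production connector is Rust, but the mapping-driven coercion idea can be sketched in Python; the mapping shape, field names, and coercion table below are illustrative:

```python
# Sketch of mapping-driven field coercion. Each mapping mirrors a
# hypothetical YAML entry like:
#   - hubspot_field: amount
#     airtable_field: Amount
#     type: number
COERCERS = {
    "number": float,
    "string": str,
    "checkbox": lambda v: str(v).lower() in ("true", "1", "yes"),
}

def apply_mapping(record, mappings):
    """Project a HubSpot webhook payload onto Airtable fields,
    coercing each value per the mapping's declared type."""
    out = {}
    for m in mappings:
        raw = record.get(m["hubspot_field"])
        if raw is not None:
            out[m["airtable_field"]] = COERCERS[m["type"]](raw)
    return out

mappings = [
    {"hubspot_field": "amount", "airtable_field": "Amount", "type": "number"},
    {"hubspot_field": "closed", "airtable_field": "Closed", "type": "checkbox"},
]
print(apply_mapping({"amount": "1200.50", "closed": "true"}, mappings))
# {'Amount': 1200.5, 'Closed': True}
```

Keeping the mapping in configuration rather than code is what lets operations teams add fields without a redeploy.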

Client-Facing Applications

  • “Mission Control” training operations portal — an installable PWA for a client coordinating national in-person corporate training, integrating HubSpot (CRM), Airtable (replacing unwieldy Excel spreadsheets with a familiar, cloud-based spreadsheet-style workspace), and the HubSpot-to-Airtable connector above. The portal handles trainer scheduling, deliverable tracking, calendar timelines, and operational dashboards (Next.js, TypeScript, Chakra UI, Microsoft Entra ID). I designed a local-first, stateless client: IndexedDB is the primary read source, and a sync layer queues mutations and retries them on reconnect, so coordinators get a fast, reliable tool in the field regardless of connectivity. Both the connector and the PWA are stateless by design, which keeps the integration simple.
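
The queued-mutation pattern can be sketched language-agnostically (here in Python; the real client persists the queue in IndexedDB, and the `send` callback stands in for the network layer):

```python
class MutationQueue:
    """Sketch of the queued-mutation pattern: writes land in a durable
    local queue (IndexedDB in the real client) and are flushed in order
    when connectivity returns. Failed sends stay queued for the next
    attempt."""

    def __init__(self, send):
        self.send = send      # callable(mutation) -> bool (True = acked)
        self.pending = []

    def enqueue(self, mutation):
        self.pending.append(mutation)

    def flush(self):
        """Retry pending mutations in order; stop at the first failure
        so ordering is preserved."""
        while self.pending:
            if not self.send(self.pending[0]):
                break
            self.pending.pop(0)

offline = [True, False]  # first flush fails, second succeeds
q = MutationQueue(send=lambda m: not offline.pop(0))
q.enqueue({"op": "update", "id": 7})
q.flush(); print(len(q.pending))  # 1 (still queued while offline)
q.flush(); print(len(q.pending))  # 0 (drained on reconnect)
```

Because the server sees an ordered stream of idempotent mutations, the client can stay stateless apart from this queue.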

Delivery

  • GitOps delivery via ArgoCD — I standardized deployment patterns across all services, enabling the team to add tenants and services without proportional growth in operational work.
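
Under this pattern, onboarding a tenant or service reduces to committing one ArgoCD Application manifest; the repo URL, paths, and names below are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: siem-tenant-acme        # hypothetical tenant
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploys.git
    targetRevision: main
    path: tenants/acme          # per-tenant overlay in the deploy repo
  destination:
    server: https://kubernetes.default.svc
    namespace: tenant-acme
  syncPolicy:
    automated:
      prune: true               # remove resources deleted from git
      selfHeal: true            # revert out-of-band drift
```

ArgoCD reconciles the cluster to whatever is in git, so adding tenants scales the git repo, not the operational workload.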

Why These Decisions

On-prem by default, cloud when justified. Cloud services are convenient, but they’re also opaque and unpredictably expensive. For always-on infrastructure with significant storage requirements, I’ve found that on-prem is more cost-effective and gives me better control. The tradeoff is that I’m responsible for resilience that cloud providers would otherwise handle.

Shared infrastructure with strong isolation. Running separate stacks per tenant doesn’t scale operationally. Every tenant adds another cluster to monitor, another upgrade to coordinate. Shared infrastructure with proper isolation (DLS, RBAC, network policies) is more efficient—but it requires getting the isolation right.

Observability as architecture validation. I think of observability not just as “monitoring” but as a way to validate that systems behave as designed. When I make an architectural assumption (“this pipeline can handle 10k events/sec”), I instrument it so I’ll know when that assumption breaks.

Rust for stateless microservices. I chose Rust for the HubSpot-to-Airtable connector because it compiles to a single binary, has excellent error handling via Result, and runs with minimal resource overhead in Kubernetes. For a webhook-driven sync service that needs to be reliable and lightweight, the investment in Rust’s type system pays off in fewer runtime surprises.

Outcomes

  • Platform reliability through layered observability and predictable storage tiers.
  • Reduced operational toil by standardizing deployment patterns via ArgoCD.
  • Multi-tenant scale without linear infrastructure growth—onboarding tenants from shared infrastructure.
  • Shorter incident triage (under 5 minutes) through correlated pipeline visibility.
  • Automated billing reconciliation replacing manual spreadsheet audits with same-day overuse detection.
  • Real-time CRM sync eliminating manual data entry between HubSpot and Airtable.

What I’ve Learned

  • Infrastructure-as-code pays off slowly, then all at once. The initial investment feels heavy, but when you need to rebuild or reproduce an environment, having everything in git is invaluable.
  • Observability is cheaper than debugging. Investing in instrumentation upfront costs time, but it saves much more time when things go wrong.
  • Isolation is harder than it looks. Multi-tenant systems require thinking about isolation at every layer.
  • Stateless microservices simplify operations. When a service has no state, deployment, scaling, and recovery are all straightforward. I’ve been pushing toward this pattern wherever the problem allows it.

Technologies: Talos Linux, Kubernetes, Proxmox, Ceph, OpenSearch, Wazuh, Logstash, IRIS, MISP, n8n, Prometheus, Grafana, Loki, HAProxy, Keepalived, Terraform, ArgoCD, Cilium, Rust, Go, Python, TypeScript, Next.js, Kafka, InfluxDB.

  • Portfolio — Detailed write-ups of specific projects from this role
  • Skills — Technologies and patterns I’ve developed
  • Leadership — How I think about leading technical work