Data Pipeline Accountability System
TL;DR: I automated a billing reconciliation process that used to involve someone printing reports from two different systems and cross-referencing them by hand with a highlighter. Now a dashboard shows exactly which customers are using more storage than they’ve paid for, updated automatically twice a day.
Problem
Two critical business datasets lived in completely separate third-party systems with no native integration between them. One was a CRM/accounting platform that tracked what backup capacity each customer had purchased. The other was a backup storage provider that tracked what customers were actually consuming. Understanding whether customers were within their purchased limits required someone to manually pull reports from both systems, print them, and cross-reference rows with a highlighter.
Overuse went undetected for days or weeks. There was no alerting, no trending, and no way to answer “which customers are exceeding their purchased capacity right now?” without a manual audit. Neither system offered webhooks or push-based APIs—both exposed only REST endpoints that had to be polled—and they didn’t share customer identifiers, so correlation required mapping logic.
How It Works
I wrote lightweight Go agents (I called them “conduits”) that poll each system’s REST API on a twice-daily schedule and produce records to Kafka topics—one topic per source system. I chose Go because the agents needed to be lightweight and concurrent, and single-binary deployment meant no runtime environment to manage. A general-purpose ETL tool would have been overkill for what’s fundamentally “poll an API, transform the response, produce to Kafka.”
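The core of each conduit is just that poll-transform-produce loop. A minimal sketch of the transform step in Go (the record shape, JSON field names, and sample payload here are illustrative, not the real provider's API; the actual conduit fetches the body over HTTP on a ticker and produces to Kafka instead of printing):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UsageRecord is a hypothetical shape for one customer's consumption
// reading as returned by the backup provider's REST API.
type UsageRecord struct {
	CustomerID string  `json:"customer_id"`
	UsedGB     float64 `json:"used_gb"`
	Timestamp  string  `json:"timestamp"`
}

// transform converts one raw API response body into records ready to
// produce to the per-source Kafka topic (keyed by customer ID).
func transform(body []byte) ([]UsageRecord, error) {
	var recs []UsageRecord
	if err := json.Unmarshal(body, &recs); err != nil {
		return nil, err
	}
	return recs, nil
}

func main() {
	// In the real conduit this body comes from an HTTP GET against the
	// provider's REST endpoint, twice a day; here it is inlined.
	body := []byte(`[{"customer_id":"acme","used_gb":512.5,"timestamp":"2024-01-01T06:00:00Z"}]`)
	recs, err := transform(body)
	if err != nil {
		panic(err)
	}
	for _, r := range recs {
		// In production this line is a Kafka produce call instead.
		fmt.Printf("produce key=%s value=%+v\n", r.CustomerID, r)
	}
}
```

Keying records by customer ID keeps each customer's readings on one partition, which preserves per-customer ordering downstream.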
Kafka decouples ingestion from processing and—critically—gives me replay capability. When the downstream processing logic changed (and it did, as we refined customer matching rules), I could reprocess historical data from Kafka without re-polling the source APIs. We already had Kafka in the environment, so the marginal cost was configuration rather than new infrastructure.
Custom Kafka Streams applications join purchase records from the CRM topic with consumption records from the backup provider topic, correlating them into unified accountability metrics. The join logic maintains a mapping table (sourced from CRM data) that resolves the different customer identifiers between systems, so new customers are mapped automatically as they appear. Getting the windowed joins right took iteration—since the two sources poll at different intervals, the join windows had to be tuned to avoid missing correlations when one side updated before the other.
Correlated metrics flow into InfluxDB, with Grafana dashboards providing per-customer views of purchases vs. consumption, overuse highlighting, consumption trend lines, and alerting rules that notify the team when customers exceed purchased capacity.
Outcome
What previously required printing reports and cross-referencing them with highlighters became a Grafana dashboard anyone could check in seconds. Customers exceeding purchased capacity are flagged within hours instead of weeks. For the first time, the team could see consumption trends over time and proactively reach out to customers approaching their limits—turning a reactive process into a proactive one.
Technologies
Go, Kafka, Kafka Streams, InfluxDB, Grafana, REST API integration.
Related
- Observability Platform — Similar approach to metrics collection and Grafana visualization