Agentic Engineering Methodology


TL;DR: I developed a repeatable methodology for using LLMs as implementation agents — not autocomplete, not pair programmers, but directed agents operating under strict constraints. The approach centers on TDD-first workflows, explicit acceptance criteria, and least-privilege MCP scoping. It’s what allowed me to ship a Rust microservice, offline-capable PWA, automated billing engine, and observability tooling as a solo engineer at Decian.

Problem

Agentic AI coding tools are genuinely capable. They can produce working code faster than I can type it. But “working code” and “production-quality code” are different things — and the gap between them is where most engineers get burned.

The core problem I needed to solve wasn’t “how do I use AI to write code.” It was: how do I use AI to write code I’d be willing to ship, maintain, and debug at 2am — without introducing hallucinated logic, losing track of requirements mid-session, or spending more time reviewing AI output than I would have spent writing it myself.

This mattered because I was operating as a solo engineer responsible for multiple production systems. I didn’t have the luxury of sloppy output or wasted review cycles. Every hour spent debugging AI-generated nonsense was an hour not spent shipping.

Constraints

Three properties of LLMs shaped every decision in this methodology:

Context decay. LLMs don’t remember. A binding constraint stated clearly in your first prompt is gone by prompt five — not because the model is lazy, but because it’s architecturally incapable of holding unbounded context. Any methodology that relies on the model remembering earlier instructions across a long session will fail.

Assumption-making under ambiguity. When requirements are vague, LLMs don’t ask clarifying questions — they fill in the blanks with plausible-sounding assumptions. Sometimes they guess right. Often they don’t. And the guesses are confident enough that you might not catch them until something breaks in production.

Apparent completion masking real gaps. The code compiles. The happy path works. But edge cases, error handling, and the implicit requirements you never stated? Missing. First-pass LLM output consistently gives the impression of completeness while leaving the hard parts unaddressed.

These aren’t bugs to be fixed in the next model release. They’re structural properties of how LLMs work. Any effective methodology has to route around them, not hope they improve.

Decisions and Tradeoffs

Owner/builder separation

I own architecture, requirements, and acceptance criteria. The LLM implements against those boundaries. This is the foundational decision everything else rests on.

I chose this framing because LLMs are unreliable at architecture decisions and requirements gathering — tasks that depend on implicit organizational context, business judgment, and understanding what “good enough” means in a specific situation. They’re excellent at implementation: turning a clear specification into working code. Separating these concerns plays to each party’s strengths.

The tradeoff is that this requires me to do more upfront specification work than I might with a human developer who can intuit context. But that specification work pays for itself — it eliminates the most expensive failure mode (building the wrong thing fast) and produces artifacts that are useful beyond the AI interaction.

TDD-first workflows

I instruct the model to write tests before implementation. This is the single most effective guardrail I’ve found against context decay and hallucination.

Tests encode requirements in executable form. When the model’s context window loses track of a constraint from five prompts ago, the test still enforces it. When the model hallucinates a plausible-sounding but wrong behavior, the test catches it before I do. The tests also serve as documentation that persists beyond the session — anyone reading the codebase later (including me) can understand what the code is supposed to do by reading what it’s tested against.

The tradeoff is speed on the first iteration. Writing tests first means the model produces less code per prompt cycle. But the total time to production-quality output is consistently shorter because I spend less time in review-and-fix loops.
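Concretely, the session's first artifact is a test that pins a requirement down in executable form. A minimal sketch of the pattern — the `normalize_invoice_total` function and its rules are hypothetical, invented for illustration, not from an actual Decian codebase:

```python
from decimal import Decimal

# Step 1: a human-authored test pins the requirement before any
# implementation exists. If later prompts drift, this still enforces it.
def test_totals_are_cents_and_never_negative():
    assert normalize_invoice_total("12.50") == 1250  # dollars -> cents
    assert normalize_invoice_total("0") == 0
    try:
        normalize_invoice_total("-3.10")  # refunds are handled elsewhere
        assert False, "expected ValueError for negative amounts"
    except ValueError:
        pass

# Step 2: the agent implements against the test, not against a
# half-remembered instruction from five prompts ago.
def normalize_invoice_total(amount: str) -> int:
    cents = int(Decimal(amount) * 100)
    if cents < 0:
        raise ValueError("invoice totals must be non-negative")
    return cents
```

The test survives the session: when the model later "simplifies" the function, the negative-amount case fails immediately instead of surfacing in production.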

Strict user story discipline

Every agentic implementation session starts with a user story: background, context, foreseeable challenges, and explicit acceptance criteria. If I can’t explain what “done” looks like precisely enough for the model to verify it, I don’t understand the problem well enough to delegate it.

I arrived at this through painful experience. Early on, I’d hand the model a loose description and iterate toward what I wanted through conversational refinement. This felt productive — the model was responsive, the code was flowing — but the end result was consistently mediocre. Requirements drifted. The model optimized for the most recent prompt rather than the original intent. I’d end up with code that addressed my latest correction but had lost earlier constraints.

Writing the story upfront costs 10-15 minutes. It saves hours of iteration and produces a specification artifact I can reuse if I need to re-implement or hand the work to a different model.
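The story itself is a lightweight, reusable artifact. A hypothetical skeleton — the field names are my convention, not a formal standard, and the example content is invented:

```yaml
story: Sync HubSpot deal stage changes to Airtable   # hypothetical example
background: >
  Sales operates in HubSpot; delivery tracks work in Airtable.
  Stage changes currently propagate by hand, often a day late.
context:
  - Sync is one-directional (HubSpot -> Airtable)
  - Airtable base schema is fixed; no new fields may be added
foreseeable_challenges:
  - HubSpot webhooks can arrive out of order
  - Rate limits on the Airtable API
acceptance_criteria:
  - A stage change appears in Airtable within 60 seconds
  - Out-of-order webhooks never regress a deal to an earlier stage
  - Every sync failure is logged with the deal ID and retried
```

Everything the model needs to verify "done" is in `acceptance_criteria`; everything it would otherwise guess at is in `context` and `foreseeable_challenges`.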

Least-privilege MCP scoping

I invest in connecting Model Context Protocol servers to the development environment — but with tight scoping. The model gets exactly the tools and access it needs for the current task, nothing more.

This matters for two reasons. First, it dramatically expands what the model can do: reading documentation, interacting with APIs, running tests, checking build output. The model goes from generating code in a vacuum to operating within the actual development environment. Second, the scoping prevents the model from causing damage outside the task boundary — no accidental writes to production databases, no commits to wrong branches, no access to systems unrelated to the current work.

The setup cost is real — the first hour of configuring MCPs, SSH keys, and environment scoping doesn’t produce any output. But it pays off across every subsequent session. I treat it the same way I’d treat CI/CD setup: an investment in the development environment that compounds.
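As a sketch of what tight scoping looks like in practice — this follows the common `mcpServers` configuration shape used by MCP clients such as Cursor and Claude Desktop, but the server names, package, and script paths are hypothetical:

```json
{
  "mcpServers": {
    "project-db": {
      "command": "npx",
      "args": ["-y", "@example/postgres-mcp", "--read-only"],
      "env": { "DATABASE_URL": "postgres://readonly@localhost/dev_db" }
    },
    "repo-tools": {
      "command": "./scripts/repo-mcp.sh",
      "args": ["--allow", "test,build", "--branch", "feature/*"]
    }
  }
}
```

The principle is the same as least-privilege service accounts: a read-only connection string to a dev database, and repo tooling restricted to the operations and branches the current task actually needs.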

Iterative review modeled on code review

I treat every agentic output the way I’d treat a pull request from a capable but inexperienced developer. The code might work, but I check it: error handling, edge cases, naming, consistency with the rest of the codebase, and whether the implementation actually matches the specification rather than a plausible interpretation of it.

This is where the methodology costs the most time relative to “just letting the AI write it.” But it’s also where the quality difference is starkest. Unreviewed AI output accumulates subtle issues — slightly wrong error messages, missing null checks, logging that doesn’t follow the project’s conventions — that individually seem trivial but collectively make a codebase harder to maintain.

What This Enabled

This methodology is what made it feasible for me to ship multiple production systems at Decian as a solo engineer:

  • Mission Control Portal — A stateless Rust/Axum microservice with YAML-driven HubSpot-to-Airtable sync, plus an offline-capable Next.js/React PWA with IndexedDB-backed local-first architecture. Two distinct technology stacks, shipped in parallel.

  • Data Pipeline Accountability — Go-based polling agents, Kafka stream joins, and InfluxDB/Grafana dashboards replacing a manual spreadsheet-and-highlighter audit process. The domain logic (correlating customer identifiers across systems with no shared keys) was specified in tests before the model wrote the join implementation.

  • Pipeline Observability — Real-time visibility across multi-tenant data pipelines. Diagnostic dashboards that answer operational questions rather than just displaying metrics.

  • This digital garden and resume tooling — The site you’re reading, the Handlebars/Puppeteer resume generator, and the content pipeline are all products of agentic workflows operating under this methodology.

None of these were trivial. Each involved multiple technology stacks, production reliability requirements, and integration with existing systems. The methodology didn’t make the problems simpler — it made a solo engineer’s throughput match the scope of the problems.
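The "specified in tests first" pattern from the Data Pipeline Accountability work deserves a concrete illustration. The record shapes and normalization rules below are hypothetical, but the shape of the problem — joining records from systems that share no key — is the real one:

```python
# Hypothetical sketch: two systems share no key, so records are
# correlated on a normalized customer name. The test below existed
# before the join implementation did.
def normalize_key(name: str) -> str:
    # Drop punctuation, case, and legal suffixes so "Acme, Inc." == "ACME INC"
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ")
    words = [w for w in cleaned.split() if w not in {"inc", "llc", "co"}]
    return " ".join(words)

def join_records(crm_rows, billing_rows):
    # Index billing rows by normalized name, then correlate CRM rows.
    billing_by_key = {normalize_key(r["customer"]): r for r in billing_rows}
    return [
        (crm, billing_by_key[normalize_key(crm["name"])])
        for crm in crm_rows
        if normalize_key(crm["name"]) in billing_by_key
    ]

def test_join_correlates_despite_formatting_differences():
    crm = [{"name": "Acme, Inc.", "deal": 42}]
    billing = [{"customer": "ACME INC", "invoice": "B-7"}]
    joined = join_records(crm, billing)
    assert len(joined) == 1
    assert joined[0][0]["deal"] == 42 and joined[0][1]["invoice"] == "B-7"
```

Writing the test first forced the hard question — what counts as "the same customer"? — to be answered by a human before the model wrote any join logic.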

Outcome

The practical result is a significant multiplier on delivery velocity without sacrificing the quality I’d expect from a carefully staffed team. Production systems that would typically require multiple specialized engineers — backend, frontend, infrastructure, data — were shipped by one person operating with disciplined agentic workflows.

The less obvious result is that this methodology made me better at specification and product management. When every vague requirement becomes a hallucinated implementation within seconds, the feedback loop on specification quality is brutally fast. I write clearer user stories, more precise acceptance criteria, and better-structured architecture documents than I did before — because the cost of ambiguity became immediate and visible.

Current Direction: Multi-Agent Orchestration

The methodology above describes a human directing a single AI agent. The natural next question is: what happens when the same principles — role separation, test-anchored guardrails, scoped access — are applied to a team of specialized agents working together?

I’m actively exploring this with CrewAI, orchestrating multi-agent workflows for end-to-end ERPNext implementations. The target domain is deliberately complex: ERP implementations involve requirements analysis, data modeling, business logic, integration work, and ongoing customization — the kind of work that typically requires a cross-functional team.

The agent roles map directly to the disciplines a real implementation team would staff:

  • Product manager agent — translates business requirements into structured specifications
  • Analyst agent — decomposes specifications into implementable work items with acceptance criteria
  • Coding agents — implement against those specifications (the same owner/builder pattern, but now the “owner” is also an agent operating under human oversight)
  • Adversarial TDD agents — write tests designed to break the implementation, not just confirm it works. This is the “tests as guardrails” principle from the single-agent methodology, but automated and adversarial rather than human-directed
  • Scribe agent — maintains a Quartz-based knowledge base that serves as persistent, shared context across all agents
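Stripped of any framework, the hand-off pattern between these roles can be sketched in plain Python. This is an illustration of the pipeline shape only — the functions below stand in for LLM-backed agents and this is not CrewAI's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class WorkItem:
    spec: str
    acceptance: list
    artifacts: dict = field(default_factory=dict)

# Each "agent" is modeled as a plain function; in a real system these
# would be LLM-backed, but the hand-off contract is the same.
def product_manager(requirement: str) -> WorkItem:
    return WorkItem(spec=f"spec for: {requirement}",
                    acceptance=["happy path works", "errors are logged"])

def analyst(item: WorkItem) -> WorkItem:
    item.artifacts["tasks"] = [f"task: {c}" for c in item.acceptance]
    return item

def coder(item: WorkItem) -> WorkItem:
    item.artifacts["code"] = f"implementation of {item.spec}"
    return item

def adversarial_tester(item: WorkItem) -> WorkItem:
    # Tests target the acceptance criteria, not the implementation.
    item.artifacts["tests"] = [f"breaking test for: {c}" for c in item.acceptance]
    return item

def run_pipeline(requirement: str) -> WorkItem:
    item = product_manager(requirement)
    for stage in (analyst, coder, adversarial_tester):
        item = stage(item)
    return item
```

The point of the sketch is the contract: each role consumes and enriches the same work item, so acceptance criteria written by the upstream roles constrain everything downstream.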

The scribe agent is the most architecturally interesting piece. Context decay — the biggest constraint in single-agent workflows — becomes a system-level problem when multiple agents need to share understanding of decisions, constraints, and domain knowledge across sessions. The scribe maintains a structured knowledge base (built on the same Quartz framework as this digital garden) that agents read from and write to, creating persistent context that no single agent’s context window needs to hold.

This is active experimentation, not shipped production work. I’m documenting it honestly for that reason. But the progression is real: the same principles that made single-agent workflows reliable are proving to be the foundation for multi-agent orchestration. The discipline scales — and the places where it breaks down are teaching me where the next set of guardrails needs to go.

Technologies

Cursor, Claude (Anthropic), Model Context Protocol (MCP), CrewAI, test-driven development workflows.