
Robert Evans
How We're Building Our Agentic Engineering Teams at Gierd

If you've been paying attention to software engineering conversations over the last year, you already know the headline: the work has shifted from writing code to coordinating agents that write code. The Anthropic trends report says it. The conferences say it. Every consultancy with a website says it.
What almost nobody is being specific about is how you actually build the team. Not the agent — the team. Single-agent coordination is solved enough that it doesn't deserve a blog post anymore. Multi-agent teams that hold up under production load, on real customer-facing systems, with real failure costs, are still mostly theory and slide decks.
At Gierd, we already have one agentic engineering team running in production. Our self-healing data pipeline — the system Tony Ojeda wrote about recently — is operated by a small team of specialized agents that detect issues, investigate them, write the fix, and open a pull request for human review. It's been doing real work on real customer data every day, and the lessons from operating it are what's shaping how we're building the rest of our agentic engineering teams.
This post is about that next stage. The agentic teams that will own the broader Gierd platform — the application code, the integrations, the customer-facing surfaces — are in active development. The architecture is designed. The agents, skills, and gates have been validated through isolated tests against real engineering tasks. The full production rollout is the work in front of us, and what we've learned from the data pipeline team is informing every choice we're making.
This post walks through how those teams are structured, what each agent is responsible for, how the work is checked at every handoff, and the principles that have held up through testing.
The Two Primitives: Agents and Skills
Before we get to the team, we need to talk about the two pieces everything else is built from.
An agent is a who, what, and why. Who it is. What it does. Why it makes the calls it makes when a situation gets ambiguous. That's the entire definition. An architect agent and an engineer agent can have access to the exact same information and produce different outputs, because they're different agents — different jobs, different values about what good looks like.
A skill is the how. How you write a Postgres migration in our codebase. How you structure a Stimulus controller the way our system wants it structured. How we name things, how we test things, how we handle errors. A skill is the kind of knowledge a senior engineer carries in their head and applies without thinking — written down, made portable, and made loadable by any agent that needs it.
Skills are how we encode our team's taste and style. The goal is straightforward: any engineer should be able to jump into any part of the codebase and work on it without having to learn a new dialect. That's hard to achieve with humans alone, because taste lives in heads and propagates through code review and tribal memory. With skills, the conventions are written down, applied consistently, and improved in one place. When we tighten a pattern, every agent using that skill picks it up the next time it runs — and the codebase converges instead of drifting.
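To make that concrete, here's a minimal sketch of the two primitives as data structures. This is illustrative Python rather than our actual schema; the field names and skill content are hypothetical. But it captures the division: skills carry the how, agents carry the who, what, and why.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """The how: a convention written down and loadable by any agent."""
    name: str            # e.g. "postgres-migrations"
    instructions: str    # the pattern, written as if briefing a senior hire
    examples: list[str] = field(default_factory=list)  # canonical snippets

@dataclass
class Agent:
    """The who, what, and why: identity, job, and values under ambiguity."""
    role: str
    mandate: str
    values: str
    skills: list[Skill] = field(default_factory=list)

# Same skills, different agents, different outputs: the role and values differ.
migrations = Skill("postgres-migrations",
                   "Migrations are reversible; one concern each; index concurrently.")
architect = Agent("architect", "turn a spec into a technical plan",
                  "flag risk early; prefer boring designs", [migrations])
engineer = Agent("engineer", "implement the plan faithfully",
                 "tests first; match house style", [migrations])
```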
What the Data Pipeline Team Taught Us
The self-healing data pipeline runs on three specialized agents — a Research Agent that investigates issues, a Resolution Agent that writes the fix and opens a pull request, and a Revision Agent that responds to review feedback. It's a focused team with a focused mandate: keep the pipeline healthy, and when something breaks, do the forensic work, write the repair, and put a reviewable pull request in front of an engineer.
That team has been live long enough to teach us a few things that matter for everything we're building next.
The first is that narrow agent responsibilities outperform broad ones, every time. The Research Agent isn't trying to fix anything — it's only trying to understand. The Resolution Agent isn't trying to investigate — it's working from an investigation that's already done. Each agent does one thing, hands off cleanly, and the system as a whole is more reliable because no individual agent is doing too many jobs.
The second is that the structured handoff is where the value compounds. The Research Agent doesn't just produce an answer; it produces a structured root-cause analysis that the next agent can act on. The Resolution Agent doesn't just produce code; it produces a pull request with context for the reviewer. Every handoff is a written artifact, which means every step is auditable, every step is improvable, and every step gives the next party — human or agent — a working starting point instead of a blank page.
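For a sense of what a handoff artifact might look like, here's a hedged sketch of the Research Agent's output. The field names and example values are hypothetical, not our production format; the shape is the point: structured, serializable, and actionable by whoever picks it up next.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class RootCauseAnalysis:
    """Handoff artifact from the Research Agent to the Resolution Agent."""
    issue_id: str
    symptom: str                # what was observed failing
    root_cause: str             # what the investigation concluded
    evidence: list[str] = field(default_factory=list)       # logs, queries, traces consulted
    affected_files: list[str] = field(default_factory=list)
    suggested_direction: str = ""   # direction for the fix, not the fix itself

# Because the handoff is a written artifact, it can be saved, audited,
# and handed to the next party as a working starting point.
report = RootCauseAnalysis(
    issue_id="PIPE-1042",
    symptom="nightly sync dropped a fraction of rows",
    root_cause="upstream feed changed a date format mid-stream",
    evidence=["ingest logs 02:14-02:19", "row-count diff query"],
    affected_files=["pipeline/parsers/dates.py"],
    suggested_direction="tolerate both formats; alert on unknown ones",
)
print(json.dumps(asdict(report), indent=2))
```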
The third is that human-in-the-loop review isn't a temporary phase. The data pipeline team has never pushed a change to production on its own, and that's not a constraint we plan to remove. The agents are fast and thorough; the humans are the judgment layer. That division is what has let us trust the system in production from day one.
Those three lessons — narrow responsibilities, structured handoffs, humans owning judgment — are the foundation of how we're building the agentic engineering teams that will own the broader platform. The data pipeline team is a focused operations system. What we're building next is more ambitious: a team that designs, implements, reviews, and verifies new software end-to-end. The principles are the same. The architecture has to be more sophisticated to handle the wider scope.
The Team: Three Roles, Three Models
Once you have agents and skills as your building blocks, the question becomes how you assemble them into a team. Our approach mirrors what human engineering organizations arrived at decades ago, for the same reason: separate the work by the kind of thinking it requires.
Each of our agentic engineering teams has three working roles, plus a set of stage gates between them that we'll get to in a moment.
The Architect takes a spec and produces a technical plan — the design decisions, the data flow, the tradeoffs, the edge cases worth flagging upfront. Before producing the plan, the Architect reads the relevant parts of the codebase. A plan written in isolation from the existing system is a plan that's going to collide with the existing system at implementation time. Knowing what's already there — the patterns in use, the conventions, the constraints the code is operating under — has to be part of the plan, which means it has to be part of the planning.
The Architect needs a reasoning model, because the cost of a missed implication shows up three steps later in the workflow when a downstream agent is faithfully implementing the wrong thing. We use Opus for the Architect for exactly that reason. The marginal cost of the better model is dwarfed by the upstream cost of a bad plan.
The Engineer takes that plan and writes the code. The Engineer's job is faithfulness to the plan, with skills supplying the patterns to follow. This is execution, not novel reasoning — and Sonnet handles it well, runs faster, and costs less per task. We've also found, through our isolated testing, that engineering output improves measurably when the Engineer follows test-driven development strictly: write the failing test first, write the code that makes it pass, then refactor. Agents are remarkably good at TDD when you require it. The discipline keeps the implementation honest, surfaces ambiguities early, and produces code that's verified by construction rather than after the fact.
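Here's that loop in miniature, sketched in pytest terms. The function under test is hypothetical; the discipline is the point.

```python
# Step 1, red: the Engineer writes the test first and watches it fail.
def test_sku_normalization_strips_vendor_prefix():
    assert normalize_sku("ACME-12345") == "12345"
    assert normalize_sku("12345") == "12345"  # already-normalized input passes through

# Step 2, green: only now does the implementation get written, and only
# enough of it to make the test pass.
def normalize_sku(sku: str) -> str:
    _, sep, rest = sku.partition("-")
    return rest if sep else sku

# Step 3: refactor, with the test as the safety net.
```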
The Code Reviewer is where the architecture gets interesting. A single agent trying to review code holistically is doing too many jobs at once. So we don't do that. The Code Reviewer is actually an orchestrator that runs many small review agents in parallel — one rule per agent, each looking for a single specific class of violation. Naming conventions. Error handling. Test coverage. Security patterns. Logging standards. Each agent is narrow enough to run on Haiku, because the job is small and the context is tight. The orchestrator collects the findings and produces the consolidated review.
This is the inversion most teams miss. They reach for the smartest model on the review pass because review feels like high-stakes thinking. It isn't. Review is a high volume of small, well-scoped checks — exactly the workload a fleet of small models running in parallel is best at. The reasoning happens in the Architect. The execution happens in the Engineer. The review is pattern matching at scale, and pattern matching at scale wants Haiku, not Opus.
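A hedged sketch of the fan-out pattern, assuming the Anthropic Python SDK; the rule set is abbreviated and the model id is illustrative:

```python
import asyncio
from anthropic import AsyncAnthropic  # assumes the Anthropic Python SDK

client = AsyncAnthropic()
HAIKU = "claude-3-5-haiku-latest"  # illustrative; pin a specific model id in practice

# One rule per agent, each narrow enough for a small model and a tight context.
REVIEW_RULES = {
    "naming": "Flag identifiers that violate our naming conventions.",
    "error-handling": "Flag code paths that swallow or under-report errors.",
    "test-coverage": "Flag changed behavior with no corresponding test.",
}

async def run_rule(rule: str, instruction: str, diff: str) -> tuple[str, str]:
    resp = await client.messages.create(
        model=HAIKU,
        max_tokens=1024,
        system=f"You review diffs for exactly one thing: {instruction}",
        messages=[{"role": "user", "content": diff}],
    )
    return rule, resp.content[0].text

async def review(diff: str) -> dict[str, str]:
    # Fan out: every rule runs in parallel; the orchestrator consolidates.
    findings = await asyncio.gather(
        *(run_rule(rule, text, diff) for rule, text in REVIEW_RULES.items())
    )
    return dict(findings)
```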
The result is a team where every model is doing the work it's best at, and nothing is overpaying for cognition it doesn't need.
The Stage Gates: Three Checks, Three Sources of Truth
A team of agents doing engineering work isn't, by itself, production-grade. The agents will do excellent work on a well-formed handoff and produce confident garbage on an under-formed one. The variance is the problem — and the handoffs are where the variance lives.
Three stage gates address that, each doing different work against a different source of truth.
The Plan Reviewer runs after the Architect and before the Engineer. Its job is to read the plan against the spec and find what's missing. Did the Architect address every requirement? Did they make assumptions that should have been escalated? Is there a constraint in the spec that doesn't show up in the plan? The Plan Reviewer is the gate that prevents the most expensive class of failure — the Engineer faithfully implementing a plan that was already wrong. Catching it here costs a re-plan. Catching it after implementation costs a rewrite.
The Plan Reviewer is also where our ambiguity scanner runs. The scanner reads the Architect's plan and surfaces the assumptions a model would silently make — places where the plan is committing to something the spec didn't actually decide. We run it specifically on the plan because the plan is the highest-stakes input in the entire workflow. Every downstream agent acts on the plan. A silent assumption baked in here propagates through implementation, review, and verification before anyone sees it. The industry has converged on the view that weak input is the root cause of most agent failures, and context engineering is the discipline that addresses it. The scanner is how we operationalize that, at the moment in the workflow where it matters most.
When the scanner surfaces an ambiguity, the question goes to the user, not back to the Architect. The user resolves it or accepts it on the record, and the work moves forward only after every flagged assumption has an answer. No agent downstream of the plan ever acts on a silent assumption.
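In control-flow terms, the gate looks something like the sketch below. The scan itself is model-backed in practice and stubbed here, and the names are illustrative. What matters is the routing: flagged assumptions go to the user, and nothing proceeds until each one has an answer.

```python
from dataclasses import dataclass

@dataclass
class Ambiguity:
    location: str      # where in the plan the silent assumption lives
    assumption: str    # what the plan commits to
    spec_gap: str      # what the spec actually left undecided

def scan_plan(plan: str, spec: str) -> list[Ambiguity]:
    # Model-backed in practice: ask a model to list every place the plan
    # decides something the spec didn't. Stubbed here.
    return []

def gate_plan(plan: str, spec: str, ask_user) -> str:
    # Flagged questions go to the user, not back to the Architect.
    for amb in scan_plan(plan, spec):
        answer = ask_user(
            f"{amb.location}: the plan assumes {amb.assumption!r}, "
            f"but the spec left this open ({amb.spec_gap}). Confirm or correct?"
        )
        plan += f"\n[Resolved assumption] {amb.location}: {answer}"
    return plan  # only now does the Engineer act on it
```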
The Implementation Reviewer runs after the Engineer and reads the implementation against the plan. Did the Engineer build everything in the plan? Did they add things that weren't in the plan? This is the gate that catches drift between what was decided and what was built. It's a different question than code review — code review asks whether the code is good; implementation review asks whether the code is what was asked for. Two different questions, two different gates.
The Spec Reviewer is the final pass, and the one that matters most for closing the loop. It reads the finished implementation against the original spec — the same spec the Architect started from. The point is to verify that what shipped actually answers what was asked for. The Architect could have built a plan that answered most of the spec. The Plan Reviewer could have approved it. The Engineer could have implemented it perfectly. The Implementation Reviewer could have approved that. And the thing that ships could still miss the original ask, because every gate in the middle was checking against the gate before it instead of against the source of truth. The Spec Reviewer goes back to the source.
Three gates, three different questions, three different sources of truth. Plan against spec. Implementation against plan. Implementation against spec.
When any of those gates finds an issue — a missed requirement, a drift from the plan, a gap from the original spec — it doesn't just flag and stop. It generates a structured report explaining exactly what was found, where, and why it's a problem, saves that report to disk, and hands it back to the agent responsible for the work. The Architect gets a report from the Plan Reviewer. The Engineer gets a report from the Implementation Reviewer. The agent reads the report, addresses the findings, and the work re-enters the gate. It's the same loop a human engineering team runs in code review — written feedback, addressed feedback, re-review — except the cycle time is minutes instead of days.
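A minimal sketch of that loop, with hypothetical gate and agent objects standing in for the real ones:

```python
import json, time
from pathlib import Path

def run_gate(gate, producer, artifact, reports_dir="gate-reports"):
    """Run a gate on an artifact; on findings, write a structured report to
    disk, hand it back to the responsible agent, and re-enter the gate."""
    while True:
        findings = gate.check(artifact)       # e.g. plan-vs-spec discrepancies
        if not findings:
            return artifact                   # gate passed; work moves forward
        report = {
            "gate": gate.name,
            "timestamp": time.time(),
            "findings": findings,             # what, where, and why it matters
        }
        path = Path(reports_dir) / f"{gate.name}-{int(time.time())}.json"
        path.parent.mkdir(exist_ok=True)
        path.write_text(json.dumps(report, indent=2))
        artifact = producer.address(artifact, report)  # agent revises, loop repeats
```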
Automated Testing Is the Other Half of the Story
There's a temptation, when you're building a system like this, to treat the agentic team itself as the whole accomplishment. It isn't. The team is only as trustworthy as the tests that run against what it produces.
This is the part of agentic engineering most teams underinvest in, and it's a mistake. When agents are writing code at the pace agents write code, the bottleneck shifts almost immediately to verification. A team that ships a hundred pull requests a week needs a testing infrastructure that can keep up with a hundred pull requests a week, and the only way to get there is to automate end-to-end user testing aggressively.
Playwright is one of the tools we use for this, alongside other testing tools that exercise the system from the customer's perspective rather than testing functions in isolation. The Engineer writes these tests as part of the TDD loop. The test suite runs in CI on every change. The Implementation Reviewer factors test results into its review. When the suite catches a regression, the report goes back to the Engineer the same way any other gate finding does.
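As an example of what this looks like in practice, here's a hedged sketch using Playwright's Python API. The URL, selectors, and flow are illustrative, not our actual product:

```python
from playwright.sync_api import sync_playwright

def test_customer_can_publish_a_listing():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Walk the same path a customer would, end to end.
        page.goto("https://app.example.com/listings/new")
        page.fill("#title", "Blue widget")
        page.fill("#price", "19.99")
        page.click("text=Publish")
        # Assert from the customer's perspective: the listing is visible.
        assert page.locator("text=Blue widget").first.is_visible()
        browser.close()
```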
The principle is simple: if a customer can do it in your product, a test should do it automatically, and that test should run on every change. The agents are happy to write those tests. The discipline of demanding them is on you.
Humans Stay in the Loop
It's worth being explicit about this: when the agentic engineering teams go live, they won't ship to production on their own. Every change the system produces will be reviewed by an experienced engineer before it merges. The agents are fast and thorough, and the gates catch a lot in testing — but human judgment is the final pass on anything that touches a customer. That's not a temporary stage we plan to grow out of. It's the design.
Our engineers' role is shifting along with the architecture. Instead of writing the code, they'll be reviewing the work. Instead of investigating from scratch, they'll be evaluating an analysis the system already produced. Their time goes to judgment calls, edge cases, architectural direction, and the kind of cross-cutting decisions that benefit from a human who's seen the whole system. The repetitive labor underneath — the implementation work, the rule-by-rule review, the test-by-test verification — gets handled by the team.
We're also building observability into every agent from day one. Every decision, every tool call, every output is traceable. When something goes wrong, and things will go wrong, we can see exactly what each agent did and why it made the choices it made. This is the difference between debugging a system and guessing about a system.
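A minimal sketch of the idea, assuming structured logs as the trace sink; a real deployment would export to a proper tracing backend:

```python
import functools, json, time, uuid

def traced(agent_name: str):
    """Record every tool call an agent makes: who called what, with which
    arguments, and how it ended. Prints structured logs as a stand-in for
    a real tracing exporter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"agent": agent_name, "tool": fn.__name__,
                    "span_id": uuid.uuid4().hex, "start": time.time(),
                    "args": repr((args, kwargs))}
            try:
                result = fn(*args, **kwargs)
                span["status"] = "ok"
                return result
            except Exception as exc:
                span["status"] = f"error: {exc}"
                raise
            finally:
                span["end"] = time.time()
                print(json.dumps(span))
        return wrapper
    return decorator

@traced("research-agent")
def query_logs(service: str, window: str) -> list[str]:
    return []  # hypothetical tool; body elided
```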
The Principle Underneath
Most engineering organizations are still figuring out what AI in their workflow even looks like. The honest answer is that single-agent coordination — one model, one chat, one developer — is the easy version. It's also the version that's going to age fastest.
The teams that hold up are the ones built like teams: separated roles, matched models, gated handoffs, automated verification, and humans owning the judgment calls. That's the foundation. Everything else is built on top of it.
We're going to keep extending this. There are categories of engineering work the architecture doesn't cover yet. There are gates that can get sharper. There's always more to automate in testing. The isolated tests have given us conviction that the structure holds. The rollout is the work in front of us, and it's the work we're doing — building the foundation that the platform our customers depend on will run on.