How We Built a Self-Healing Data Pipeline at Gierd

If you sell across Amazon, Walmart, Target, eBay, Best Buy, and a half-dozen other marketplaces, you already know the quiet truth of modern commerce: your business is only as good as your data. Decisions about pricing, inventory, advertising spend, and forecasting all rest on numbers flowing in from dozens of different systems — each with its own cadence, its own quirks, and its own failure modes.

When that data goes stale, incomplete, or quietly wrong, the consequences ripple fast. A pricing team works from week-old inventory counts. A finance team reconciles against payouts that never landed. An operations team optimizes for a channel whose feed has been silently failing for three days.

At Gierd, we pull together a lot of data from a lot of places on behalf of our customers: orders, transactions, inventory, traffic, advertising, and financials, all landing in a unified schema our customers run their businesses on. Keeping all of that flowing cleanly is a full-time job. So we built a system that does that job for us. It checks the pipeline throughout the day, catches problems within hours of their appearance, and increasingly fixes them on its own.

We call it our self-healing data pipeline. It's made up of two components: Gierd Diagnostics and Gierd Resolve.

Gierd Diagnostics: The Watchtower

Gierd Diagnostics is the monitoring layer. Four times a day, it runs across every active customer project and inspects every feed we ingest — a thorough, methodical sweep from end to end of the pipeline.

At each checkpoint, it asks four straightforward but critical questions of every feed:

Is the data there? Are the expected tables populated at all?
Is it recent? Has anything suddenly stopped flowing?
Is it complete? Does today's data look reasonable compared to history?
Is the source healthy? Are any of our marketplace integrations returning errors or timing out?
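
As a rough illustration, the four questions could be expressed as four small checks on a per-feed summary. This is a minimal sketch and not Gierd's actual implementation: the FeedSnapshot fields, the failure labels, and the min_ratio and max_errors defaults are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FeedSnapshot:
    """Hypothetical summary of one feed at check time."""
    row_count: int              # rows currently in the expected table
    last_loaded_at: datetime    # timestamp of the newest record
    todays_rows: int            # rows received for the current day
    typical_daily_rows: float   # historical average for a comparable day
    recent_api_errors: int      # integration errors or timeouts since the last run


def check_feed(snap, now, max_age, min_ratio=0.5, max_errors=0):
    """Return the names of the checks that failed for this feed."""
    failures = []
    # Is the data there?
    if snap.row_count == 0:
        failures.append("missing")
    # Is it recent?
    if now - snap.last_loaded_at > max_age:
        failures.append("stale")
    # Is it complete? Compare today's volume with history.
    if snap.typical_daily_rows > 0 and snap.todays_rows < min_ratio * snap.typical_daily_rows:
        failures.append("incomplete")
    # Is the source healthy?
    if snap.recent_api_errors > max_errors:
        failures.append("source_unhealthy")
    return failures
```

In this sketch max_age is passed in by the caller, so the staleness cutoff can differ for every feed instead of being one fixed rule.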

What makes Gierd Diagnostics particularly sharp is that it doesn't rely on one-size-fits-all rules. Marketplace data doesn't behave uniformly — some feeds arrive every hour, some land once a day, and some marketplaces have structural lag baked in. Amazon's traffic reports, for example, consolidate on a three-day delay by design. A static "updated in the last 24 hours" check would either miss real outages or fire false alarms constantly.

To address this, the system computes adaptive staleness thresholds per customer and per data type, based on how that data actually arrives in practice. A three-day delay might be entirely normal for one feed and a serious outage for another. The system learns the rhythm of each source and only alerts when something falls outside it.
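
One way to picture an adaptive threshold is to derive it from the gaps between past arrivals of the same feed. The sketch below is an assumption about how such a rule could work, not a description of the production logic: the 95th-percentile choice, the margin multiplier, and the floor value are illustrative.

```python
from datetime import datetime, timedelta
from statistics import quantiles


def staleness_threshold(arrivals, margin=1.5, floor=timedelta(hours=1)):
    """Derive a per-feed staleness threshold from past arrival times.

    arrivals: chronologically sorted datetimes of previous loads.
    Returns None when there is too little history to learn a rhythm.
    """
    gaps = [(b - a).total_seconds() for a, b in zip(arrivals, arrivals[1:])]
    if len(gaps) < 2:
        return None
    p95 = quantiles(gaps, n=20)[-1]  # last of 19 cut points = 95th percentile
    return max(timedelta(seconds=p95 * margin), floor)


def is_stale(last_arrival, now, threshold):
    """A feed is stale only when its silence exceeds its own learned threshold."""
    return threshold is not None and now - last_arrival > threshold
```

With a rule like this, a feed that normally lands every three days tolerates a three-day silence, while an hourly feed that goes quiet for the same period is flagged.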

When it does find something off, it's disciplined about how it tells us. Instead of spamming a channel, it groups related failures into a single logical issue (for instance, one Data Stale issue for Amazon Inventory covering 14 customers). It then posts one Slack notification and opens one ticket on our Triage board in Linear (our engineering project tracker) with a severity level, the list of affected accounts, and all the context needed to investigate. As long as the issue persists, the ticket stays open and each run appends which accounts are still affected. When the issue resolves, the ticket closes itself.
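
The grouping and ticket lifecycle described above can be sketched with plain data structures. This example deliberately makes no Slack or Linear API calls. The issue key of (check, feed), the customer names, and the function names are hypothetical, chosen only to show the shape of the logic.

```python
from collections import defaultdict


def group_failures(failures):
    """Group per-customer failures into one logical issue per (check, feed).

    failures: iterable of (customer, check, feed) tuples from one run.
    """
    issues = defaultdict(set)
    for customer, check, feed in failures:
        issues[(check, feed)].add(customer)
    return dict(issues)


def reconcile(open_tickets, issues):
    """Compare this run's issues with already-open tickets.

    open_tickets: set of issue keys that currently have an open ticket.
    Returns (to_open, to_update, to_close) as sorted lists of issue keys.
    """
    current = set(issues)
    to_open = sorted(current - open_tickets)    # new issue: one ticket, one notification
    to_update = sorted(current & open_tickets)  # still failing: append affected accounts
    to_close = sorted(open_tickets - current)   # no longer failing: close the ticket
    return to_open, to_update, to_close
```

Keying tickets on the issue and not on the customer is what keeps fourteen affected accounts from turning into fourteen alerts.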

That discipline matters. The fastest way to make humans ignore alerts is to send too many of them. Grouped, self-closing alerts are already a meaningful improvement over the old way of doing things, which often involved a customer noticing something looked wrong in a report and reaching out to ask. But we didn't want to stop there.

Gierd Resolve: Investigation and Repair

A triage ticket used to mean the same thing every pipeline alert means everywhere else: a human engineer stops what they're doing and starts investigating the issue. Depending on the problem, that could take anywhere from twenty minutes to a full afternoon, most of it forensic work: tracing lineage, checking recent runs, and comparing the broken state to historical behavior.

Once Gierd Diagnostics has created a ticket, Gierd Resolve takes over. It's a small team of specialized AI agents, each with narrow responsibilities, that collectively turn a freshly opened ticket into, in many cases, a finished fix waiting for human approval.

Here's how the work is divided:

The Research Agent handles investigation. When a new ticket lands, it reads the alert, queries the underlying data to understand the scope and timeline of the issue, pulls the relevant model source from GitHub, traces the data lineage back through our transformations to figure out where the problem originated, and produces a structured root-cause analysis. It posts that analysis back onto the ticket so the next party — human or agent — who picks it up starts with a working hypothesis rather than a blank page. This step typically takes three or four minutes and costs less than a dollar.

The Resolution Agent handles the fix. For the class of issues that can be addressed by updating our data transformation code, the Resolution Agent implements the change. It modifies the code, validates it against our standards (Does it still parse? Does it still compile? Does it follow our formatting rules?), and opens a pull request for our data engineers to review.

The Revision Agent handles review feedback. When a data engineer reviews a pull request and leaves comments — "this edge case isn't handled," "rename this column," "add a test here" — the Revision Agent reads that feedback and makes the requested changes automatically. It can go through up to two rounds of revisions on its own before handing things back to a human for a final decision.
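
The handoff between the Revision Agent and a human reviewer amounts to a small routing rule. The sketch below assumes hypothetical state labels ("approved", "changes_requested") and return values. Only the two-round limit comes from the description above.

```python
MAX_AUTO_REVISIONS = 2  # up to two rounds of automatic revisions


def next_step(review_state, revisions_done, max_revisions=MAX_AUTO_REVISIONS):
    """Decide who acts next on a pull request after a review.

    review_state: "approved" or "changes_requested" (hypothetical labels).
    revisions_done: how many automatic revision rounds have already run.
    """
    if review_state == "approved":
        return "ready_to_merge"          # a human has already signed off
    if review_state == "changes_requested":
        if revisions_done < max_revisions:
            return "revision_agent"      # agent applies the requested changes
        return "hand_back_to_human"      # limit reached: a human makes the final call
    raise ValueError(f"unknown review state: {review_state!r}")
```

Capping the loop means a pull request that is not converging ends up in front of a person instead of cycling indefinitely.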

The full lifecycle — from the pipeline detecting an issue to a proposed fix sitting in a pull request with an explanation of what went wrong and why the fix works — often completes in minutes. Issues that might have sat in a backlog for days get investigated immediately, and the cost to run the agents on a given issue is a small fraction of the engineering time they replace.

Humans Stay in the Loop

One point worth being explicit about: our AI agents don't push changes to production on their own. Every fix the system drafts is reviewed by an experienced data engineer before it merges. The agents are fast and thorough, but they're also bounded — they operate under strict cost and scope limits, they only touch the systems they're allowed to touch, and every change they make is visible, reviewable, and reversible.
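
Cost and scope limits of this kind can be enforced with a small guard object that every agent action passes through. This is a sketch under assumptions: the Guardrails class, the dollar budget, and the glob-style path allow-list are illustrative and do not describe the actual controls.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class Guardrails:
    """Hypothetical per-run limits for one agent."""
    max_cost_usd: float     # hard budget for a single run
    allowed_paths: tuple    # glob patterns the agent may modify
    spent_usd: float = 0.0

    def charge(self, cost_usd):
        """Record spend, refusing any step that would exceed the budget."""
        if self.spent_usd + cost_usd > self.max_cost_usd:
            raise RuntimeError("cost limit reached; stopping agent run")
        self.spent_usd += cost_usd

    def check_path(self, path):
        """Refuse edits to files outside the agent's allowed scope."""
        if not any(fnmatch(path, pattern) for pattern in self.allowed_paths):
            raise PermissionError(f"path outside agent scope: {path}")
```

Raising an exception, as opposed to logging a warning, makes the limit a hard stop: the run ends and the ticket stays with a human.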

Our engineers aren't out of the loop; their role has shifted. Instead of writing the fix, they're reviewing one. Instead of investigating a problem from scratch, they're evaluating an analysis. Their time goes to judgment calls, edge cases, and new work, while the system handles the repetitive investigative labor underneath.

What This Means for You

If you're a Gierd customer, most of this is invisible to you — and that's the point. Behind the scenes, we've built a pipeline that:

Catches problems fast. Most issues are identified and triaged within hours of appearing, long before they show up in a report you're reading.
Fixes common problems faster. Many fixes land the same day, sometimes within minutes of being detected.
Keeps a detailed paper trail. Every alert, investigation, and fix is logged and auditable — both as a training signal for future improvements and as hard evidence of what's actually happening inside your marketplace operations over time. Most teams guess at that. We can show you.
Scales with our customer base. Adding more marketplaces, more accounts, and more data sources doesn't mean proportionally more manual oversight — the system grows with us.

The end result is a simple promise: when you look at your data in Gierd, you can trust that it's current, it's complete, and that if something does slip, we almost certainly knew about it before you did — and were already on it.

The Principle Underneath

Most data teams are stuck in a reactive posture. A dashboard looks wrong, someone asks a question, an engineer goes looking. The self-healing pipeline flips that posture: the system itself is responsible for noticing, diagnosing, and proposing fixes, and humans do what humans are best at — judgment and review.

We're going to keep building on this. There are categories of issues Resolve can't yet touch and edges where the diagnostic rules still need to learn. The next frontier is giving the agents more kinds of problems they can safely resolve on their own, and making the whole system even better at explaining what it found and why. But the foundation is in place — and it's already doing real work every day.

Stale data shouldn't be something our customers have to worry about. With Gierd Diagnostics and Gierd Resolve on the job, it isn't.