Last week I read Addy Osmani’s piece on long-running agents. Halfway through I stopped and went back to the top, because something kept pinging me.
Long running agents — agents that keep working for hours or days, across many sessions, recovering from failure, leaving artifacts behind, picking up where they left off — are all surprisingly similar, architecturally. Anthropic uses an approach called “brain / hands / session”: separate the model loop from the execution sandbox from the durable event log. Cursor calls it “planner / worker / judge”. The Anthropic harness team calls it “plan / generate / evaluate”. Three independent efforts, three different vocabularies, but really the same architecture.
I read that and thought: that’s just Assess, Decide, Do, but with different words.
Not “ADD-flavored”. Not “looks a bit like ADD if you squint”. The exact same shape. Assess / Decide / Do, with a verifier. The framework I’ve been using for fifteen years to manage my own scattered attention is the same framework the labs are landing on for autonomous agents.
So I started designing.
This post is about the result so far — a long running agent harness built on ADD, layered onto AIGernon. Implementation has only just started and, at the time of writing, is still incomplete. I'll come back with another post when I have meaningful results, hopefully soon.
I’m writing this now because the design itself surfaced two things worth sharing: a concept I hadn’t seen before in any of the agent literature (I’m calling it Substrate), and a structural answer to one of the unsolved problems Addy lists at the end of his post.
The Three Walls Every Long Running Agent Hits
Every long running agent hits the same three walls.
Finite context. Even a million-token window fills up. And context rot — the steady degradation as the window gets full — kicks in well before the hard limit. A 24-hour run is not going to fit in any context window the field has on its roadmap.
No persistent state. A new session starts blank. The cleanest framing I’ve seen is from Anthropic: imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift. Without persistence, every shift change is a productivity disaster.
No self-verification. Models reliably skew positive when grading their own work. Asked “are you done?” they answer “yes” more often than they should. Without an independent check, the agent ships at 30% complete with full confidence.
The three converged architectures above are mostly answers to these three walls. Separate the model loop from the sandbox from the log (solves persistence). Compact and reset context aggressively (solves window). Use a different prompt or a different model to grade the work (solves self-verification).
Why ADD Already Has This Baked In
ADD divides cognition into three realms:
- Assess — evaluate the situation, generate options, brainstorm, daydream, define what done looks like
- Decide — allocate resources, sequence steps, plan the how and with what
- Do — implement the plan, pure execution
When you read Cursor’s planner / worker / judge, the planner is Assess. The worker is Do. And the judge — well, this is where it got interesting.
While designing this, I had to correct two assumptions I’d been making.
First, Decide is not verification. Decide is resource allocation and planning — the how, not the grading. I’d been mentally collapsing “judge” into Decide, which would have mapped wrong onto every harness pattern in the literature. Decide commits to a path. It doesn’t grade the path after.
Second, verification isn’t a fourth realm. It’s a transition. I’m calling it Ratify: when Do reports completion, you compare the produced artifact against the criteria committed during Assess. Pass means the unit gets archived. Fail means… and this is the load-bearing part… fail means back to Assess. Not back to Do.
That single rule is the structural answer to the dominant long running agent failure mode. The dominant failure isn’t that Do messes up. It’s that the goal was misunderstood, and the misunderstanding compounded silently for hours. If you re-run Do on a Ratify failure, you make the misunderstanding cheaper to repeat. If you re-run Assess, you re-question the premise. Cheap to do, expensive to skip.
ADD already predicts this. There’s a thing in the original framework called the cascade: poor Assess produces poor Decide produces poor Do. One direction. The cascade theory tells you exactly where to spend your tokens (Assess) and exactly where to look when something fails (upstream of where it broke). The harness inherits this for free.
Substrate, The New Piece
AIGernon, as it is now, already has a surface-level Assess-Decide-Do: it maps onto your conversations and can give feedback about where you are in your mental journey. With Substrate, this goes well beyond that surface mapping.
The convergent architectures all keep an event log — the append-only mechanical record of every model call, tool call, and observation. That’s the persistence layer. It’s how a fresh sandbox calls wake(sessionId) and reconstitutes the run.
But event logs are noisy. They’re great for replay. They’re terrible for orientation. If you wake up tomorrow and want to know why the agent did what it did yesterday, scrolling four thousand events is not the answer.
The artifacts are the other extreme. Criteria, plan, output — these are crystallized. They tell you the contract, the program, and the result. They don’t tell you what was tried, what surprised, what was rejected, what was learned, where there was hesitation.
So I added a third document per unit. Substrate. One file, Markdown, narrative, written by all three realms at meaningful moments. Not on every event. At realm entry, at decision points, at realm exit. The journey, not the trace.
Substrate sits between the artifacts and the event log. The connective tissue. Criteria and plans crystallize out of it. The event log, which is still very useful, is like a mechanical shadow of the substrate.
Here’s what Substrate unlocks, in the order that matters most:
Re-Assess that targets root cause, not symptom. When Ratify fails, re-Assess reads the Substrate and learns why — “Do struggled at step 3 because Decide’s plan assumed X, but the codebase actually does Y.” Now re-Assess can fix the premise, not retry the symptom. Without Substrate, re-Assess is blind. With it, it’s diagnostic.
Compaction with a preserved anchor. Raw tool calls can be pruned aggressively now because the Substrate already says “tried X, hit Y, switched to Z.” Substrate is what makes the event log compressible without losing what actually happened.
Cross-unit learning. A library of Substrates is the agent’s memoir. Criteria are too dry to learn from. Event logs are too noisy. Substrates sit at the right granularity for “last time I built a feature like this, what was the journey?”
Sixty-second human handoff. When a long running task pauses for human input, the human reads the Substrate, not a JSON pile.
Alignment drift defense. Each unit’s Substrate anchors what the goal meant in original context. When summaries-of-summaries lose fidelity, the original Substrate is the source of truth.
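The compaction point above can be sketched directly: once a Substrate entry narrates a span of the run, the raw tool-call events in that span become droppable. `compact` and the event shape are illustrative assumptions.

```python
# Sketch of substrate-anchored compaction: prune raw tool-call events
# that a Substrate entry already summarizes, keep everything else.
def compact(events: list, covered_ids: set) -> list:
    """Drop tool_call events whose ids a Substrate entry already covers."""
    kept = []
    for ev in events:
        if ev["type"] == "tool_call" and ev["id"] in covered_ids:
            continue  # the Substrate already says "tried X, hit Y, switched to Z"
        kept.append(ev)
    return kept
```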
I haven’t seen this in any of the agent literature I’ve read. The closest things are Anthropic’s CLAUDE.md (a living plan the agent edits as it learns) and CHANGELOG.md (portable lab notes). Substrate is closer to a per-unit version of both, written by the active realm as it works.
What’s Next
As I said, implementation has started: there's a supervisor that was prepared with the long running agent theory and trained in Assess / Decide / Do. Basically, a coding session on steroids.
The supervisor itself runs ADD on its own work. Each implementation session is one Assess → Decide → Do → Ratify cycle. The progress log — the supervisor’s running journey document — is itself a Substrate. Dogfooding, but also an early stress test. If the supervisor can’t run its own development with ADD, the framework needs revision.
I’ll write up what I learn as I go. The honest question is not whether the architecture is right — I’ve spent enough time with both the agent literature and the ADD framework to be reasonably confident about the shape. The question I’m genuinely curious about is whether the cascade theory and the Substrate concept hold up under real-life pressure. We’ll find out.
If you want to follow along, AIGernon is on GitHub; the harness work goes in there. The UI at Aigernon.app should follow along as well: you can sign in with Google and test it yourself as it evolves.
And if you’ve been thinking about the same problem from a different angle, I want to hear about it. The convergence the labs landed on suggests there’s a right answer here. ADD might be one description of it.