Quality is decided by the harness around the model, not the model itself. How context layering, executable knowledge, hooks, and verification loops raise a team's floor — with the MCP server we built, in code.
Harness Engineering — What Makes the Same Model Behave Differently
> Part 4 of "The Evolution of Driving LLMs." ① Prompting · ② Vibe coding · ③ Agents · ④ Harness engineering · ⑤ Open models.
The homework Part 3 left was clear — agents are powerful but dangerous without guardrails. The stage that does that homework is harness engineering. The core claim: quality is decided by the harness around the model, not the model itself. Same model — the system around it parts the results.

The harness = the system around the model
The harness is every mechanism outside the model: what context to inject, which tools to allow, where to stop it, how to verify the result. The parts of a good harness look roughly like this.
- Context layering — split knowledge into Global (company-wide), Domain (per team/business), and Local (per repo), and inject only what's needed. You don't hand a new hire the entire wiki at once.
- Executable knowledge — keep guides not as documents but in a form that reads as a manual to a human and as a system prompt to an LLM. Fix one place and everyone's agent behavior changes.
- Hooks — intercept and correct specific actions. E.g. trying to commit to main gets blocked and routed to a feature branch.
- Verification loops — generate → read-only critique → regenerate, with the pass bar pinned in code.
A slice of the harness we built — MCP self-call
That sounds abstract, so here's something we actually shipped. We attached an MCP server so external agents can use Trail Studio as a tool. The most important design decision: the MCP tools don't reimplement business logic. Instead, they call our own REST routes in-process (self-call). REST is the single source of truth.
# An MCP tool = a thin adapter. It doesn't duplicate logic; it self-calls its own REST route.
async def studio_create_cardnews(topic: str, ...) -> dict:
principal = current_principal() # API key → workspace auth
token = create_access_token(principal) # short-lived JWT
# call our own app's REST in-process via httpx ASGITransport
resp = await self_post("/api/cardnews-llm", json={...},
headers={"Authorization": f"Bearer {token}"})
return {"job_id": resp["id"]} # credits, verification, trace owned solely by REST
The reason this design holds up is clear: zero double credit charges, zero duplicated business logic, one source for verification and trace. Add more tools and you only touch REST in one place. Whatever the model does, the safety mechanisms are owned solely by the harness.
Hooks follow the same philosophy. Risky features live behind a default-OFF flag, and when OFF they're byte-identical to the existing behavior. You separate turning on a new capability from not breaking the old one.
Governance that raises the floor
The real effect of harness engineering is raising a team's floor. Package your strongest engineer's workflow (lint rules, branch strategy, verification steps) as a plugin or skill and ship it, and that discipline gets laid down automatically no matter who's working. It becomes the most powerful, modern governance tool you have.
And you can take a harness verified locally straight to production (dev-prod parity). Instead of standing up a separate RAG server and tuning scores, the context you eyeballed runs as-is.
To sum up
Harness engineering isn't "swap in a better model" — it's "design around the model." Layer the context, make knowledge executable, correct with hooks, pin verification in code. That's the path to safer, more consistent results on the same model.
But a good harness has a side effect: every run leaves structured data behind. What if you could use that data to tune a model to your own domain? The final part: open models.
More posts

Open Models — Owning the Model Layer Itself
The final stage of the evolution — own the model instead of renting it. Nous Research Hermes (4.3, open weights), the Hermes function-calling standard, and the data flywheel a harness produces to tune a domain model — with code.

Agents — When the Model Started Running the Loop
OpenClaw (formerly Moltbot) put autonomous agents in the spotlight — the model uses tools, observes results, and runs the loop itself. The perceive→plan→act→observe loop, and the permission/HITL problem broad autonomy brings — with code.

Vibe Coding — Erasing the Friction of Building
Vibe coding, the term Karpathy popularized — describe the intent and the LLM writes the code. What it solved, where it breaks, and how to bolt a verification loop (isolate, run, check) onto it — with code.