
Quality is decided by the harness around the model, not the model itself. How context layering, executable knowledge, hooks, and verification loops raise a team's floor — with the MCP server we built, in code.

· TRAIL Labs
Tags: Harness Engineering, MCP, Agents, Governance

Harness Engineering — What Makes the Same Model Behave Differently

> Part 4 of "The Evolution of Driving LLMs." ① Prompting · ② Vibe coding · ③ Agents · ④ Harness engineering · ⑤ Open models.

Part 3 left us with clear homework: agents are powerful but dangerous without guardrails. Harness engineering is the stage that does that homework. The core claim is that quality is decided by the harness around the model, not by the model itself. With the same model, the system around it determines how the results turn out.

Figure: the harness around a model, in which scoped tools, hooks, layered context, and verification loops make the same model's output safe and consistent.

The harness = the system around the model

The harness is every mechanism outside the model: what context to inject, which tools to allow, where to stop it, how to verify the result. The parts of a good harness look roughly like this.

  • Context layering: split knowledge into Global (company-wide), Domain (per team or business), and Local (per repo), and inject only what the task needs. You don't hand a new hire the entire wiki at once.
  • Executable knowledge: keep guides not as static documents but in a single form that a human reads as a manual and an LLM reads as a system prompt. Edit it in one place and every agent's behavior changes with it.
  • Hooks: intercept and correct specific actions. For example, an attempt to commit to main is blocked and rerouted to a feature branch.
  • Verification loops: generate → read-only critique → regenerate, with the pass bar pinned in code.
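
To make the last bullet concrete, here is a minimal sketch of a verification loop. It is an illustration under assumptions, not our production code: `PASS_BAR`, `MAX_ROUNDS`, the `Critique` type, and the two `fake_*` functions are hypothetical names, and the fakes stand in for real model calls so the loop runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

PASS_BAR = 0.8    # hypothetical threshold, pinned in code rather than in a prompt
MAX_ROUNDS = 3    # hypothetical retry budget


@dataclass(frozen=True)
class Critique:
    score: float    # 0.0 to 1.0, assigned by the critic
    feedback: str   # what to fix in the next round


def verification_loop(
    task: str,
    generate: Callable[[str, str], str],
    critique: Callable[[str, str], Critique],
) -> Tuple[str, Critique, int]:
    """Generate, critique read-only, and regenerate until the pass bar is met.

    Returns the last draft, its critique, and the number of rounds used.
    """
    feedback = ""
    for round_no in range(1, MAX_ROUNDS + 1):
        draft = generate(task, feedback)
        result = critique(task, draft)   # the critic only reads; it never edits the draft
        if result.score >= PASS_BAR:
            return draft, result, round_no
        feedback = result.feedback
    return draft, result, MAX_ROUNDS


# Stand-ins for model calls so the sketch runs without a model.
def fake_generate(task: str, feedback: str) -> str:
    return f"{task} [tests added]" if "tests" in feedback else task


def fake_critique(task: str, draft: str) -> Critique:
    if "[tests added]" in draft:
        return Critique(score=0.9, feedback="")
    return Critique(score=0.4, feedback="add tests")


draft, result, rounds = verification_loop("refactor parser", fake_generate, fake_critique)
print(draft, result.score, rounds)
```

The point of the shape is that the threshold and the retry budget are constants the model cannot talk its way around. The critic returns a score and feedback, and only the loop decides whether a draft passes.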

A slice of the harness we built — MCP self-call

That sounds abstract, so here's something we actually shipped. We attached an MCP server so external agents can use Trail Studio as a tool. The most important design decision: the MCP tools don't reimplement business logic. Instead, they call our own REST routes in-process (self-call). REST is the single source of truth.

# An MCP tool = a thin adapter. It doesn't duplicate logic; it self-calls its own REST route.
async def studio_create_cardnews(topic: str, ...) -> dict:
    principal = current_principal()                  # API key → workspace auth
    token = create_access_token(principal)           # short-lived JWT
    # call our own app's REST in-process via httpx ASGITransport
    resp = await self_post("/api/cardnews-llm", json={...},
                           headers={"Authorization": f"Bearer {token}"})
    return {"job_id": resp["id"]}                     # credits, verification, trace owned solely by REST

This design holds up for three reasons: credits are never charged twice, business logic is never duplicated, and verification and tracing have a single source. However many tools you add, a change to the logic touches only the REST route. Whatever the model does, the safety mechanisms are owned solely by the harness.
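
The self-call idea can be shown end to end in a dependency-free sketch. Our adapter goes through httpx's ASGITransport; the version below instead hand-rolls the ASGI messages with only the standard library, which is a swapped-in technique chosen so the example runs anywhere. Everything except the route path is an assumption for illustration: `rest_app`, `CREDITS`, the workspace id, and the job id are hypothetical, and authentication is left out.

```python
import asyncio
import json

CREDITS = {"ws-1": 10}   # hypothetical in-memory credit ledger, owned by the REST layer


async def rest_app(scope, receive, send):
    """A tiny ASGI app standing in for the REST route; it alone charges credits."""
    body = b""
    while True:
        message = await receive()
        body += message.get("body", b"")
        if not message.get("more_body", False):
            break
    payload = json.loads(body)
    CREDITS[payload["workspace"]] -= 1   # the only place a credit is charged
    response = json.dumps({"id": "job-1", "topic": payload["topic"]}).encode()
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"application/json")]})
    await send({"type": "http.response.body", "body": response})


async def self_post(app, path: str, payload: dict) -> dict:
    """Call an ASGI app in-process: no socket, no second copy of the logic."""
    scope = {"type": "http", "asgi": {"version": "3.0"}, "http_version": "1.1",
             "method": "POST", "path": path, "raw_path": path.encode(),
             "query_string": b"", "headers": [(b"content-type", b"application/json")]}
    request = [{"type": "http.request",
                "body": json.dumps(payload).encode(), "more_body": False}]
    chunks = []

    async def receive():
        return request.pop(0)

    async def send(message):
        if message["type"] == "http.response.body":
            chunks.append(message.get("body", b""))

    await app(scope, receive, send)
    return json.loads(b"".join(chunks))


async def create_cardnews_tool(topic: str) -> dict:
    """The tool-side adapter: forwards to REST and reshapes the result, nothing else."""
    resp = await self_post(rest_app, "/api/cardnews-llm",
                           {"workspace": "ws-1", "topic": topic})
    return {"job_id": resp["id"]}


result = asyncio.run(create_cardnews_tool("harness engineering"))
print(result, CREDITS)
```

The adapter never touches `CREDITS` itself, so a second tool built the same way cannot introduce a second charge. In a real service the request would also carry the short-lived token, so the route can apply its usual auth checks.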

Hooks follow the same philosophy. Risky features live behind a default-OFF flag, and with the flag OFF the output is byte-identical to the existing behavior. That separates two concerns: turning on a new capability, and not breaking the old one.
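
A default-OFF hook might look like the sketch below, using the commit-to-main example from earlier. The environment variable `HARNESS_BRANCH_GUARD` and the function names are hypothetical, chosen only for illustration. The property to notice is that with the flag unset the function returns its input untouched.

```python
import os


def branch_guard_enabled() -> bool:
    """Default OFF: the hook only runs when the (hypothetical) env var is set to "1"."""
    return os.environ.get("HARNESS_BRANCH_GUARD", "0") == "1"


def guard_commit(branch: str, task_slug: str) -> str:
    """Return the branch a commit should land on.

    Flag OFF: the input comes back unchanged, so existing behavior is untouched.
    Flag ON: commits aimed at main are rerouted to a feature branch.
    """
    if not branch_guard_enabled():
        return branch
    if branch == "main":
        return f"feature/{task_slug}"
    return branch


# Flag unset: behaves exactly as if the hook did not exist.
os.environ.pop("HARNESS_BRANCH_GUARD", None)
off_result = guard_commit("main", "fix-login")

# Flag on: main is rerouted, other branches pass through.
os.environ["HARNESS_BRANCH_GUARD"] = "1"
on_result = guard_commit("main", "fix-login")
other_result = guard_commit("feature/docs", "fix-login")

print(off_result, on_result, other_result)
```

Putting the flag check first, before any other logic, is what makes the OFF path easy to reason about: nothing new runs, so there is nothing new to break.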

Governance that raises the floor

The real effect of harness engineering is raising a team's floor. Package your strongest engineer's workflow (lint rules, branch strategy, verification steps) as a plugin or skill and ship it, and that discipline is applied automatically no matter who is working. It becomes one of the most effective governance tools a team has.

You can also take a harness verified locally straight to production (dev-prod parity). Instead of standing up a separate RAG server and tuning its scores, you run the same context you reviewed by eye, unchanged.
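
One way to picture that parity is to treat context assembly as a plain deterministic function over the Global, Domain, and Local layers. The sketch below is an assumption about how that could look, not a description of our system: `LAYERS`, its contents, and `fingerprint` are hypothetical, and in practice the layers would more likely be files in version control than a dictionary.

```python
import hashlib

# Hypothetical layer contents, keyed by scope.
LAYERS = {
    "global": {"*": "Never commit directly to main."},
    "domain": {"payments": "Amounts are integers in minor units.",
               "growth": "Experiments need a holdout group."},
    "local": {"billing-api": "Run the migration check before opening a PR."},
}


def build_context(domain: str, repo: str) -> str:
    """Assemble Global, then Domain, then Local guidance, skipping layers with no match."""
    parts = [
        LAYERS["global"].get("*"),
        LAYERS["domain"].get(domain),
        LAYERS["local"].get(repo),
    ]
    names = ("global", "domain", "local")
    return "\n".join(f"[{name}] {text}"
                     for name, text in zip(names, parts)
                     if text is not None)


def fingerprint(context: str) -> str:
    """Short hash for comparing the context reviewed locally with the one in production."""
    return hashlib.sha256(context.encode("utf-8")).hexdigest()[:12]


context = build_context("payments", "billing-api")
print(context)
print(fingerprint(context) == fingerprint(build_context("payments", "billing-api")))
```

Because the same inputs always produce the same text, the context checked on a laptop and the context injected in production can be compared by hash, with no retrieval scoring in between.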

To sum up

Harness engineering isn't "swap in a better model" — it's "design around the model." Layer the context, make knowledge executable, correct with hooks, pin verification in code. That's the path to safer, more consistent results on the same model.

But a good harness has a side effect: every run leaves structured data behind. What if you could use that data to tune a model to your own domain? The final part: open models.
