
On the same model, the prompt is the first place results diverge. Prompting isn't an incantation; it's context design. Here's how the ceiling of a single prompt gave rise to the stages that followed.

TRAIL Labs
Tags: Prompting, LLM, Tool Use, Structured Output

Prompting — The First Way We Drove LLMs

> Part 1 of "The Evolution of Driving LLMs." ① Prompting · ② Vibe coding · ③ Agents · ④ Harness engineering · ⑤ Open models. A stage-by-stage look at how the way we drive LLMs has changed.

Everyone starts the same way with an LLM. You open a chat box and type out what you want. Yet on the same model and the same screen, results vary wildly from person to person. The first fork in the road is the prompt.

Figure: A vague one-line prompt yields uneven results; prompt engineering that designs context, structure, and references stabilizes the output.

People often mistake prompting for "finding the magic spell" — collecting phrases that supposedly make it work. But what actually decides the result isn't the incantation; it's the design of the context you lay down for the model. Prompting is the first stage of driving an LLM — and really, the smallest unit of context engineering.

A prompt isn't a spell — it's context design

A vague line gives a vague result. "Write something on this topic" leaves the model too much freedom — tone, length, and format all get decided on the fly, so it comes out different every time.

A good prompt shrinks the blanks the model has to guess. Fill in what, for whom, in what format, and what to avoid up front, and the variance in the output drops sharply — on the very same model.
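
One way to make "shrinking the blanks" concrete is to treat the four questions as fields that have to be filled in before a prompt is sent. The sketch below is illustrative only: `PromptBrief` and `build_prompt` are hypothetical names, not part of any library or of our pipeline, and the sample values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class PromptBrief:
    """The blanks a vague prompt leaves for the model to guess."""
    task: str                                        # what
    audience: str                                    # for whom
    output_format: str                               # in what format
    avoid: list[str] = field(default_factory=list)   # what to avoid


def build_prompt(brief: PromptBrief) -> str:
    """Render the brief as explicit, labeled instructions."""
    lines = [
        f"Task: {brief.task}",
        f"Audience: {brief.audience}",
        f"Format: {brief.output_format}",
    ]
    if brief.avoid:
        lines.append("Avoid: " + "; ".join(brief.avoid))
    return "\n".join(lines)


# Hypothetical brief, invented for illustration
brief = PromptBrief(
    task="Write a 6-slide card-news deck on prompt design",
    audience="Developers new to LLMs",
    output_format="One title and one two-sentence body per slide",
    avoid=["marketing jargon", "emoji"],
)
print(build_prompt(brief))
```

Because the required fields have no defaults, leaving the task, audience, or format blank fails when the brief is constructed rather than showing up later as a wobbly result.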

Figure: A vague one-line prompt produces output that wobbles in format and quality; designing in context, schema, and references yields verified structured output.

Forcing structure makes results stable

Asking nicely for a format often isn't enough. Even with "give me JSON," the model sometimes prepends an explanation or drifts on line breaks, and the code parsing it keeps breaking.
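
The failure is easy to reproduce without calling a model. The reply string below is made up for illustration, but it has the shape described above: valid JSON wrapped in a friendly preamble.

```python
import json

# A made-up reply of the kind a "give me JSON" prompt can produce
reply = 'Sure! Here are your slides:\n{"slides": [{"title": "Intro", "body": "Hello"}]}'


def parse_strict(text: str):
    """Parse the whole reply as JSON; return None if it is not pure JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(parse_strict(reply))                     # the preamble makes the reply unparseable
print(parse_strict(reply.split("\n", 1)[1]))   # parses only after cutting the preamble by hand
```

Trimming the preamble by hand works for this one reply, but every new variation in how the model wraps its answer needs another patch.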

So in practice we don't ask for a format; we enforce it. Define the output schema as a tool with Anthropic's tool_use (or OpenAI's function calling), force the model to call that tool, and the answer arrives as structured data instead of free text. This is the exact pattern we use in our card-news generation pipeline.

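import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()   # reads ANTHROPIC_API_KEY from the environment


# ① Vague prompt: the format wobbles every time
async def vague_slides() -> str:
    resp = await client.messages.create(
        model="claude-opus-4-8",
        max_tokens=2048,   # required by the Messages API
        messages=[{"role": "user", "content": "make 6 slides on this topic"}],
    )
    return resp.content[0].text   # stray preamble / line breaks → parsing breaks


# ② Pin the schema with tool_use: the model has to answer through this tool
tools = [{
    "name": "emit_slides",
    "description": "Return the slides as structured data.",
    "input_schema": {
        "type": "object",
        "properties": {
            "slides": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "body":  {"type": "string"},
                    },
                    "required": ["title", "body"],
                },
            },
        },
        "required": ["slides"],
    },
}]


async def structured_slides(prompt: str) -> list[dict]:
    resp = await client.messages.create(
        model="claude-opus-4-8",
        max_tokens=2048,
        tools=tools,
        tool_choice={"type": "tool", "name": "emit_slides"},   # must call this tool
        messages=[{"role": "user", "content": prompt}],
    )
    # Pick out the tool_use block instead of assuming it comes first
    call = next(block for block in resp.content if block.type == "tool_use")
    return call.input["slides"]   # already a dict: no string parsing


slides = asyncio.run(structured_slides("make 6 slides on this topic"))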

① and ② are the same model on the same topic. The only difference is whether you designed the output. In ②, there is no free text left to parse, and missing or mistyped fields can be checked against the schema instead of surfacing somewhere downstream. This is the moment you start treating a prompt not as prose but as an interface.
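
Depending on the API and its settings, a forced tool call may not be a hard guarantee that every field matches the schema, so a cheap check on the returned dict can be worth keeping. Below is a minimal sketch using only the standard library. `validate_slides` is a hypothetical helper, the sample inputs are invented, and the empty-string check is stricter than what the schema above actually states.

```python
def validate_slides(payload: dict) -> list[str]:
    """Return the problems found in an emit_slides input; an empty list means OK."""
    slides = payload.get("slides")
    if not isinstance(slides, list):
        return ["'slides' must be a list"]
    problems = []
    for i, slide in enumerate(slides):
        if not isinstance(slide, dict):
            problems.append(f"slide {i}: not an object")
            continue
        for key in ("title", "body"):
            value = slide.get(key)
            # Reject missing keys, non-strings, and blank strings alike
            if not isinstance(value, str) or not value.strip():
                problems.append(f"slide {i}: '{key}' is missing or empty")
    return problems


# Invented inputs: one well-formed, one with an empty title and a missing body
good = {"slides": [{"title": "Intro", "body": "Why prompts matter"}]}
bad = {"slides": [{"title": "", "body": "No title"}, {"title": "Only title"}]}

print(validate_slides(good))
print(validate_slides(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that went wrong in a single pass.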

You have to hand it references to sound "like us"

Even with structure, tone is a separate problem. The model doesn't know "our brand voice." So instead of describing it in words, we include a few real examples that landed well. This is few-shot. Two genuine examples convey tone far more precisely than ten lines of description.
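
As a rough sketch, few-shot can be as simple as replaying past examples as prior turns before the real request. The example texts and the `few_shot_messages` helper are placeholders invented for illustration; the message shape follows the `role`/`content` format used in the code above.

```python
def few_shot_messages(examples: list[tuple[str, str]], request: str) -> list[dict]:
    """Turn (request, approved_output) pairs into prior conversation turns."""
    messages = []
    for past_request, approved_output in examples:
        messages.append({"role": "user", "content": past_request})
        messages.append({"role": "assistant", "content": approved_output})
    # The real request goes last, after the model has "seen" the house tone
    messages.append({"role": "user", "content": request})
    return messages


# Placeholder examples; in practice these would be real pieces that landed well
examples = [
    ("Write a headline about our spring launch", "Spring is here. So is the thing you asked for."),
    ("Write a headline about the pricing update", "Same product. Simpler bill."),
]
messages = few_shot_messages(examples, "Write a headline about the new dashboard")
print(len(messages))
```

The resulting list can be passed as the `messages` argument of a request, so the examples travel with every call instead of living in someone's clipboard.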

Go one step further and, rather than pasting examples by hand each time, you define the brand context once and prepend it to the prompt automatically. In our pipeline we manage this as a memory block, so articles, card news, and detail pages all inherit the same tone context. The prompt shifts from a disposable sentence into reusable context.
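
The idea of defining context once and reusing it can be sketched as below. This is not our pipeline's actual code: `BRAND_CONTEXT`, `FORMAT_RULES`, and `build_system_prompt` are hypothetical names, and the wording of the context is invented.

```python
# Defined once; every content type inherits it
BRAND_CONTEXT = (
    "Voice: plain, direct, lightly playful.\n"
    "Never: hype words, exclamation marks."
)

# Per-format instructions layered on top of the shared context
FORMAT_RULES = {
    "article": "Write long-form prose with section headings.",
    "card_news": "Write 6 slides, each with a title and a short body.",
    "detail_page": "Write a product summary followed by key features.",
}


def build_system_prompt(content_type: str) -> str:
    """Prepend the shared brand context to the format-specific rules."""
    return f"{BRAND_CONTEXT}\n\n{FORMAT_RULES[content_type]}"


for content_type in FORMAT_RULES:
    prompt = build_system_prompt(content_type)
    assert prompt.startswith(BRAND_CONTEXT)   # same tone context everywhere

print(build_system_prompt("card_news"))
```

Changing the voice then means editing one constant, and every format picks up the change on its next request.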

But there are places a single prompt can't reach

Get this far and prompting alone takes you a long way. But a prompt has three structural limits.

  • It's stateless — one request, one response, and that's it. The model doesn't know what it just did, or whether it actually worked.
  • It's toolless — it can't run code, read a file, or check a result. It only answers in its head.
  • There's no verification — when it's wrong, it doesn't know it's wrong. Checking and fixing is entirely on you.

These three limits called the next stages into being. "Don't write it all at once — let the model write code and fix it by running it" became vibe coding, and "let the model use tools itself and iterate by observing results" became agents.
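
To make the contrast concrete, here is a toy sketch of the loop a single prompt lacks: generate, check the result, and feed the failure back. `fake_model` stands in for a real LLM call and is entirely made up, as is `generate_until_valid`; the point is the shape of the loop, not the model.

```python
import json
from typing import Callable


def generate_until_valid(generate: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, verify the output, and retry with the error as feedback."""
    feedback = ""
    for _ in range(max_attempts):
        text = generate(prompt + feedback)
        try:
            return json.loads(text)   # verification step
        except json.JSONDecodeError as err:
            # State carried across attempts: the model is told what went wrong
            feedback = f"\nYour last reply was not valid JSON ({err.msg}). Reply with JSON only."
    raise ValueError(f"no valid output after {max_attempts} attempts")


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM: it only behaves once it has seen the feedback."""
    if "JSON only" in prompt:
        return '{"slides": []}'
    return 'Sure! {"slides": []}'


print(generate_until_valid(fake_model, "make 6 slides"))
```

The loop holds state (the feedback), runs a check (the parse), and retries on failure, which are exactly the three things a single request and response cannot do on its own.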

To sum up

Prompting is the first way we drive LLMs, and still the foundation under every later stage. The key isn't a magic spell — it's context design. Shrink the blanks, enforce structure, hand over references. That alone buys you completely different stability on the same model.

But a single prompt has no state, no tools, no verification. In the next part, we cover the first approach that shook those limits — vibe coding.
