In 2023, “prompt engineer” was listed as a six-figure job title. Courses sold out. Blog posts promised magic formulas. The pitch was simple: learn how to phrase your requests, and AI will do anything.
That pitch is falling apart.
Not because AI has gotten worse, but because the problems people are now trying to solve with AI can’t be fixed by better phrasing. When Andrej Karpathy endorsed the term “context engineering” last year, he was direct about why: “People associate prompts with short task descriptions you’d give an LLM in your day-to-day use. In every industrial-strength LLM app, context engineering is the [real discipline]: the delicate art and science of filling the context window with just the right information for the next step.”
The shift is already happening in practice. Investors are funding “context layer” infrastructure companies because that’s where the actual problems are.
The problem with prompt engineering
Prompt engineering operates on a narrow assumption: the model is capable, and the right phrasing unlocks that capability. Craft the right sentence, and the AI performs.
This works for simple tasks. For isolated questions, one-off summaries, or a quick draft email, prompting is often sufficient. The model has everything it needs in your message.
But AI use cases have grown beyond one-off tasks. According to LangChain’s 2025 State of Agent Engineering report, 57% of organizations now have AI agents in production. These agents coordinate across tools, maintain context across sessions, and need to make decisions based on information that doesn’t fit in a single prompt.
For these applications, prompt engineering addresses the wrong variable. The 32% of organizations citing “output quality” as their top barrier aren’t failing because of bad phrasing. They’re failing because the model doesn’t have access to the right information at the right time. The problem is upstream of the prompt.
You can’t prompt your way out of missing context.
What context engineering actually is
Context engineering is the practice of systematically designing what information an AI model has access to when it generates a response.
Where prompt engineering asks “how should I phrase this?”, context engineering asks “what does the model need to know, and how do I make sure it has that?”
In practice, the context a model receives includes far more than a user’s query. At inference time, an LLM can draw on several layers of information:
- System instructions: Behavioral guidelines, role definitions, constraints
- Conversation history: Prior turns in the current session
- Retrieved knowledge: External documents, database results, API responses
- Long-term memory: Information persisted across sessions
- Tool definitions: Descriptions of functions the model can invoke
- Structured outputs: Format specifications for responses
Prompt engineering touches only the user query layer. Context engineering designs the entire stack.
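The stack above can be pictured as a single assembly step. Here is a minimal sketch (function and field names are hypothetical, loosely following the chat-messages convention many LLM APIs use) showing how each layer contributes to what the model actually sees:

```python
# Hypothetical sketch: assembling the full context stack sent to a model.
# Every name here is illustrative, not a specific vendor's API.

def build_context(system_prompt, history, retrieved_docs, memory, tools, user_query):
    """Combine every context layer into one payload for the model."""
    messages = [{"role": "system", "content": system_prompt}]
    if memory:
        # Long-term memory persisted across sessions
        messages.append({"role": "system",
                         "content": "Known about this user:\n" + "\n".join(memory)})
    messages.extend(history)  # prior turns in the current session
    if retrieved_docs:
        # Retrieved knowledge: documents, database results, API responses
        messages.append({"role": "system",
                         "content": "Relevant documents:\n" + "\n".join(retrieved_docs)})
    messages.append({"role": "user", "content": user_query})  # the one layer prompting touches
    return {"messages": messages, "tools": tools}
```

Note that the user query is one line out of the whole function; everything else is context design.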
The analogy that captures this well: if prompt engineering is writing a good sentence, context engineering is writing the screenplay. The sentence matters, but the screenplay determines whether the scene works at all.
Why agent failures are usually context failures
The clearest way to see this distinction is through failure modes.
When an AI agent produces a wrong answer, the instinct is often to rewrite the prompt. The agent “didn’t understand” the instruction, so you clarify it. Sometimes this helps. More often, you’re treating a symptom rather than the cause.
Most agent failures trace back to one of three context problems:
Missing information. The model gives a confident answer without knowing something relevant. It doesn’t ask because it doesn’t know it doesn’t know. The fix isn’t rephrasing; it’s ensuring the model has access to what it needs before generating a response.
Stale information. The model draws on information that was once accurate but is no longer. Training data has a cutoff, and even retrieved data can be out of date. A customer support agent with access to last year’s pricing will confidently give wrong answers. The fix requires context pipelines that refresh reliably.
Overloaded information. The model has too much context to reason over effectively. Chroma’s research on context rot showed that models drop from 95% to 60-70% accuracy as context length increases, even on trivially simple tasks. Dumping everything into the context window is not the same as giving the model what it needs. Bigger context windows help at the margins; they don’t fix poor context architecture.
These are engineering problems, not prompting problems. They require designing retrieval pipelines, structuring information for AI consumption, managing what gets loaded and when, and building feedback loops to detect when context has gone wrong.
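The "managing what gets loaded" piece can start very simply: rank candidate context and stop at a budget rather than dumping everything in. A toy sketch, assuming chunks arrive pre-sorted by relevance and using a rough four-characters-per-token estimate (both assumptions, not a real tokenizer):

```python
# Toy guard against overloaded context: keep only what fits a token budget.
# Assumes chunks are pre-sorted best-first; uses a crude len/4 token estimate.

def fit_to_budget(chunks, max_tokens=4000):
    """Keep the highest-ranked chunks that fit, instead of dumping everything."""
    kept, used = [], 0
    for chunk in chunks:  # assumed sorted by relevance, best first
        cost = len(chunk) // 4 + 1  # crude stand-in for real token counting
        if used + cost > max_tokens:
            break  # stop before the context window degrades reasoning
        kept.append(chunk)
        used += cost
    return kept
```

A production version would use the model's actual tokenizer and a smarter drop policy, but the principle is the same: loading context is a decision with a cost.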
The gap between demos and production
There’s a reliable pattern in AI development: demos work, and production doesn’t. A prototype built over a weekend impresses stakeholders. The same system deployed to real users starts failing in ways the demo never did.
Context is almost always the reason.
Demos work because they’re built around hand-crafted examples where the context is perfectly controlled. The developer knows exactly what the model needs and provides it manually. Production fails because the model encounters information it wasn’t designed to handle, queries that require data the system doesn’t retrieve, and edge cases that expose gaps in the context design.
The teams that build reliable production AI aren’t necessarily using better models. They’re investing in context infrastructure: retrieval systems that surface the right information, structured formats that models can navigate efficiently, memory layers that persist relevant context across sessions, and routing logic that decides what context to load for each query type.
This is why AI investment has shifted toward “context layer” infrastructure rather than raw model capability. The models are capable. What they need is a reliable supply of the right information.
What good context engineering looks like
The discipline is still maturing, but several approaches have proven reliable:
Retrieval-augmented generation (RAG) as a foundation. Rather than loading entire knowledge bases into context, RAG systems retrieve only the chunks relevant to a specific query. This keeps context focused and manageable. That said, RAG has real limitations: it requires good chunking strategies, accurate embeddings, and queries that can be matched to relevant passages. RAG also fails in predictable ways when queries require synthesizing information across multiple documents or when the retrieval step surfaces the wrong content.
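The shape of the retrieval step can be shown with a toy scorer. Real systems use embedding models and vector indexes; this stand-in uses bag-of-words cosine similarity purely to illustrate retrieving only the top-k relevant chunks rather than the whole knowledge base:

```python
# Toy RAG retrieval: rank chunks against a query and return only the top k.
# Bag-of-words cosine similarity stands in for a real embedding model.
from collections import Counter
import math

def similarity(a, b):
    """Cosine similarity over word-count vectors (embedding stand-in)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=2):
    """Return the top_k most relevant chunks, keeping context focused."""
    ranked = sorted(chunks, key=lambda c: similarity(query, c), reverse=True)
    return ranked[:top_k]
```

Even this toy version exhibits the failure mode described above: if no chunk lexically (or, in real systems, semantically) matches the query, retrieval confidently surfaces the wrong content.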
Structured over raw. Raw text is hard for models to navigate. Structured information with clear hierarchies, labeled fields, and consistent formatting helps models extract what they need without relying on positional attention. Documents transformed into structured representations outperform document dumps in most agentic tasks.
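A small illustration of the difference, with hypothetical field names: the same facts as a raw sentence versus a labeled record, plus a renderer that turns the record into the kind of field-by-field text a model can scan without relying on positional attention:

```python
# Hypothetical example: the same facts, raw vs. structured.
raw = "Acme Pro costs $49/mo, updated 2025-01-10, includes SSO and audit logs."

structured = {
    "product": "Acme Pro",
    "price_usd_per_month": 49,
    "last_updated": "2025-01-10",
    "features": ["SSO", "audit logs"],
}

def render_context(record):
    """Render a structured record as labeled fields for the model."""
    lines = []
    for key, value in record.items():
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        lines.append(f"{key}: {value}")
    return "\n".join(lines)
```

The structured version makes "what is the price?" a field lookup instead of a parsing problem.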
Context as a deliberate design layer. In agent systems, what context gets loaded should be a decision, not an accident. Before a model generates a response, a routing or planning layer should determine what the model needs, retrieve exactly that, and load it in a format optimized for the task. This is more complex to build than a single prompt, but it’s what separates fragile demos from reliable production systems.
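A routing layer can start as simply as keyword dispatch over query types. A hypothetical sketch (the context source names and trigger words are invented for illustration; production routers typically use a classifier or a small model):

```python
# Hypothetical routing layer: decide which context sources to load per query.
# Source names and trigger words are illustrative, not from any real system.

def route(query):
    """Map a query to the context sources it actually needs."""
    q = query.lower()
    if any(word in q for word in ("price", "cost", "plan")):
        return ["pricing_table", "current_promotions"]
    if any(word in q for word in ("error", "bug", "crash")):
        return ["troubleshooting_docs", "recent_incidents"]
    return ["general_faq"]  # default when no specialized context applies
```

The point is not the keyword matching; it is that context loading becomes an explicit, testable decision instead of an accident.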
Portability across tools. Most teams work across multiple AI tools, and context rarely travels between them. Information shared with one tool disappears when you switch to another. Context engineering for teams often means designing explicit context portability: how does context created in one workflow become accessible in others?
Practical steps
If you’re working with AI systems regularly, a few shifts apply immediately:
- Diagnose failures as context problems first. Before rewriting a prompt, ask what information the model was missing or had wrong. Most failures have a context explanation.
- Structure inputs before they hit the model. Raw documents, conversation dumps, and unorganized data all degrade model performance. Investing in preprocessing often yields more improvement than prompt tuning.
- Design what gets retrieved, not just how queries are phrased. In RAG systems, the retrieval configuration (chunk size, embedding model, top-k, filtering logic) matters more than the prompt that wraps the results.
- Track what context you’re actually sending. Logging the full context at inference time is the fastest way to debug failures. Most teams don’t do this and end up guessing at causes.
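That last step requires almost no machinery. A minimal sketch of inference-time context logging, assuming JSON-serializable messages appended to a JSONL file (file name and record shape are arbitrary choices here):

```python
# Minimal inference-time context logger. Assumes messages are JSON-serializable;
# the JSONL format and record fields are arbitrary illustrative choices.
import json
import time

def log_context(path, messages):
    """Append the exact context sent to the model, for later debugging."""
    record = {"ts": time.time(), "messages": messages}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

When an agent fails, replaying the logged context answers "what did the model actually see?" directly, instead of guessing.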
Context containers, like those Wire creates, apply this approach at the infrastructure level: processing documents into structured, queryable context that agents can retrieve precisely rather than in bulk. The principle, though, applies regardless of tooling.
Prompt engineering will remain useful. Knowing how to phrase requests clearly, structure few-shot examples, and write effective system instructions still matters. But it’s a small part of what determines whether an AI system actually works. The larger part is context: what the model knows, when it knows it, and how that information is structured.
That’s what context engineering addresses. It’s not a buzzword. It’s the job.