In June 2025, Shopify CEO Tobi Lütke wrote something that cut through a lot of noise in AI development: “I really like the term ‘context engineering’ over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.”
Andrej Karpathy followed with a similar take, calling context engineering “the delicate art and science of filling the context window with just the right information for the next step.” Simon Willison noted that the term would probably stick because its implied meaning aligns much more closely with the actual work involved.
This wasn’t just a terminology debate. These practitioners were naming something the industry had been struggling to articulate: that the limiting factor for AI agents in production isn’t the model, and it isn’t the prompt. It’s the context.
Why prompt engineering ran out of road
The evidence has been building for a while. LangChain’s 2025 State of Agent Engineering report found that 57% of organizations have AI agents in production, yet 32% cite output quality as their top barrier. When researchers dug into the failures, the same root cause kept surfacing: not the model, not the prompt instructions, but the information environment the model was operating in.
Research suggests chatbots hallucinate in roughly 27% of interactions, with factual errors present in 46% of generated responses when models lack grounding. Most of these errors aren’t random model failures. They’re context failures: the model was asked to reason about something it didn’t have sufficient or relevant information to address.
Meanwhile, companies are spending $37 billion on generative AI in 2026, up from $11.5 billion in 2024. The disconnect between investment and results is real, and context is a significant part of why.
Prompt engineering developed as a discipline when AI systems were relatively simple: write a query, refine it, get a better answer. That approach works in demos. It doesn’t scale to production systems where agents need to navigate dynamic data, long conversation histories, and real user variability.
What context actually includes
A prompt is only one piece of what a language model receives before generating a response. In any production system, the model also receives:
- Conversation history (every message in the current session)
- Retrieved documents (results from RAG, web search, knowledge bases)
- Tool definitions (the schema and description of available functions)
- Memory (information persisted from past sessions)
- State and metadata (current time, user data, session information)
- Output format instructions
Each of these components affects what the model produces. Load too much history and you trigger context rot, where accuracy degrades as context grows. Retrieve the wrong documents and the model confidently answers based on irrelevant content. Leave out key tool definitions and the agent can’t take the actions it needs.
This is a fundamentally different problem from writing a good prompt. It’s an architectural problem. The decisions involved include what information to include, how to structure it, when to retrieve versus cache, how much history to preserve, and which tools to expose at which stages of a workflow.
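To make the architectural framing concrete, here is a minimal sketch of a context assembler that treats those decisions as code: components are labeled, prioritized, and fit to a token budget rather than concatenated blindly. All names and the 4-characters-per-token heuristic are illustrative assumptions, not any particular framework's API.

```python
from dataclasses import dataclass, field

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. A real system would
    # use the target model's tokenizer instead.
    return max(1, len(text) // 4)

@dataclass
class ContextBuilder:
    """Assembles context components under a fixed token budget.

    Components are added with a priority; lower-priority pieces are
    dropped first when the budget runs out. Illustrative sketch only.
    """
    budget: int
    parts: list = field(default_factory=list)  # (priority, label, text)

    def add(self, label: str, text: str, priority: int) -> None:
        self.parts.append((priority, label, text))

    def build(self) -> str:
        # Take highest-priority components first; skip anything that
        # would blow the budget rather than truncating it mid-component.
        selected, used = [], 0
        for priority, label, text in sorted(self.parts, key=lambda p: -p[0]):
            cost = estimate_tokens(text)
            if used + cost > self.budget:
                continue
            selected.append((label, text))
            used += cost
        # Emit labeled sections so the model sees clear boundaries.
        return "\n\n".join(f"## {label}\n{text}" for label, text in selected)

builder = ContextBuilder(budget=200)
builder.add("System instructions", "You are a support agent.", priority=3)
builder.add("Retrieved docs", "Refund policy: 30 days...", priority=2)
builder.add("Full chat history", "x" * 5000, priority=1)  # over budget; dropped
context = builder.build()
```

The point is the shape of the decision, not the heuristic: something in the system has to decide what gets in, in what order, and what gets cut.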
Karpathy described the components of good context engineering as “task descriptions and explanations, few-shot examples, RAG, related multimodal data, tools, state and history, compacting.” That’s not a prompt. That’s a system.
What doesn’t fix context problems
Prompt iteration alone doesn’t work. You can spend weeks refining instructions and still get poor results if the model is receiving irrelevant documents from a misconfigured retrieval system, or a conversation history long enough to dilute its attention across too many tokens.
System prompt bloat is a particularly common failure mode. Teams add more instructions as edge cases emerge, eventually creating prompts so long that the model’s attention is spread thin across them. The instructions exist; the model just stops following all of them reliably. Research on the lost-in-the-middle effect demonstrates that information buried in the middle of long contexts is retrieved at roughly 55-60% accuracy, compared to 70-75% for information placed at the start or end.
Zero-shot approaches, where the model gets minimal context and is expected to perform from instructions alone, work in demos but rarely survive contact with production data. The gap between lab performance and real-world performance is almost always a context gap.
Fine-tuning also doesn’t solve dynamic context problems. It updates what the model knows at training time, not what it receives at inference time. Simon Willison put it directly: “context engineering is what we do instead of fine-tuning.”
What context engineering looks like in practice
Context engineering treats the information environment as the primary engineering surface.
Selective retrieval over bulk loading. Rather than loading everything a model might need upfront, well-designed systems retrieve specific information at the moment it’s needed. A customer service agent shouldn’t receive a customer’s entire history at session start; it should retrieve relevant history when the specific topic comes up. RAG systems implement this, though the quality varies dramatically based on how chunks are structured and how queries are matched.
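A toy version of that principle, using naive keyword overlap in place of embeddings (the scoring method is a stand-in; the data is invented):

```python
def retrieve(query: str, documents: dict, k: int = 2) -> list:
    """Return the IDs of the k most relevant snippets for the query,
    scored by keyword overlap. A real system would use embedding
    similarity, but the principle is identical: fetch only what the
    current turn needs, leave the rest out of context."""
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        overlap = len(q_terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical customer history; only the refund topic should surface.
history = {
    "order-123": "customer asked about refund for order 123",
    "ship-note": "shipping delayed due to weather",
    "pw-reset":  "password reset completed last month",
}
relevant = retrieve("where is my refund", history)
```

The shipping and password entries never enter the context window, which is exactly the behavior bulk loading forfeits.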
Structured context over prose. Raw text documents are hard for models to navigate. Structured, organized information with clear hierarchies performs better: labeled sections, JSON schemas, tables with explicit column headers. The same information in a structured format retrieves more reliably than in a paragraph.
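For example, the same invented refund policy as prose versus as labeled JSON (all content here is made up for illustration):

```python
import json

# Unstructured: one sentence the model must parse on the fly.
prose = (
    "Acme's refund window is 30 days for standard orders and 14 days "
    "for sale items; after that, store credit only."
)

# Structured: each rule is individually addressable, with explicit
# field names and units (days).
structured = {
    "policy": "refunds",
    "rules": [
        {"order_type": "standard", "window_days": 30, "method": "original payment"},
        {"order_type": "sale", "window_days": 14, "method": "original payment"},
        {"order_type": "any", "window_days": None, "method": "store credit"},
    ],
}

# Placed in context under a labeled heading.
context_block = "## Refund policy\n" + json.dumps(structured, indent=2)
```

The facts are identical; the structured form makes each rule separately retrievable and harder to conflate.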
Focused context over maximal context. The research on context rot consistently shows that more context isn’t better when it dilutes signal with noise. A 500-token context with exactly the right information outperforms a 50,000-token context containing that information plus irrelevant surrounding content.
Memory architecture as a design decision. Short-term context (current session), long-term memory (persistent facts about a user or domain), and episodic memory (records of past interactions) each serve different purposes. Agents that work well at scale treat these as distinct stores with different retrieval strategies, not as one undifferentiated stream of text.
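A sketch of what "distinct stores with different retrieval strategies" can mean in code. The class and method names are invented for illustration, not a real framework, and the keyword matching is a stand-in for proper retrieval:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Three separate stores, each queried differently per turn."""
    session: list = field(default_factory=list)   # short-term: raw turns
    facts: dict = field(default_factory=dict)     # long-term: keyed facts
    episodes: list = field(default_factory=list)  # episodic: past summaries

    def context_for_turn(self, query: str, last_n: int = 5) -> dict:
        # Short-term: only the most recent turns, never the full session.
        # Long-term: keyword-matched facts, not the whole store.
        # Episodic: at most one relevant past-session summary.
        q = query.lower()
        return {
            "recent_turns": self.session[-last_n:],
            "facts": {k: v for k, v in self.facts.items() if k in q},
            "episodes": [e for e in self.episodes
                         if q.split()[0] in e.lower()][:1],
        }

mem = AgentMemory()
mem.session += ["user: hi", "agent: hello", "user: my order is late"]
mem.facts["order"] = "Order #881, placed 2024-05-01"
mem.episodes.append("Previous session: user asked about order tracking.")
ctx = mem.context_for_turn("order status")
```

Each store gets its own retrieval rule and its own cap, which is the design decision the undifferentiated-stream approach skips.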
The market is responding to this shift. Berlin-based Qontext raised $2.7 million in February 2026 to build what they describe as “an independent context layer for AI.” Decube raised $3 million for similar infrastructure. The context layer is becoming a distinct technical category, separate from model providers and retrieval systems.
Practical takeaways
The shift from prompt engineering to context engineering reflects how the field has matured. Early AI applications could often succeed on a well-crafted system prompt. Production systems serving real users with real data need more than that.
A few things to apply immediately:
- Audit what your model actually receives before it generates a response. Conversation history, retrieved documents, tool schemas: the full context is often more revealing than the prompt alone.
- Treat context quality as an engineering concern. Instrument it, test it, and iterate on it with the same rigor as application code.
- Design for focus, not comprehensiveness. The goal is the right context at the right time, not the most context possible.
- Structure before stuffing. Convert documents and data into clearly structured formats before putting them in context. The same content retrieves and reasons better when it’s organized.
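The first takeaway, auditing what the model receives, can start as simply as logging per-component sizes for every generation call. A minimal sketch, with an illustrative file path, field names, and the same rough 4-chars-per-token estimate:

```python
import json
import time

def log_context(components: dict, path: str = "context_audit.jsonl") -> dict:
    """Record exactly what the model received for one call.

    Per-component character and approximate token counts make context
    bloat visible over time. Illustrative sketch, not a real library.
    """
    record = {
        "ts": time.time(),
        "components": {
            name: {"chars": len(text), "approx_tokens": len(text) // 4}
            for name, text in components.items()
        },
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

record = log_context({
    "system_prompt": "You are a support agent.",
    "history": "user: hi\nagent: hello",
    "retrieved": "Refund policy: 30 days.",
})
```

Once these records exist, questions like "which component grew 10x last month" become queries instead of guesses.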
Tools like Wire approach this by transforming raw documents into structured, queryable context containers that agents can access dynamically rather than in bulk. But the principle applies regardless of tooling: what separates production AI that works from AI that doesn’t is almost always context architecture, not prompt cleverness.
References
- Tobi Lütke on context engineering (X, June 2025)
- Andrej Karpathy on context engineering (X)
- Simon Willison: Context Engineering
- Phil Schmid: The new skill in AI is not prompting, it’s context engineering
- LangChain: State of Agent Engineering 2025
- Qontext raises $2.7M pre-seed to build context layer for AI
- Stanford: Lost in the Middle (arXiv:2307.03172)
- Context Rot: Why AI performance degrades with more information
- RAG is not enough: when retrieval fails your AI