7 context engineering techniques for on-brand AI support replies
Key takeaway
AI customer support replies sound generic because teams treat tone as a prompt instruction instead of a context selection problem. Models cannot infer brand voice from adjectives like “friendly” or “concise”; they reproduce patterns they actually see in the prompt. What works instead is context engineering: retrieving three to five high-CSAT past replies that match the current ticket's intent, structuring them as input-output pairs, and pruning everything else. This shifts brand voice from a prompt-engineering exercise to a retrieval and structuring problem.
Most AI customer support replies sound generic because teams treat brand voice as a prompt-engineering problem. They write longer system prompts with adjectives like “friendly”, “concise”, and “professional”, and expect the model to translate those instructions into voice. It doesn’t work, because voice isn’t something a language model can infer from a description. Voice lives in the actual sentences your team has written, and the only way to get it into a reply is to put real exemplars into the context.
This is a context engineering problem, not a prompting problem. The shift that matters is treating tone the same way you’d treat any other piece of grounded information the model needs: retrieve it, structure it, and put it where the model will attend to it.
Adjectives don’t compress to behavior in language models. When you tell a model to “write in our brand voice”, it has no representation of your specific voice in its weights. It can only fall back on what “brand voice” plus “friendly” plus “professional” averages out to across its training distribution, which is the same bland, hedged, slightly corporate tone you see across half the help-center copy on the internet. That’s why every AI support tool sounds vaguely similar despite having very different system prompts.
Researchers have shown for years that few-shot examples beat bare instructions for stylistic control. The original GPT-3 paper showed that a handful of in-context exemplars consistently outperformed zero-shot natural-language descriptions on tone, format, and structure tasks, with most of the gain arriving in the first few examples. More recent reproductions on instruction-tuned models show the same pattern: when you can demonstrate, demonstrate. Tone is the canonical case where demonstration wins, because the gap between “describe friendly” and “show friendly” is enormous.
The deeper problem is that “voice” isn’t a single dimension a prompt can target. It’s a bundle of choices about contractions, sentence length, hedging language, when to apologise, how to handle bad news, what to bold, whether to use the customer’s first name. No prompt is going to encode all of that. But a real past reply encodes all of it implicitly, in five sentences.
The reframe that matters: your brand voice is data, not configuration. It already exists, in the form of every reply your support team has ever sent. The job of the AI agent isn’t to invent it from a description; it’s to retrieve the right slice of it and condition on it.
This puts brand voice in the same category as any other piece of grounded context the agent needs. You don’t tell an AI agent to “answer accurately about our pricing”. You give it the pricing page. The same logic applies to voice: don’t describe it, deliver it.
That changes what the agent’s context looks like. Instead of a long system prompt full of style adjectives, you get a short system prompt and a payload of carefully selected exemplar replies, structured to make imitation easy. The system prompt’s job becomes scaffolding (“you are responding to a support ticket; match the style of the examples below”), and the actual stylistic conditioning happens through the exemplars.
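As a before/after, here is roughly what that split looks like. This is a sketch: “Acme” and both prompt strings are illustrative, not copy from any real system.

```python
# Hypothetical before/after for the system prompt itself. The exemplar
# payload (sketched further down) carries the actual style signal.

ADJECTIVE_PROMPT = (
    "You are a friendly, concise, professional support agent for Acme. "
    "Always write in our warm, approachable brand voice."
)  # describes voice; the model averages toward generic help-center tone

SCAFFOLDING_PROMPT = (
    "You are responding to a customer support ticket. "
    "Match the tone, phrasing, and formatting of the example replies below. "
    "Use the glossary for product terminology."
)  # delegates voice to the retrieved exemplars
```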
This also means voice problems become retrieval problems. If the AI sounds wrong, you don’t tweak adjectives in the prompt. You ask: did we retrieve the right exemplars for this ticket? Were they good replies? Are there enough of them, or too many? Are they structured so the model can extract the pattern? Those are answerable questions, with measurable outcomes. “Tone is off” is not.
A support agent that produces consistently on-brand replies needs four distinct context layers, each playing a different role. They’re not interchangeable, and treating them as a single bag of “knowledge base” is one of the main reasons most implementations sound generic.
| Layer | What it provides | How to select | How to structure |
|---|---|---|---|
| Exemplar replies | The voice itself: cadence, hedging, vocabulary, formatting | Filter by intent match + high CSAT + resolved status; cap at 3-5 | Input-output pairs: original ticket → final reply, clearly delimited |
| Macros and canned snippets | Phrasing the team has already approved (refund language, escalation copy, legal-cleared sentences) | All macros relevant to the ticket category | Tagged with usage conditions; not pasted verbatim, used as building blocks |
| Brand glossary | Words to use, words to avoid, how to refer to the product | Static, applies to every reply | Compact list at the top of context, not buried mid-prompt |
| Customer history context | Account state, prior tickets, sentiment, plan tier | Last 2-3 interactions + current account state | Structured summary, not raw transcript dumps |
Most AI support tools collapse these into one layer (usually a RAG blob from a help center) and wonder why the output sounds like it came from a help center. Each layer has a distinct selection rule and a distinct structuring requirement. Mixing them dilutes all four.
The exemplar layer is the one that produces voice. The other three produce accuracy, compliance, and personalisation, but they don’t fix tone on their own. If you only had time to get one layer right, this is the one.
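One way to keep the four layers from collapsing into a single blob is to make them distinct fields in code. A minimal sketch, assuming the structure from the table above; the class, field names, and ordering are illustrative, not a particular framework's API.

```python
from dataclasses import dataclass

@dataclass
class SupportContext:
    exemplars: list[str]      # 3-5 structured past replies (the voice layer)
    macros: list[str]         # approved snippets for this ticket category
    glossary: str             # static terms to use/avoid, same for every reply
    customer_summary: str     # account state + last 2-3 interactions, summarised

def assemble(ctx: SupportContext, ticket: str, system_prompt: str) -> str:
    # Glossary sits near the top; exemplars sit last, just before the ticket,
    # so both land in the high-attention zones at the edges of the context.
    return "\n\n".join([
        system_prompt,
        "GLOSSARY:\n" + ctx.glossary,
        "APPROVED SNIPPETS (building blocks, not verbatim):\n" + "\n".join(ctx.macros),
        "CUSTOMER:\n" + ctx.customer_summary,
        "EXAMPLES TO IMITATE:\n" + "\n\n".join(ctx.exemplars),
        "--- TICKET ---\n" + ticket + "\n--- REPLY ---",
    ])
```

Each layer keeps its own delimiter and its own selection rule upstream, which is exactly what a single help-center RAG blob throws away.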
Exemplar selection is where most “we tried RAG for support” implementations fall apart. The common failure mode is retrieving by recency or by raw embedding similarity, which surfaces replies that look topically similar but were not necessarily good replies. The model imitates whatever it sees, including the bad examples.
The selection rule that works in practice is a layered filter. First, narrow to the same intent: a refund request goes to refund replies, a billing question goes to billing replies. Second, narrow to outcomes that succeeded: the ticket was resolved, the customer didn’t escalate, CSAT was high if you have it. Third, semantic similarity within that pool to find the closest match for the specific ticket. Recency comes last, as a tiebreaker, not a primary filter.
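In code, the layered filter looks something like the sketch below. The record fields (`intent`, `resolved`, `csat`, and so on) and the `embed` function are assumptions standing in for whatever your ticket store and embedding model provide.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_exemplars(ticket: dict, past_replies: list[dict], embed, k: int = 4) -> list[dict]:
    # 1. Same intent: a refund request only sees refund replies.
    pool = [r for r in past_replies if r["intent"] == ticket["intent"]]

    # 2. Successful outcomes only: resolved, not escalated, high CSAT where recorded.
    pool = [
        r for r in pool
        if r["resolved"] and not r["escalated"]
        and (r.get("csat") is None or r["csat"] >= 4)
    ]

    # 3. Semantic similarity within the surviving pool; 4. recency only breaks ties.
    query = embed(ticket["message"])
    pool.sort(
        key=lambda r: (cosine(query, embed(r["message"])), r["created_at"]),
        reverse=True,
    )
    return pool[:k]  # cap at 3-5; more dilutes attention per exemplar
```

In production you would precompute and index the embeddings rather than embedding on every call; the part that matters here is the ordering of the filters, with similarity ranking only the pool that intent and outcome have already vetted.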
The count matters too. Three to five exemplars consistently outperform larger sets, because of how transformer attention distributes across long contexts. The Stanford lost-in-the-middle research (Liu et al., 2023) showed that models attend most strongly to the start and end of the context and underweight the middle. With ten exemplars, the third through eighth get less attention than the first two and the last two. With four exemplars, all four are in the high-attention zone.
This is one of the cleanest applications of context budgets: every additional exemplar beyond five steals attention from the others without adding meaningful new signal. Selection quality scales; exemplar count doesn’t.
Even with the right exemplars, raw transcript dumps don’t work. A typical support ticket has an opening question, three or four back-and-forth messages, internal notes, system events, and a final reply. If you paste all of that into the context, the model has to figure out which part is the input and which part is the output to imitate. Most of the time, it imitates the wrong thing or averages across all of it.
Exemplars need to be pre-structured before they hit the prompt. Each exemplar becomes a clearly delimited pair: the customer’s incoming message (compressed if long), and the team’s final reply (full text). Internal notes and system events are stripped or summarised separately. Delimiters like `<example>...</example>` or `--- TICKET --- / --- REPLY ---` make the structure unambiguous.
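A minimal sketch of that reshaping step, using the delimiters described above. The `compress` argument stands in for whatever summarisation you use on long messages; it and the record fields are assumptions.

```python
def format_exemplar(record: dict, compress, max_ticket_chars: int = 600) -> str:
    """Turn a raw ticket record into a clean input-output pair.

    Internal notes and system events are dropped entirely; only the
    customer's opening message and the team's final reply survive.
    """
    ticket_text = record["customer_message"]
    if len(ticket_text) > max_ticket_chars:
        ticket_text = compress(ticket_text)  # keep the intent, shed the detail

    return (
        "<example>\n"
        f"--- TICKET ---\n{ticket_text}\n"
        f"--- REPLY ---\n{record['final_reply']}\n"
        "</example>"
    )
```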
This is the difference between dumping a database row and engineering a prompt input. The exemplar isn’t the past ticket; it’s the past ticket reshaped into something the model can imitate cleanly. The same principle applies to the brand glossary and customer history layers: structured representations beat raw text every time, especially when multiple sources have to coexist in the same context window.
This is the gap most teams hit when they try to build this from scratch. The retrieval is solvable, but the structuring layer (deciding what to include from each ticket, how to delimit it, how to compress the customer message without losing intent) is where the work compounds. The Wire how-to guide on answering support emails walks through one concrete implementation: a Wire container holds the exemplar pool, the agent retrieves and structures past replies on each ticket, and the final draft is produced from those structured exemplars rather than from a long system prompt.
If you’re trying to fix generic-sounding AI replies, work through these in order:

1. Build the exemplar pool: collect resolved, high-CSAT past replies and tag each one by intent.
2. Fix retrieval: filter by intent match, then by outcome, then rank by semantic similarity, with recency only as a tiebreaker.
3. Cap the selection at three to five exemplars per ticket.
4. Structure each exemplar as a clearly delimited input-output pair, stripping internal notes and system events.
5. Keep the other layers separate: glossary compact and near the top, macros tagged with usage conditions, customer history as a structured summary.
6. Shrink the system prompt to scaffolding that points at the exemplars.
7. When the tone is off, debug the exemplar pool and the retrieval rule, not the adjectives.
This stack does what no prompt instruction can: it conditions the model on your actual writing, on tickets that resemble the one in front of it, in a structure it can imitate. That’s what brand voice in AI support replies actually is. Not a setting. Not a prompt. A retrieval and structuring problem.
The teams getting this right have stopped iterating on system prompts and started iterating on exemplar pools, retrieval rules, and context structure. The teams still iterating on adjectives are the ones whose AI still sounds like AI.
Sources: Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) · Language Models are Few-Shot Learners (Brown et al., 2020) · Qualtrics 2026 Consumer Experience Trends · Wire: how to answer support emails
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container