How to connect AI to private data safely

Jitpal Kocher · 7 min read

Key takeaway

Most enterprise AI data exposure comes from how context reaches the model, not from the model itself. Research shows 27.4% of corporate data pasted into AI tools qualifies as sensitive, and RAG systems routinely bypass document-level permissions during vector embedding. Five context engineering patterns (pre-retrieval authorization, context redaction, tool-based access, role-based isolation, and context auditing) let teams give AI access to private data without exposing what it shouldn't see.

Employees are already feeding your company data to AI. Cyberhaven’s research found that 77% of employees have shared sensitive company data via AI tools, and 27.4% of corporate data pasted into AI qualifies as sensitive, up from 10.7% a year earlier. The average organization now sees 223 data policy violations per month involving generative AI applications.

Blocking AI access to company data entirely does not work either, because employees route around it with personal accounts and unsanctioned tools, which is worse.

The risk is in the context, not the model

Most AI data exposure does not come from model vulnerabilities or sophisticated attacks. It comes from what enters the context window. When someone pastes a customer list into ChatGPT, or when a RAG pipeline retrieves an HR document for a sales query, the exposure happens before the model generates a single token. Once data is in the context, you have lost control of it.

Many RAG implementations introduce an access control gap. When documents are chunked and embedded into vector databases, the metadata containing access control lists can get stripped away or flattened during the embedding process. Without explicit safeguards, a RAG system may serve documents to users who would not have access to the originals. AWS’s security team and Kiteworks both flag this pattern: unlike traditional search engines, RAG can return results directly to the LLM, bypassing permission checks at the original data source.

IBM’s 2025 Cost of a Data Breach report found that 97% of organizations that suffered AI-related breaches lacked proper access controls. Breaches involving shadow AI cost $4.63 million on average, $670,000 more than standard incidents.

This is a context engineering problem. Securing AI data access means controlling what enters the context window before the model ever sees it.

Five patterns that work

The teams getting enterprise AI data access right have converged on a set of context engineering patterns. None of these require specialized infrastructure. They require thinking about context as a security boundary.

1. Scope before you retrieve

The core principle of secure AI data access is: if unauthorized content never enters the context window, the model cannot leak it. This means applying authorization checks before retrieval, not after.

In practice, this means the retrieval layer checks user permissions and filters results before they reach the model. Pinecone’s guidance on RAG access control recommends attaching access control metadata to every vector and filtering at query time. AWS recommends applying authorization and residency filters before similarity search, or at minimum as a pre-rerank step.

The alternative, retrieving everything and filtering afterward, means the model has already seen the data. Post-retrieval filtering is not a security control.
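As a minimal sketch of pre-retrieval filtering, consider a toy in-memory retriever where ACL metadata travels with each chunk and the permission check runs before similarity scoring. The `Chunk` type, `allowed_groups` field, and dot-product scoring are illustrative assumptions, not any specific vendor's API:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    embedding: list        # placeholder vector
    allowed_groups: set    # ACL metadata attached at ingest time

def retrieve(query_embedding, chunks, user_groups, top_k=3):
    """Apply the authorization filter *before* similarity ranking,
    so unauthorized chunks never reach the model's context window."""
    authorized = [c for c in chunks if c.allowed_groups & user_groups]

    def score(c):
        # Toy similarity: dot product against the query embedding.
        return sum(a * b for a, b in zip(query_embedding, c.embedding))

    return sorted(authorized, key=score, reverse=True)[:top_k]
```

With a real vector store, the same shape applies: the ACL predicate becomes a metadata filter passed to the query, so the database never even scores chunks the user cannot see.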

2. Redact before you augment

Even within authorized data, not every field belongs in the context. A customer support agent answering a billing question needs account status and plan type. It does not need social security numbers, payment card details, or internal notes from legal.

Data redaction before augmentation strips sensitive fields before they enter the prompt. This can be as simple as removing PII columns from structured data, or as sophisticated as automated classification and masking pipelines. Amazon Bedrock and similar platforms offer built-in guardrails that scan and redact sensitive information in real time. The principle is the same regardless of tooling: minimize what the model sees to what it actually needs.
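A simple version of this combines a field allowlist with pattern-based masking. The field names and the SSN regex below are hypothetical placeholders; a production pipeline would use your actual schema and a real classification service:

```python
import re

# Hypothetical allowlist for a billing-support task — adjust per use case.
ALLOWED_FIELDS = {"account_status", "plan_type"}
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_record(record: dict) -> dict:
    """Keep only the fields the task needs, then mask any residual
    PII-shaped strings before the record enters the prompt."""
    kept = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    return {
        k: SSN_PATTERN.sub("[REDACTED]", v) if isinstance(v, str) else v
        for k, v in kept.items()
    }
```

The allowlist does most of the work; the pattern pass is a backstop for sensitive values that leak into otherwise-allowed free-text fields.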

3. Use tools, not raw access

When an AI agent queries data through a defined tool interface (like an MCP server), every request is structured, auditable, and filterable. The agent asks a specific question and gets a specific answer. It never browses a raw database or file system.

This is fundamentally different from giving an agent read access to an entire knowledge base. Tool-based access creates a natural chokepoint where you can enforce permissions, log queries, rate-limit requests, and shape what context the agent receives. Context containers, like those created by Wire, implement this pattern by transforming raw documents into structured context that agents query through defined MCP tools rather than direct data access.
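The chokepoint idea can be sketched as a narrow tool class that checks permissions, logs every call, and returns only the structured answer, never the raw row. This is an illustrative stand-in, not an actual MCP server implementation; the class and method names are assumptions:

```python
class BillingTool:
    """A narrow tool interface: the agent asks a defined question and
    gets a scoped answer. Every call passes through one enforcement point."""

    def __init__(self, db: dict, authorized_callers: set):
        self.db = db
        self.authorized = authorized_callers
        self.calls = []  # audit trail of every request, allowed or denied

    def get_plan(self, caller: str, account_id: str) -> str:
        if caller not in self.authorized:
            self.calls.append((caller, account_id, "denied"))
            raise PermissionError(f"{caller} is not authorized")
        self.calls.append((caller, account_id, "ok"))
        # Return only the requested field — the agent never sees the raw record.
        return self.db[account_id]["plan"]
```

Note that the denied call is still logged: the audit trail captures attempts, not just successes, which is what makes the chokepoint useful for detection as well as enforcement.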

The contrast matters: 90% of agents are over-permissioned, holding roughly 10x more privileges than required. Tool-based access makes least-privilege the default rather than something you have to engineer after the fact.

4. Isolate context per role

Different users and different tasks should get different context. A customer support agent, a sales assistant, and an engineering copilot should not share a context boundary, even if they serve the same organization.

Context isolation means each use case operates within its own scoped data boundary. This limits blast radius: if one agent’s context is compromised or misconfigured, it does not expose data from other workflows. The isolation can be physical (separate databases, separate vector stores) or logical (namespace-level separation with enforced access controls).

This pattern is already standard practice in traditional application security. Applying it to AI context is the same principle: least privilege, applied to what the model sees rather than what APIs it can call.

5. Audit what enters the window

Only 21% of executives report complete visibility into agent permissions and data access patterns. Most organizations log API calls and model outputs. Almost none log what context the model actually received at inference time.

This is the observability gap. If you do not know what went into the context window, you cannot diagnose why the model produced a particular output, whether sensitive data was exposed, or whether context poisoning occurred. Logging the full context payload (or a hashed/summarized version for compliance) at inference time is how you detect scope creep, permission bypass, and data leakage before they become incidents.
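A minimal sketch of inference-time context logging, assuming a SHA-256 hash of the payload is enough for compliance (so raw sensitive text never lands in the log). The function and field names are illustrative:

```python
import hashlib
import time

audit_log: list[dict] = []

def log_context(request_id: str, context_chunks: list[str]) -> str:
    """Record a hash of the exact context sent to the model, so later
    exposure questions can be answered without storing sensitive text."""
    payload = "\n".join(context_chunks)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    audit_log.append({
        "request_id": request_id,
        "context_sha256": digest,
        "chunk_count": len(context_chunks),
        "timestamp": time.time(),
    })
    return digest
```

Because identical context produces an identical hash, the log also reveals scope creep: if two requests from different roles hash to the same payload, the isolation boundary between them is worth inspecting.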

What does not work

Relying on model instructions. Telling a model “do not share confidential information” in the system prompt is not a security control. Models follow instructions probabilistically. Prompt injection, context poisoning, and simple edge cases can override behavioral instructions. If the data is in the context, instruction-level guardrails are your last and weakest line of defense.

All-or-nothing access. Granting an agent read access to “everything in the knowledge base” and hoping it only uses what is relevant is how 88% of organizations end up with AI agent security incidents. The agent will process everything it can see. If it can see sensitive data, that data is at risk.

Retroactive monitoring without pre-retrieval controls. Monitoring model outputs for sensitive data is useful for detection, but by the time sensitive content appears in an output, the exposure has already happened. The model processed the data, and depending on the provider, it may have been cached, logged, or used for training. Prevention at the retrieval layer is always preferable to detection at the output layer.

Start with the context boundary

The path to safe AI data access is not blocking AI from your data entirely. That approach fails because employees route around it with shadow AI, which is more dangerous than sanctioned access with controls.

Instead, treat context as a security perimeter. Scope what enters the window. Redact what does not belong. Use tools instead of raw access. Isolate by role. Audit what the model sees. These are not theoretical patterns. They are how the organizations with the lowest AI incident rates are already operating.

The data your team needs to work with AI is already sensitive. The context engineering question is how you deliver it safely.


Sources: Cyberhaven: AI Insider Threats · Metomic: Hidden Data Leakage Crisis · IBM: Cost of a Data Breach 2025 · Pinecone: RAG Access Control · AWS: RAG Authorization · Kiteworks: Secure RAG Pipelines · Gravitee: State of AI Agent Security 2026 · Obsidian Security: AI Agent Landscape · CrowdStrike: Data Leakage

Frequently asked questions

How do I prevent AI from leaking company data?
The most effective approach is pre-retrieval authorization: filter sensitive data before it enters the AI's context window, not after. Combine this with data redaction (stripping PII before augmentation), tool-based access instead of raw database queries, and logging what context the model actually sees at inference time. Relying on model instructions alone ('don't share this') is not reliable.
Is RAG secure for enterprise data?
Standard RAG implementations have a known access control gap. Vector databases often flatten document permission hierarchies during embedding, meaning a RAG system may serve documents to users who would not have access to the originals. Secure RAG requires enforcing authorization before augmentation, applying access filters at the retrieval layer before similarity search runs.
Can AI agents access data they shouldn't?
Yes. Research shows 90% of AI agents are over-permissioned, holding roughly 10x more privileges than their tasks require. The default configuration for most agent frameworks grants broad read access rather than scoping to specific resources. This is a context engineering problem: the agent is authorized, but it receives far more context than it needs.
What is the safest way to give AI access to internal documents?
Use a tool-based interface rather than direct database or file system access. When agents query through defined tools (like MCP servers), every request is auditable, responses can be filtered, and the agent never sees raw data it wasn't asked about. Combine this with context isolation, where each use case gets its own scoped data boundary.

Ready to give your AI agents better context?

Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.

Create Your First Container