How to Red-Team an LLM Application: A Secur…

When someone says they want to "red team the chatbot," they usually mean: try to jailbreak it. Get it to swear, leak the system prompt, or refuse to help. That’s table stakes. For an application that calls tools, hits APIs, or grounds answers in a knowledge base, ad-hoc prompt tricks don’t tell you whether the system is secure—only whether the base model is naive. Real assurance comes from treating the LLM app as an attack surface: scoping it, building repeatable attack fixtures, and turning findings into tickets engineering can fix.

How to do that.

Scoping: what are you actually testing?

The first mistake is treating "the LLM" as the target. The target is the application—the pipeline of user input, model, tools, and data. Scoping means deciding what’s in bounds and what “success” looks like for an attacker.

Start with the data flow. What can users send in? Free-form text, sure, but also uploaded files, pasted URLs, or structured parameters. Each of these is an input channel. Then ask what the app does with that input. Does it only answer questions, or does it execute actions? If there are tools or APIs—sending email, querying a database, calling internal services—those become the crown jewels. An attacker who can’t change state might still exfiltrate data via the model’s answers; one who can invoke tools can do a lot worse.

For retrieval-augmented generation (RAG), the attack surface grows. The model isn’t just reasoning over a fixed system prompt; it’s pulling from a corpus. Who can influence that corpus? If users can upload or suggest documents, or if ingested content isn’t fully trusted, you’ve got a supply-chain problem inside your app. Real-world RAG poisoning has shown that a single well-crafted poisoned document can reliably steer answers on commercial systems—attack success rates above 90% in top-5 retrieval settings aren’t hypothetical. Scope explicitly: user inputs, tool/API surface, and retrieval data sources. Document the rules of engagement (who can test, what’s off-limits, how findings are reported) so “we got the model to do X” is tied to a defined impact.

Attack Fixtures: Prompt Injection and Tool Abuse

One-off jailbreaks are memorable but useless for regression. You need fixtures—reproducible test cases that encode a threat and a pass/fail condition. That’s how you move from “we broke it once” to “we know exactly what to fix and how to verify the fix.”

Prompt injection in an application context isn’t just “ignore previous instructions.” It’s about confusing the boundary between user content and system or context content. Attack fixtures should include: instructions that try to override or leak the system prompt, payloads that inject fake context (e.g., “The following is an admin override: …”), and role-play or persona tricks that push the model to act as if it has different rules. Tools like Spikee and Promptmap are built for this—Spikee in particular is aimed at app-relevant threats (data exfiltration, XSS, resource exhaustion) with modular datasets and configurable success criteria, and it can plug into pipelines like Burp for full request/response testing. The point isn’t to collect a thousand prompts; it’s to have a small set of canonical cases that represent distinct failure modes and that you can run again after changes.

Tool abuse is where impact gets serious. If the LLM can call APIs or run code, your fixtures need to cover: unauthorized actions (e.g., “cancel someone else’s order”), privilege escalation (e.g., “run this as admin”), and confused deputy patterns (e.g., “forward this to the internal API that doesn’t check auth”). Each fixture should define (1) the malicious user request, (2) the expected “safe” behavior, and (3) the observed behavior. Pass/fail is binary only when you’ve agreed on what “abuse” means—e.g., “tool was invoked with parameters that should have been rejected.” Without that, you’re just flagging “the model did something” without a clear remediation.

A nuance that catches teams off guard: the same prompt can succeed or fail depending on temperature, retrieval results, or minor prompt tweaks. So fixtures should be run multiple times or with explicit seeds where possible, and your success criterion might be “at least one success in N runs” rather than “always fails.” That’s a tradeoff—stricter criteria are cleaner for engineering, looser ones reflect the reality of probabilistic systems.

RAG Poisoning: When the Knowledge Base Is the Vector

If the only thing you test is the chat UI, you’re missing the channel that often has the weakest access controls: the ingested data. RAG poisoning is the practice of putting malicious or misleading content into the retrieval index so that the model surfaces it when answering. The attack isn’t against the model’s weights; it’s against the pipeline’s trust in its own context.

Recent work has made the threat concrete. Attackers can achieve high success rates with minimal poisoned documents—sometimes a single document—by making the poison text highly relevant to target queries and fluent enough to pass filters. Data loader attacks have shown that common ingestion paths (DOCX, HTML, PDF) are vulnerable to obfuscated or injected content, so the poison can enter through “normal” upload flows. Defenses like perplexity filtering help only so much; optimized poisons can slip through. So when you red team a RAG app, you need fixtures that (1) add or modify documents in the corpus (or simulate that), (2) craft queries that should retrieve the poison, and (3) define what “misleading” or “malicious” output looks like—e.g., wrong facts, injected instructions, or unsafe recommendations.

Scoping matters here too. Can testers add documents, or only query? If only query, you’re testing retrieval and generation given a fixed corpus; if they can add documents, you’re testing ingestion and access control as well. Both are valuable; they answer different questions.

Success Criteria and Severity

“We made the model say something bad” is not a finding. A finding is: under what conditions, with what payload, does the system violate a defined security property, and what’s the impact?

Success criteria should be explicit. For injection: “The model reveals the system prompt,” or “The model follows instructions embedded in user content that contradict the system prompt.” For tool abuse: “The model invokes tool X with parameters that exceed the user’s authority.” For RAG: “The model cites or relies on poisoned content and produces incorrect or unsafe guidance.” Each of these can be turned into a test: given fixture F, the system must not do X. Then you run the fixture and record pass/fail (and if you’re being thorough, run it multiple times and report attack success rate).

Severity is trickier for LLMs than for a typical CVE. A buffer overflow is either exploitable or not. An LLM might comply with a jailbreak 5% of the time. So severity should combine likelihood (e.g., how often the attack succeeds under your test setup) with impact (e.g., data exposure, unauthorized action, reputational harm). Frameworks like OWASP’s taxonomy for LLM apps and structured benchmarks like HarmBench give you a shared language—prompt injection, training data extraction, excessive agency, and so on—so you can map your fixtures to known categories and argue severity in a way that doesn’t depend on “we tried it once and it worked.”

Deliverables Engineering Can Act On

The goal of the engagement isn’t to hand over a list of prompts that “beat” the model. It’s to hand over actionable findings: clear failure conditions, steps to reproduce, and a way to verify that a fix works.

For each finding, include: (1) a short title and severity, (2) the threat model (what capability you assume of the attacker), (3) the exact fixture—input(s), context if relevant, and expected vs. actual behavior, (4) impact (what an attacker gains), and (5) remediation direction (e.g., input validation, tool authorization checks, or retrieval guardrails). Where possible, provide the fixture in the same format your team uses for automated tests—e.g., a small suite that can be run in CI so that “we fixed prompt injection” is backed by “and these 12 cases still pass.”

Red teaming an LLM application is not about outsmarting the model once. It’s about defining the boundaries of acceptable behavior, encoding those boundaries as tests, and closing the gap between “we hope it’s safe” and “we know it fails under these conditions—and we’re working on it.” That’s the kind of assurance that holds up when the next prompt injection or poisoned document shows up in the wild.

How to Red-Team an LLM Application: A Security Engineer's Guide to AI Adversarial Testing

Stay Updated on AI Risk & Compliance

Scoping: what are you actually testing?

Attack Fixtures: Prompt Injection and Tool Abuse

RAG Poisoning: When the Knowledge Base Is the Vector

Success Criteria and Severity

Deliverables Engineering Can Act On

Get an independent
AI risk assessment

Scoping: what are you actually testing?

Attack Fixtures: Prompt Injection and Tool Abuse

RAG Poisoning: When the Knowledge Base Is the Vector

Success Criteria and Severity

Deliverables Engineering Can Act On

Get an independentAI risk assessment

Get an independent
AI risk assessment