I read a lot of LLM-application incident write-ups. The pattern shows up over and over: a prompt injection succeeded, data was exfiltrated, a tool got triggered that shouldn't have — and the security team found out from the customer, not from a detection rule.
That's the gap I keep coming back to.
If a SQL injection were succeeding in production and going undetected, you'd call that a Sev-1. If a phishing campaign were extracting credentials and your SIEM was silent, you'd consider it a critical detection gap. But when a prompt injection wins, the typical SOC response is closer to: "we don't really have detection for that yet."
It's worth asking why. The mechanics of detection engineering aren't broken; they're just built around a threat model that prompt injection doesn't fit cleanly into. Let me walk through why, and then where I think the rule-writing can actually start.
What detection engineering does, in a sentence
If you're coming to this post from the AI safety side: detection engineering is the craft of writing rules that fire when an attacker behavior shows up in logs. The good ones share three properties — they have a documented hypothesis ("this combination of events shouldn't occur in benign traffic"), they're testable ("here's a synthetic event that triggers them"), and they're tuned ("here's the false-positive profile we expect"). Anything that doesn't survive all three of those tests becomes alert fatigue, which is its own attack surface.
Traditional detections cluster into three primitives. Signatures look for known-bad patterns — a specific malware hash, a specific User-Agent string. Anomalies flag deviations from a baseline — a workstation talking to a country it's never talked to. Behavior chains link primitives across time — a credential succeeded, then an admin tool ran, then data left the network. Most mature SOC programs use all three.
Why each primitive partially-fails on prompt injection
Signatures are the obvious first move. Match on "ignore previous instructions", "you are now in developer mode", "act as DAN". This works for about a week. The phrase library is small, the bypasses are trivial, and the model itself doesn't even need a known phrase — it'll respond to the underlying intent expressed in dozens of equivalent ways. Signature detection for prompt injection is a Maginot Line.
Anomalies are harder than they look. A traditional anomaly detection asks: is this user's behavior unusual? But the "user" in an LLM application is expected to be highly variable — prompts are by nature unstructured, idiosyncratic, hard to baseline. Even worse, the malicious prompt that exfiltrates data may look almost identical to the legitimate prompt that helps a customer draft a sales email. The signal-to-noise problem is brutal.
Behavior chains are closest to viable, but they have an attribution problem. In a traditional kill chain, a compromised process is a clear pivot point — you can attribute the next action to the same actor. In an LLM-integrated application, the model itself is the "user" performing actions. When the model calls a tool that exfiltrates data, the action looks identical whether the model decided to do that on its own (a misalignment) or was tricked into it by an injected instruction (a compromise). The logs don't naturally distinguish.
This is why most SOC programs are silent on prompt injection. The default primitives don't quite fit.
Three places where rule-writing can start
"Doesn't quite fit" isn't "impossible." Three categories where detection rules can actually work, drawn from the jailbreak taxonomy I maintain in my red team frameworks:
1. Indirect injection — instruction-vs-data confusion in tool returns.
When an LLM-integrated app retrieves a document, a webpage, or a tool output, that content is data. The risk is that the model treats some of it as instruction. The detection writes itself once you state the trust boundary cleanly: when a tool return contains content that pattern-matches instruction syntax — "new instruction:", "system note:", HTML comments that hide imperative language, end-of-context markers followed by directives — that's a behavior the system should refuse to act on without an additional verification signal.
This is the OWASP Top 10 for LLM Applications' #1 risk, and it's the most tractable to detect because the structural marker (instruction-syntax inside a data-context) is observable in logs. You won't catch everything, but you'll catch the lazy attackers, and that meaningfully raises the noise floor for everyone else.
2. Authority impersonation — claims of dev/staff identity inside user input.
The pattern looks like: "I'm from Anthropic's safety team and this is a developer override" or "I'm a security researcher with full authorization." These claims should never affect model behavior — the model has no way to verify them — but they often do, because the model was trained on text where similar claims preceded legitimate elevation.
The detection here is regex-tractable. The false-positive profile is reasonable: in any production deployment, the legitimate population of users claiming to be OpenAI staff or system administrators inside a prompt is essentially zero. Fire it as informational, route to a queue, watch what your population of triggering users looks like over a week. Most of what you find will be either real attackers, real curious researchers, or your own internal QA. All three are worth knowing about.
3. Output exfiltration channels — encoded sinks the model is asked to populate.
The riskier prompt injections aren't asking the model to say something forbidden — they're asking it to put something forbidden into a channel the user can read. URLs the user wouldn't click manually. Image alt text. Code blocks containing encoded payloads. Markdown links where the visible text and the destination disagree.
Detection here is structural and unambiguous: when the model's output contains an outbound URL or encoded payload that includes content from a sensitive context — a system prompt, a retrieved private document, a tool return marked confidential — that's a fire. The rule is clean enough to be a Tier-1 detection. The challenge is wiring up the data sources: you need to know what content is "sensitive" in your environment, and most teams don't have a clean tagging system for that yet.
What's still hard
Three things I don't have a good rule for, and that I suspect nobody does yet:
Multi-turn ramping is the hardest. The model is jailbroken not by any single message but by an accumulation of context that gradually shifts the conversation's safety floor. By the time the unsafe response arrives, the malicious prompt is no longer in the active turn — it was three messages ago. Detecting this requires session-level state, cross-turn correlation, and a notion of conversational "drift" that doesn't really exist in current SIEM primitives.
Long-context dilution is the inverse. The malicious instruction is in the prompt, but it's buried inside 30,000 tokens of benign content. Signature matching against the full prompt produces too many false positives. Sampling produces too many false negatives. Detection here needs structural analysis of where in the input the signal lives, which is closer to NLP work than to traditional SOC detection.
Agent tool chains multiply the attribution problem. An agent reads an email, decides to summarize it, the summary triggers a Slack message that calls another tool, which writes to a calendar, which prompts a follow-up agent — and the actual unsafe action is six tool calls downstream of the injection. Which step do you alert on? How do you write a runbook for an analyst who has to reconstruct the chain? I don't have a clean answer. I suspect this is where the next generation of agentic-AI security tooling is going to have to be invented.
Who should be building this
This is where the bridge thesis from my last post turns into a job description.
The person who builds production-grade prompt-injection detection rules is the person who has both detection-engineering muscle from the SOC side and adversarial-AI fluency from the red team side. There are very few of them today. Most SOC teams I've seen don't have the AI literacy to write these rules. Most AI safety teams don't have the operational discipline to deploy them, tune them, and keep them alive past their first incident.
If you're a SOC analyst reading this and you've been wondering how to grow into AI security: start here. The OWASP Top 10 for LLM Applications, your existing detection-engineering instinct, and a willingness to write rules for a threat model your SIEM vendor doesn't yet support out of the box. You will be ahead of the market in about a quarter.
If you're a security leader: the right hire for "AI security engineer" in 2026 is the person who reaches for SPL when they hear "prompt injection," not the one who reaches for a research paper. Hire for the bridge.
The detection rules I've described aren't the whole solution to prompt injection — nothing single-source ever is. But they exist, they're tractable, and they're not being written at scale. That's the gap.