Safety & Alignment
Prompt Injection
An attack where malicious instructions in user input override an AI system's original instructions.
Prompt injection is a security vulnerability where an attacker embeds instructions in user input that hijack the behavior of an LLM. If a chatbot is told "Ignore previous instructions and do X," it may comply — especially if the system doesn't have strong safeguards.
Indirect prompt injection is even more dangerous: malicious instructions hidden in retrieved web pages, documents, or emails that an LLM reads as part of its normal operation.
Why it matters: prompt injection is the equivalent of SQL injection for LLMs. It can leak system prompts, exfiltrate data, or cause tools to misfire.
Defenses include input sanitization, clear instruction hierarchies, output validation, and treating all user input as untrusted. It's still largely an unsolved problem in production AI.