Safety & Alignment

Adversarial Attack

An input crafted to fool a machine learning model into making a wrong prediction, typically by adding a perturbation that is imperceptible to humans.

Adversarial attacks are inputs specifically designed to deceive ML models. In computer vision, a carefully chosen perturbation, sometimes affecting only a few pixels, can make a model confidently misclassify an image. In NLP, subtle rephrasing can change a model's output dramatically.
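For concreteness, here is a minimal sketch of one widely used attack, the Fast Gradient Sign Method (FGSM), written against PyTorch. The model, labels, epsilon budget, and [0, 1] pixel range are illustrative assumptions, not part of the definition.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """FGSM sketch: nudge every input value in the direction that most
    increases the loss, bounded by epsilon (illustrative budget)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step each pixel by +/- epsilon along the sign of the input gradient.
    x_adv = x + epsilon * x.grad.sign()
    # Assumes inputs are normalized to the [0, 1] range.
    return x_adv.clamp(0, 1).detach()
```

Despite the tiny perturbation, the perturbed image often receives a confident but incorrect label, which is what makes these attacks hard to spot by eye.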

These attacks reveal that neural networks learn surprisingly brittle features. They're a major concern for deployed systems, especially in security-critical applications like facial recognition, content moderation, and fraud detection.

Common defensive techniques include adversarial training (training on adversarial examples, sketched below), input sanitization, and certified robustness methods.
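As a hedged sketch of the first of these, adversarial training can be as simple as generating adversarial examples on the fly and including them in each training step. The single-step FGSM attack, the 50/50 clean/adversarial mix, and the epsilon budget below are illustrative choices, not a fixed recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
    """One adversarial-training step: craft FGSM examples for the current
    batch, then update on a mix of clean and perturbed inputs."""
    # Craft adversarial examples with a single FGSM step (assumed attack).
    x_pert = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_pert), y).backward()
    x_adv = (x_pert + epsilon * x_pert.grad.sign()).clamp(0, 1).detach()

    # Supervised update on both clean and adversarial batches, so the model
    # keeps natural accuracy while gaining robustness to perturbed inputs.
    optimizer.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) \
         + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```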

Adversarial robustness remains an open problem. For modern LLMs, prompt injection and jailbreaks are text-based adversarial attacks that bypass safety training.
