Adversarial Attack
An input crafted to fool a machine learning model into making a wrong prediction, often imperceptible to humans.
Adversarial attacks are inputs specifically designed to deceive ML models. In computer vision, a few carefully chosen pixels can make a model confidently misclassify an image. In NLP, subtle rephrasing can change a model's output dramatically.
These attacks reveal that neural networks learn surprisingly brittle features. They're a major concern for deployed systems, especially in security-critical applications like facial recognition, content moderation, and fraud detection.
Adversarial robustness remains an open problem. Modern LLM attacks include prompt injection and jailbreaks — text-based adversarial attacks that bypass safety training.