CLIP (Contrastive Language-Image Pretraining)
A vision-language model that learns shared representations of images and text so they can be compared in the same embedding space.
CLIP is a model introduced by OpenAI in 2021 that learns to align images and text in the same embedding space. It is trained with a contrastive objective on roughly 400 million image-caption pairs, so that matching images and captions end up close together as vectors while mismatched pairs are pushed apart.
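The core idea can be sketched with toy vectors. This is a minimal illustration, not real CLIP: the 3-dimensional embeddings below are made up stand-ins for the outputs of CLIP's image and text encoders, and the scoring is plain cosine similarity.

```python
import math

def normalize(v):
    # Scale a vector to unit length so the dot product equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Hypothetical stand-ins for encoder outputs (real CLIP embeddings are
# hundreds of dimensions and produced by trained networks).
image_embeddings = {
    "photo_of_dog": [0.9, 0.1, 0.0],
    "photo_of_cat": [0.1, 0.9, 0.0],
}
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.2, 0.8, 0.1],
}

# After contrastive training, each image should be most similar
# to its matching caption.
matches = {}
for img, iv in image_embeddings.items():
    scores = {txt: cosine(iv, tv) for txt, tv in text_embeddings.items()}
    matches[img] = max(scores, key=scores.get)
    print(img, "->", matches[img])
```

Because both modalities live in one space, the same similarity score powers retrieval in either direction: text queries can rank images, and images can rank captions.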
This was a major milestone for multimodal AI because it let models connect visual content and natural language directly, without a fixed, hand-labeled class taxonomy. A text description can retrieve relevant images, and image features can be used to guide generation systems.
Why CLIP matters: it helped make modern image search, prompt-based image generation, and cross-modal retrieval far more effective.
What CLIP Is Commonly Used For
- Image-text retrieval — search images with natural language queries
- Prompt conditioning — guide image generation systems
- Zero-shot classification — classify images without task-specific retraining
- Cross-modal embeddings — align vision and language meaning
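The zero-shot classification item above works by comparing an image embedding against text embeddings of prompts like "a photo of a {label}". The sketch below shows only that decision step, with hypothetical precomputed similarity scores in place of real encoder outputs; the 0.07 temperature mirrors the scale CLIP's learned logit scaling typically settles near, but the exact value here is illustrative.

```python
import math

def softmax(scores, temperature=0.07):
    # CLIP divides similarities by a learned temperature before softmax,
    # sharpening the distribution over candidate labels.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one image embedding and the
# text embeddings of "a photo of a dog" / "... cat" / "... car".
labels = ["dog", "cat", "car"]
sims = [0.31, 0.24, 0.08]

probs = softmax(sims)
prediction = labels[probs.index(max(probs))]
print(prediction)
```

No retraining is involved: swapping in a different label list changes the classifier, which is what makes the approach "zero-shot".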
CLIP influenced a large wave of multimodal systems and became a foundational component in many text-to-image workflows.