CLIP (Contrastive Language-Image Pretraining)
A vision-language model that learns shared representations of images and text so they can be compared in the same embedding space.
CLIP is a model introduced by OpenAI in 2021 that learns to align images and text in the same embedding space. It is trained with a contrastive objective on roughly 400 million image-caption pairs, so that matching images and captions end up close together as vectors while mismatched pairs are pushed apart.
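The core idea can be sketched with toy vectors. This is a minimal illustration, not real CLIP: the 3-dimensional embeddings below are made up stand-ins for the outputs of CLIP's image and text encoders, and the scoring is plain cosine similarity.

```python
import math

def normalize(v):
    # Scale a vector to unit length so the dot product equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Hypothetical stand-ins for encoder outputs (real CLIP embeddings are
# hundreds of dimensions and produced by trained networks).
image_embeddings = {
    "photo_of_dog": [0.9, 0.1, 0.0],
    "photo_of_cat": [0.1, 0.9, 0.0],
}
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.2, 0.8, 0.1],
}

# After contrastive training, each image should be most similar
# to its matching caption.
matches = {}
for img, iv in image_embeddings.items():
    scores = {txt: cosine(iv, tv) for txt, tv in text_embeddings.items()}
    matches[img] = max(scores, key=scores.get)
    print(img, "->", matches[img])
```

Because both modalities live in one space, the same similarity score powers retrieval in either direction: text queries can rank images, and images can rank captions.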
This was a major milestone for multimodal AI because it let models connect visual content and natural language directly, without a fixed, hand-labeled class taxonomy. A text description can retrieve relevant images, and image features can be used to guide generation systems.
Why CLIP matters: it helped make modern image search, prompt-based image generation, and cross-modal retrieval far more effective.
What CLIP Is Commonly Used For
- Image-text retrieval — search images with natural language queries
- Prompt conditioning — guide image generation systems
- Zero-shot classification — classify images without task-specific retraining
- Cross-modal embeddings — align vision and language meaning
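The zero-shot classification item above works by comparing an image embedding against text embeddings of prompts like "a photo of a {label}". The sketch below shows only that decision step, with hypothetical precomputed similarity scores in place of real encoder outputs; the 0.07 temperature mirrors the scale CLIP's learned logit scaling typically settles near, but the exact value here is illustrative.

```python
import math

def softmax(scores, temperature=0.07):
    # CLIP divides similarities by a learned temperature before softmax,
    # sharpening the distribution over candidate labels.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one image embedding and the
# text embeddings of "a photo of a dog" / "... cat" / "... car".
labels = ["dog", "cat", "car"]
sims = [0.31, 0.24, 0.08]

probs = softmax(sims)
prediction = labels[probs.index(max(probs))]
print(prediction)
```

No retraining is involved: swapping in a different label list changes the classifier, which is what makes the approach "zero-shot".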
CLIP influenced a large wave of multimodal systems and became a foundational component in many text-to-image workflows.