Vision Language Models: How Machines Learned to See and Understand
Breaking down Vision Language Models into their core components—vision encoders, text encoders, fusion mechanisms—and the two main paradigms: contrastive learning (CLIP-style) and generative models.
What VLMs Actually Do
Vision Language Models (VLMs) represent both images and text in a shared semantic space. That's the key insight: pixels and words can be encoded so that the text "a photo of a cat" and an actual image of a cat end up close together in embedding space.
This dual understanding enables models to:
- Answer questions about images
- Generate captions
- Find images based on text descriptions
- Compare visual and textual concepts
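To make "close together in embedding space" concrete, here is a tiny sketch using cosine similarity. The vectors below are made up for illustration; in a real VLM they would come from the model's vision and text encoders.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings (real ones would come from the VLM's encoders).
image_of_cat   = np.array([0.9, 0.1, 0.3])
text_cat_photo = np.array([0.8, 0.2, 0.25])   # "a photo of a cat"
text_car_photo = np.array([0.1, 0.9, 0.6])    # "a photo of a car"

print(cosine_similarity(image_of_cat, text_cat_photo))  # high: matching concepts
print(cosine_similarity(image_of_cat, text_car_photo))  # lower: unrelated concepts
```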
The Three Core Components
1. Vision Encoder
The vision encoder processes images into embeddings—numerical representations that capture visual meaning.
Common architectures:
- ViT (Vision Transformer): Splits an image into patches and treats them like tokens
- CNN backbones: Traditional convolutional networks like ResNet
The flow:
Pixel Input → Vision Encoder → Visual Tokens/Features
For example, CLIP was released with both ViT and ResNet backbones (its best-known checkpoints use ViT), while many earlier vision-language models relied on CNN backbones.
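Here is a minimal sketch of the ViT-style front end: split the image into fixed-size patches, flatten each patch, and project it to an embedding so the patches can be fed to a Transformer just like word tokens. The image size, patch size, and embedding dimension are illustrative choices, not any specific model's configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and project each one to an embedding (ViT-style)."""
    def __init__(self, image_size=224, patch_size=16, channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution implements "flatten each patch + linear projection" in one step.
        self.proj = nn.Conv2d(channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, 224, 224) -> (batch, embed_dim, 14, 14)
        x = self.proj(images)
        # Flatten the spatial grid into a sequence of patch tokens: (batch, 196, embed_dim)
        return x.flatten(2).transpose(1, 2)

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768])
```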
2. Text Encoder / Decoder
This handles the language side using a Transformer architecture, either an encoder (for matching) or a decoder (for generation).
Key processes:
- Tokenization: breaking text into units
- Embedding: converting tokens to vectors
- Language generation: producing coherent text output
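A sketch of the first two steps, using a toy whitespace tokenizer and a learned embedding table. Real VLMs use subword tokenizers (BPE, SentencePiece) and a full Transformer on top, but the shape of the data flowing through is the same.

```python
import torch
import torch.nn as nn

# Toy vocabulary and whitespace tokenizer (real models use subword tokenizers such as BPE).
vocab = {"<unk>": 0, "a": 1, "photo": 2, "of": 3, "cat": 4, "dog": 5}

def tokenize(text: str) -> torch.Tensor:
    return torch.tensor([vocab.get(w, 0) for w in text.lower().split()])

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)

token_ids = tokenize("a photo of a cat")   # tensor([1, 2, 3, 1, 4])
token_vectors = embedding(token_ids)       # shape: (5, 512)
print(token_ids, token_vectors.shape)
# A Transformer encoder (or decoder, for generation) would process these vectors next.
```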
3. Fusion Mechanism
This is where vision and text actually interact. There are three main approaches:
Early Fusion: Combines modalities at the input level before processing
Mid Fusion: Uses cross-attention between vision and text during processing, so the two modalities "talk" to each other in the middle layers (sketched in code after this list)
Late Fusion: Separate encoders process each modality independently, then combine results at the end
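Mid fusion is where most of the interesting machinery lives, so here is a minimal cross-attention sketch: text hidden states act as queries, visual tokens act as keys and values, and each text position gathers the image information it needs. It uses PyTorch's built-in multi-head attention; the batch size and dimensions are illustrative.

```python
import torch
import torch.nn as nn

batch, num_text, num_patches, dim = 2, 12, 196, 768

text_states   = torch.randn(batch, num_text, dim)      # from the text encoder
visual_tokens = torch.randn(batch, num_patches, dim)   # from the vision encoder

# Cross-attention: text positions query the visual tokens (keys/values).
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
fused, attn_weights = cross_attn(query=text_states, key=visual_tokens, value=visual_tokens)

print(fused.shape)         # (2, 12, 768)  -> text states, now conditioned on the image
print(attn_weights.shape)  # (2, 12, 196)  -> how much each text token attends to each patch
```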
Two Major Paradigms
1. Contrastive Learning (CLIP-Style)
This approach learns by matching image-text pairs.
Image → Vision Encoder → Image Embedding
Text → Text Encoder → Text Embedding
→ Compute Similarity Score
The training objective:
- Matching pairs ("dog photo" + actual dog image) should have high similarity
- Non-matching pairs should be far apart in embedding space
- Uses contrastive loss (InfoNCE) to enforce this
This is how CLIP works: it learns to align visual and textual representations from raw image-caption pairs, with no explicit class labels.
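A condensed sketch of a CLIP-style symmetric contrastive (InfoNCE) objective: embeddings are L2-normalized, every image is compared against every text in the batch, and the matching pairs sit on the diagonal of the similarity matrix. The temperature value here is illustrative, not CLIP's learned parameter.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings."""
    # Normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; entry [i, j] compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text, so the correct "class" is the diagonal index.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss)
```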
2. Generative Models
These models take images and generate text descriptions or answers.
Image → Vision Encoder → Visual Tokens
[Visual Tokens + Text Tokens] → LLM → Generated Text
The LLM receives both visual tokens and text tokens as input, allowing it to:
- Answer questions about images
- Describe what it sees
- Reason about visual content
Examples include GPT-4V, LLaVA, and other multimodal chatbots.
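A rough sketch of the input side of a LLaVA-style generative VLM: visual features are passed through a small projector into the LLM's embedding space, then concatenated with the text token embeddings before the decoder generates its answer. The projector, dimensions, and sequence lengths here are illustrative assumptions, not any specific model's code.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096
batch, num_patches, num_text = 1, 196, 32

# Visual features from the vision encoder; text embeddings from the LLM's embedding table.
visual_features = torch.randn(batch, num_patches, vision_dim)
text_embeddings = torch.randn(batch, num_text, llm_dim)

# A small projector (an MLP in LLaVA-style models) maps visual features into the LLM's space.
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
visual_tokens = projector(visual_features)            # (1, 196, 4096)

# Prepend the visual tokens to the text tokens; the LLM then decodes the answer autoregressively.
llm_inputs = torch.cat([visual_tokens, text_embeddings], dim=1)   # (1, 228, 4096)
print(llm_inputs.shape)
```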
Why This Architecture Works
The breakthrough is treating vision and language as different views of the same semantic space. A photo of a sunset and the phrase "beautiful orange sky at dusk" aren't fundamentally different concepts—they're just different modalities expressing the same meaning.
By learning this shared representation, VLMs can bridge the gap between what we see and how we describe it.
The Two Paths Forward
Contrastive models excel at retrieval and matching tasks—finding images from text descriptions or vice versa.
Generative models excel at understanding and reasoning—answering complex questions about visual content.
Both approaches have their place, and many modern systems combine elements of both.