2026-01-08

Vision Language Models: How Machines Learned to See and Understand

Breaking down Vision Language Models into their core components—vision encoders, text encoders, fusion mechanisms—and the two main paradigms: contrastive learning (CLIP-style) and generative models.

Tags: Vision Language Models, VLM, Machine Learning, CLIP, Transformers, Computer Vision, Learning In Public

What VLMs Actually Do

Vision Language Models (VLMs) represent both images and text in the same semantic space. That's the key insight: pixels and words can be embedded so that "a photo of a cat" and an actual image of a cat end up close together mathematically.

This dual understanding enables models to:

  • Answer questions about images
  • Generate captions
  • Find images based on text descriptions
  • Compare visual and textual concepts
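To make the shared space concrete, here's a minimal sketch using Hugging Face's transformers CLIP implementation. The checkpoint name is just one publicly available option, and "cat.jpg" is a placeholder path.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Embed an image and two candidate captions, then compare them in the shared space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder path
inputs = processor(text=["a photo of a cat", "a photo of a car"],
                   images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Higher score = the caption's embedding sits closer to the image embedding.
print(outputs.logits_per_image.softmax(dim=-1))

If the photo really is a cat, the first caption should get most of the probability mass; that's the shared semantic space doing its job.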

The Three Core Components

1. Vision Encoder

The vision encoder processes images into embeddings—numerical representations that capture visual meaning.

Common architectures:

  • ViT (Vision Transformer): Splits the image into fixed-size patches and treats them like tokens (sketched below)
  • CNN backbones: Traditional convolutional networks like ResNet

The flow:

Pixel Input → Vision Encoder → Visual Tokens/Features

For example, CLIP's most widely used checkpoints use a ViT image encoder, while earlier vision-language models typically relied on ResNet-style CNN backbones.
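To show what "patches as tokens" means, here's a toy patch-embedding sketch in PyTorch. The 224x224 input, 16-pixel patches, and 768-dimensional projection are common ViT defaults used purely for illustration; a real ViT also adds positional embeddings and runs the tokens through transformer layers.

import torch

# Split a 224x224 image into 16x16 patches, flatten each patch, and
# project it to the model dimension: one "visual token" per patch.
image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch, d_model = 16, 768

patches = image.unfold(2, patch, patch).unfold(3, patch, patch)                 # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)   # (1, 196, 768)

proj = torch.nn.Linear(3 * patch * patch, d_model)
visual_tokens = proj(patches)
print(visual_tokens.shape)            # torch.Size([1, 196, 768])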

2. Text Encoder / Decoder

This handles the language side with a transformer: contrastive models like CLIP use a text encoder, while generative VLMs typically use a full language model as the decoder.

Key processes:

  • Tokenization: breaking text into subword units (sketched after this list)
  • Embedding: converting tokens to vectors
  • Language generation: producing coherent text output
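As a quick illustration of the tokenization step, the sketch below runs a caption through a pretrained tokenizer. The CLIP checkpoint name is just an example; any tokenizer from the transformers library behaves similarly.

from transformers import AutoTokenizer

# Break the caption into subword tokens and map each one to an integer id.
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
encoded = tokenizer("a photo of a cat", return_tensors="pt")

print(encoded["input_ids"])  # integer ids the embedding layer turns into vectors
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))  # the subword pieces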

3. Fusion Mechanism

This is where vision and text actually interact. There are three main approaches:

Early Fusion: Combines modalities at the input level before processing

Mid Fusion: Uses cross-attention between vision and text during processing, so the two modalities "talk" to each other in the middle layers (sketched below)

Late Fusion: Separate encoders process each modality independently, then combine results at the end
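Here's a minimal mid-fusion sketch, assuming standard PyTorch multi-head attention: the text tokens act as queries and attend over the visual tokens, so each text position can pull in image information. All shapes are illustrative.

import torch

# Cross-attention: text queries, visual keys/values.
d_model = 768
cross_attn = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_tokens = torch.randn(1, 12, d_model)     # 12 text tokens
visual_tokens = torch.randn(1, 196, d_model)  # 196 image patch tokens

fused, _ = cross_attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
print(fused.shape)  # torch.Size([1, 12, 768]): each text token now carries visual context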

Two Major Paradigms

1. Contrastive Learning (CLIP-Style)

This approach learns by matching image-text pairs.

Image → Vision Encoder → Image Embedding
Text → Text Encoder → Text Embedding
→ Compute Similarity Score

The training objective:

  • Matching pairs ("dog photo" + actual dog image) should have high similarity
  • Non-matching pairs should be far apart in embedding space
  • Uses contrastive loss (InfoNCE) to enforce this

This is how CLIP works: it learns to align visual and textual representations from raw image-text pairs, without explicit class labels.
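Here's a small sketch of that symmetric contrastive objective, with random tensors standing in for the encoder outputs. The batch size, embedding dimension, and temperature are arbitrary illustrative values.

import torch
import torch.nn.functional as F

# CLIP-style InfoNCE over a batch of N image-text pairs.
N, d = 8, 512
img_emb = F.normalize(torch.randn(N, d), dim=-1)   # stand-in image embeddings
txt_emb = F.normalize(torch.randn(N, d), dim=-1)   # stand-in text embeddings
temperature = 0.07

logits = img_emb @ txt_emb.t() / temperature       # (N, N) similarity matrix
targets = torch.arange(N)                          # matching pair i sits on the diagonal

loss = (F.cross_entropy(logits, targets) +         # image -> text direction
        F.cross_entropy(logits.t(), targets)) / 2  # text -> image direction
print(loss.item())

Minimizing this loss pulls each image toward its own caption (the diagonal of the similarity matrix) and pushes it away from every other caption in the batch.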

2. Generative Models

These models take images and generate text descriptions or answers.

Image → Vision Encoder → Visual Tokens
[Visual Tokens + Text Tokens] → LLM → Generated Text

The LLM receives both visual tokens and text tokens as input, allowing it to:

  • Answer questions about images
  • Describe what it sees
  • Reason about visual content

Examples include GPT-4V, LLaVA, and other multimodal chatbots.
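Here's a rough sketch of how the visual tokens reach the LLM in a LLaVA-style setup: a small learned projector maps them into the language model's embedding space, and they're concatenated with the embedded text prompt. The dimensions and modules are stand-ins, not any particular model's actual sizes.

import torch

# Project vision-encoder outputs into the LLM embedding space, then
# prepend them to the text embeddings the LLM would normally consume.
d_vision, d_llm = 1024, 4096

visual_tokens = torch.randn(1, 196, d_vision)   # from the vision encoder
projector = torch.nn.Linear(d_vision, d_llm)    # learned vision-to-LLM adapter
visual_embeds = projector(visual_tokens)        # (1, 196, 4096)

text_embeds = torch.randn(1, 12, d_llm)         # embedded prompt tokens
llm_input = torch.cat([visual_embeds, text_embeds], dim=1)  # (1, 208, 4096)
print(llm_input.shape)  # this sequence is what the LLM decodes an answer from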

Why This Architecture Works

The breakthrough is treating vision and language as different views of the same semantic space. A photo of a sunset and the phrase "beautiful orange sky at dusk" aren't fundamentally different concepts—they're just different modalities expressing the same meaning.

By learning this shared representation, VLMs can bridge the gap between what we see and how we describe it.

The Two Paths Forward

Contrastive models excel at retrieval and matching tasks—finding images from text descriptions or vice versa.

Generative models excel at understanding and reasoning—answering complex questions about visual content.

Both approaches have their place, and many modern systems combine elements of both.

