Vision Language Models: How Machines Learned to See and Understand
Breaking down Vision Language Models into their core components—vision encoders, text encoders, fusion mechanisms—and the two main paradigms: contrastive learning (CLIP-style) and generative models.
What VLMs Actually Do
Vision Language Models (VLMs) represent both images and text in a shared semantic space. That's the key insight: pixels and words can be encoded so that the text "a photo of a cat" and an actual image of a cat end up close together in embedding space.
This dual understanding enables models to:
- Answer questions about images
- Generate captions
- Find images based on text descriptions
- Compare visual and textual concepts
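To make "close together in embedding space" concrete, here is a tiny sketch using cosine similarity. The vectors below are made up for illustration; in a real VLM they would come from the model's vision and text encoders.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings (real ones would come from the VLM's encoders).
image_of_cat   = np.array([0.9, 0.1, 0.3])
text_cat_photo = np.array([0.8, 0.2, 0.25])   # "a photo of a cat"
text_car_photo = np.array([0.1, 0.9, 0.6])    # "a photo of a car"

print(cosine_similarity(image_of_cat, text_cat_photo))  # high: matching concepts
print(cosine_similarity(image_of_cat, text_car_photo))  # lower: unrelated concepts
```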
The Three Core Components
1. Vision Encoder
The vision encoder processes images into embeddings—numerical representations that capture visual meaning.
Common architectures:
- ViT (Vision Transformer): Splits an image into patches and treats them like tokens
- CNN backbones: Traditional convolutional networks like ResNet
The flow:
Pixel Input → Vision Encoder → Visual Tokens/Features
For example, CLIP was released with both ViT and ResNet backbones (its best-known checkpoints use ViT), while many earlier vision-language models relied on CNN backbones.
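Here is a minimal sketch of the ViT-style front end: split the image into fixed-size patches, flatten each patch, and project it to an embedding so the patches can be fed to a Transformer just like word tokens. The image size, patch size, and embedding dimension are illustrative choices, not any specific model's configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and project each one to an embedding (ViT-style)."""
    def __init__(self, image_size=224, patch_size=16, channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution implements "flatten each patch + linear projection" in one step.
        self.proj = nn.Conv2d(channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, 224, 224) -> (batch, embed_dim, 14, 14)
        x = self.proj(images)
        # Flatten the spatial grid into a sequence of patch tokens: (batch, 196, embed_dim)
        return x.flatten(2).transpose(1, 2)

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768])
```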
2. Text Encoder / Decoder
This handles the language side using a Transformer architecture, either an encoder (for matching) or a decoder (for generation).
Key processes:
- Tokenization: breaking text into units
- Embedding: converting tokens to vectors
- Language generation: producing coherent text output
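A sketch of the first two steps, using a toy whitespace tokenizer and a learned embedding table. Real VLMs use subword tokenizers (BPE, SentencePiece) and a full Transformer on top, but the shape of the data flowing through is the same.

```python
import torch
import torch.nn as nn

# Toy vocabulary and whitespace tokenizer (real models use subword tokenizers such as BPE).
vocab = {"<unk>": 0, "a": 1, "photo": 2, "of": 3, "cat": 4, "dog": 5}

def tokenize(text: str) -> torch.Tensor:
    return torch.tensor([vocab.get(w, 0) for w in text.lower().split()])

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)

token_ids = tokenize("a photo of a cat")   # tensor([1, 2, 3, 1, 4])
token_vectors = embedding(token_ids)       # shape: (5, 512)
print(token_ids, token_vectors.shape)
# A Transformer encoder (or decoder, for generation) would process these vectors next.
```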
3. Fusion Mechanism
This is where vision and text actually interact. There are three main approaches:
Early Fusion: Combines modalities at the input level before processing
Mid Fusion: Uses cross-attention between vision and text during processing, so the two modalities "talk" to each other in the middle layers (sketched in code after this list)
Late Fusion: Separate encoders process each modality independently, then combine results at the end
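Mid fusion is where most of the interesting machinery lives, so here is a minimal cross-attention sketch: text hidden states act as queries, visual tokens act as keys and values, and each text position gathers the image information it needs. It uses PyTorch's built-in multi-head attention; the batch size and dimensions are illustrative.

```python
import torch
import torch.nn as nn

batch, num_text, num_patches, dim = 2, 12, 196, 768

text_states   = torch.randn(batch, num_text, dim)      # from the text encoder
visual_tokens = torch.randn(batch, num_patches, dim)   # from the vision encoder

# Cross-attention: text positions query the visual tokens (keys/values).
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
fused, attn_weights = cross_attn(query=text_states, key=visual_tokens, value=visual_tokens)

print(fused.shape)         # (2, 12, 768)  -> text states, now conditioned on the image
print(attn_weights.shape)  # (2, 12, 196)  -> how much each text token attends to each patch
```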
Two Major Paradigms
1. Contrastive Learning (CLIP-Style)
This approach learns by matching image-text pairs.
Image → Vision Encoder → Image Embedding
Text → Text Encoder → Text Embedding
→ Compute Similarity Score
The training objective:
- Matching pairs ("dog photo" + actual dog image) should have high similarity
- Non-matching pairs should be far apart in embedding space
- Uses contrastive loss (InfoNCE) to enforce this
This is how CLIP works: it learns to align visual and textual representations from raw image-caption pairs, with no explicit class labels.
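A condensed sketch of a CLIP-style symmetric contrastive (InfoNCE) objective: embeddings are L2-normalized, every image is compared against every text in the batch, and the matching pairs sit on the diagonal of the similarity matrix. The temperature value here is illustrative, not CLIP's learned parameter.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings."""
    # Normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; entry [i, j] compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text, so the correct "class" is the diagonal index.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss)
```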
2. Generative Models
These models take images and generate text descriptions or answers.
Image → Vision Encoder → Visual Tokens
[Visual Tokens + Text Tokens] → LLM → Generated Text
The LLM receives both visual tokens and text tokens as input, allowing it to:
- Answer questions about images
- Describe what it sees
- Reason about visual content
Examples include GPT-4V, LLaVA, and other multimodal chatbots.
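A rough sketch of the input side of a LLaVA-style generative VLM: visual features are passed through a small projector into the LLM's embedding space, then concatenated with the text token embeddings before the decoder generates its answer. The projector, dimensions, and sequence lengths here are illustrative assumptions, not any specific model's code.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096
batch, num_patches, num_text = 1, 196, 32

# Visual features from the vision encoder; text embeddings from the LLM's embedding table.
visual_features = torch.randn(batch, num_patches, vision_dim)
text_embeddings = torch.randn(batch, num_text, llm_dim)

# A small projector (an MLP in LLaVA-style models) maps visual features into the LLM's space.
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
visual_tokens = projector(visual_features)            # (1, 196, 4096)

# Prepend the visual tokens to the text tokens; the LLM then decodes the answer autoregressively.
llm_inputs = torch.cat([visual_tokens, text_embeddings], dim=1)   # (1, 228, 4096)
print(llm_inputs.shape)
```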
Why This Architecture Works
The breakthrough is treating vision and language as different views of the same semantic space. A photo of a sunset and the phrase "beautiful orange sky at dusk" aren't fundamentally different concepts—they're just different modalities expressing the same meaning.
By learning this shared representation, VLMs can bridge the gap between what we see and how we describe it.
The Two Paths Forward
Contrastive models excel at retrieval and matching tasks—finding images from text descriptions or vice versa.
Generative models excel at understanding and reasoning—answering complex questions about visual content.
Both approaches have their place, and many modern systems combine elements of both.