Thoughts on whatever I build, break, and learn in AI, engineering, and more.
A raw, practical breakdown of the heartbeat mechanism — what it is, how it works, and how I implemented it during my internship at AI Planet this week.
A beginner-friendly breakdown of CLAUDE.md and AGENTS.md — what they are, how they differ, and the simple setup that future-proofs your workflow across any AI coding tool.
How vLLM's paged attention borrows virtual memory concepts from operating systems to solve the KV cache memory fragmentation problem — making LLM inference faster and more memory-efficient at scale.
A clear, visual explanation of the KV Cache — the optimization that makes autoregressive text generation fast by storing Key and Value vectors instead of recomputing them for every new token.
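The core trick this post describes fits in a toy sketch: at each decoding step, only the new token's Key and Value vectors are computed and appended to a cache, and attention reuses every earlier entry instead of recomputing it. A minimal single-head version (made-up 2-d vectors standing in for real projections, not the post's code):

```python
import math

def attend(q, keys, values):
    # One query attends over every cached position: dot-product
    # scores -> softmax weights -> weighted sum of value vectors.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

k_cache, v_cache = [], []
# Each tuple stands in for the (K, V, Q) a real model would project
# from the newest token's hidden state.
steps = [([1.0, 0.0], [0.5, 0.5], [1.0, 1.0]),
         ([0.0, 1.0], [0.2, 0.8], [1.0, 0.0]),
         ([1.0, 1.0], [0.9, 0.1], [0.0, 1.0])]
for k, v, q in steps:
    # Append only the new token's K and V; all older entries are
    # reused, never recomputed -- that reuse is the KV cache.
    k_cache.append(k)
    v_cache.append(v)
    out = attend(q, k_cache, v_cache)
```

Without the cache, every step would recompute K and V for the whole prefix, which is what makes naive autoregressive generation quadratic in practice.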
AI memory isn't like human memory — models forget everything. What we call memory is actually smart storing, searching, and injecting context at the right time using external systems.
A clear breakdown of what the encoder and decoder each do in a Transformer — their internal structure, how multi-head self-attention works, what cross-attention is, and when you'd use encoder-only vs decoder-only vs full encoder-decoder models.
How self-attention produces contextual embeddings by computing a weighted sum of value vectors — and what it means that the same word gets a different representation depending on the sentence it appears in.
A deep dive into the softmax function — why it's used in self-attention, how it converts raw dot product scores into probabilities, and why the numerically stable variant (subtracting the max) matters in practice.
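The numerically stable variant mentioned here is small enough to show inline (a generic sketch, not the post's code): subtracting the max before exponentiating leaves the output unchanged, because the constant cancels in the ratio, while keeping every exponent non-positive so `exp()` never overflows.

```python
import math

def softmax(scores):
    # Naive exp() overflows for large scores; shifting by the max
    # changes nothing mathematically (the shift cancels in the
    # ratio) but keeps every exponent <= 0, so exp() stays finite.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scores this large make the unshifted version overflow to inf;
# the stable variant handles them fine.
weights = softmax([1000.0, 1001.0, 1002.0])
```

The result is identical to `softmax([0.0, 1.0, 2.0])`, which is exactly why the trick is safe.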
A full step-by-step numerical walkthrough of self-attention using the sentence "please study man" — computing Q, K, V vectors, raw attention scores, softmax weights, and final contextual output vectors from scratch.
A deep intuitive breakdown of the Q, K, V mechanism in self-attention — using a database retrieval analogy and real weight matrix math to show exactly how Transformers decide which words to attend to.
A deep dive into the Transformer architecture introduced in the landmark 2017 paper — what it is, how it works, why it replaced RNNs, and why every modern AI model from GPT to Gemini traces its roots here.
Breaking down the difference between training, fine-tuning, and inference — why they're not the same thing, what actually happens in each stage, and why understanding this makes LLM systems way less confusing.
Why tokenization is the most underrated part of LLMs — how tokens aren't words, why they affect cost and performance, and why bad tokenization breaks everything downstream.
A complete breakdown of encoder-decoder architectures — how they compress sequences into context vectors, generate outputs step-by-step, why teacher forcing matters, and the four key limitations that led to attention mechanisms.
Building a fine-tuned AI to translate legal jargon into plain English — from FLAN-T5 failures to Gemma-2B success using QLoRA on a free GPU, and the engineering lessons learned along the way.
How I exposed my portfolio blog system as an MCP server so Claude could operate it with natural language — and the 5 small but painful bugs that stood in the way.
An honest, unstructured brain dump about embeddings, vector databases, and re-ranking — from confusion about what the numbers mean to understanding coordinates, similarity search, and retrieval optimization.
Breaking down quantization from scary optimization technique to simple concept — how reducing bit precision makes models smaller and faster, and why calibration matters more than the math.
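The "reducing bit precision" idea can be sketched in a few lines of symmetric int8 quantization (one illustrative scheme among several; a real pipeline adds calibration and per-channel scales):

```python
def quantize_int8(values):
    # Symmetric int8 quantization: map the float range
    # [-max_abs, max_abs] onto integers [-127, 127] via one scale.
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; the rounding error left over is
    # the quantization noise that calibration tries to minimize.
    return [x * scale for x in q]

weights = [0.42, -1.3, 0.07, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each stored value shrinks from 32 bits to 8, and the reconstruction error is bounded by half the scale — which is why picking the right range (calibration) matters more than the arithmetic itself.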
A practical inference benchmark comparing DistilBERT performance on CPU vs GPU — measuring latency, throughput, and memory across different batch sizes to understand what actually happens in production.
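The measurement pattern behind a benchmark like this is easy to sketch (a generic harness with a stand-in workload, not the post's actual script): warm up first, then time repeated runs and derive latency per batch and items per second.

```python
import time

def benchmark(fn, batch, runs=50, warmup=5):
    # Warm up so one-time costs (allocation, caching, lazy init)
    # don't pollute the timed runs.
    for _ in range(warmup):
        fn(batch)
    start = time.perf_counter()
    for _ in range(runs):
        fn(batch)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / runs * 1000          # time per batch
    throughput = runs * len(batch) / elapsed    # items per second
    return latency_ms, throughput

# Stand-in workload; a real benchmark would call the model's forward pass.
latency, throughput = benchmark(lambda b: [x * x for x in b], list(range(1000)))
```

Sweeping the batch size through a harness like this is what reveals the CPU/GPU crossover: GPUs amortize launch overhead across large batches, so throughput climbs with batch size while per-batch latency grows.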
Learning machine learning alone in a Tier-3 city without mentors, bootcamps, or a tech ecosystem — why constraints became advantages and how building in public taught me more than any course.