Heartbeat Mechanism — The Simplest Idea Behind Every Reliable System
A raw, practical breakdown of the heartbeat mechanism — what it is, how it works, and how I implemented it during my internship at AI Planet this week.
This week at my internship at AI Planet, I had to implement a heartbeat mechanism for a service we're building.
I'd heard the term before. Nodded along when people mentioned it. Never actually sat down and understood what it was doing under the hood.
So I did. And like most good ideas in engineering — it's embarrassingly simple once you see it.
What Is a Heartbeat?
A heartbeat is a small, regular signal one service sends to another that says: "I'm still alive."
That's it. A tiny ping. Every few seconds. Over and over.
As long as those pings keep arriving, the receiver knows the sender is healthy. The moment they stop — the receiver assumes something went wrong and takes action.
Same idea as a literal heartbeat. A doctor doesn't need to poke you every few seconds to check whether you're alive. As long as the monitor is beeping, you're fine. The moment it flatlines — that's the signal.
Why Does This Even Need to Exist?
Because services fail silently all the time.
A server crashes. A network drops. A process freezes without throwing an error. The tricky part isn't the failure itself — it's that everything around the failed service keeps working, keeps routing requests to it, keeps expecting responses that will never come. Users see broken pages. Data gets lost. Nobody knows why.
The heartbeat creates one simple rule: "If I don't hear from you within X seconds, I assume you're dead and I act accordingly."
No ambiguity. No waiting around. Just a clear timeout and a defined response.
How It Works
Three moving parts:
Sender — the service that's running. Emits a small signal every N seconds. Could be as simple as an HTTP ping to an endpoint or a message to a queue.
Receiver — listens for those signals and tracks when the last one arrived.
Timeout — the threshold. If no heartbeat arrives within, say, 30 seconds, the receiver marks the sender as dead and triggers whatever recovery logic is defined.
Sender: ping → ping → ping → ping → [crash]
Receiver: ✓ ✓ ✓ ✓ ... 30s ... → DEAD → take action
The receiver doesn't need to understand why the sender stopped. It just knows it did, and that's enough to act.
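The three moving parts above can be sketched in a few lines of Python. This is a minimal, hypothetical sketch — the class and names (`HeartbeatReceiver`, `TIMEOUT`) are mine for illustration, not from any particular library — using explicit timestamps so the timeout logic is easy to see:

```python
import time

TIMEOUT = 30  # seconds of silence before the sender is presumed dead

class HeartbeatReceiver:
    def __init__(self, timeout=TIMEOUT):
        self.timeout = timeout
        self.last_seen = None  # timestamp of the most recent ping

    def heartbeat(self, now=None):
        # Called whenever a ping arrives; just record when it happened.
        self.last_seen = now if now is not None else time.monotonic()

    def is_alive(self, now=None):
        # The receiver doesn't ask *why* pings stopped, only *whether* they did.
        if self.last_seen is None:
            return False
        now = now if now is not None else time.monotonic()
        return (now - self.last_seen) <= self.timeout

receiver = HeartbeatReceiver(timeout=30)
receiver.heartbeat(now=0)         # ping at t=0
print(receiver.is_alive(now=10))  # True  — within the 30s window
print(receiver.is_alive(now=45))  # False — flatlined, time to act
```

In a real system the sender would be a separate process pinging over HTTP or a queue, and the receiver would check `is_alive()` on a schedule — but the core state is just one timestamp and one comparison.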
What I Actually Built at AI Planet
At AI Planet this week, I was working on a service that needed to stay reliably online and report its health to the rest of the system. If it went down, other parts of the pipeline needed to know immediately — not after a user reported an issue.
What I implemented:
- The service emits a heartbeat every 10 seconds to a central health endpoint
- The receiver tracks the last seen timestamp per service ID
- If a service misses 3 consecutive heartbeats (30 seconds), it's marked unhealthy
- An alert fires and the service gets flagged for restart
The "3 missed in a row" rule is important. One missed heartbeat is usually a network hiccup — not a real failure. Two is suspicious. Three means something is actually wrong. Waiting for confirmation before acting avoids false alarms that cause unnecessary restarts.
10s: heartbeat received ✓
20s: heartbeat received ✓
30s: heartbeat received ✓
40s: missed — wait
50s: missed — wait
60s: missed — 3 strikes → mark unhealthy → alert
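The "3 strikes" rule from the timeline can be captured with a small counter. This is an illustrative sketch — names like `StrikeTracker` and `MAX_MISSES` are my own, not from our actual codebase — where a heartbeat resets the strikes and each missed check interval increments them:

```python
MAX_MISSES = 3  # misses in a row before we declare the service unhealthy

class StrikeTracker:
    def __init__(self, max_misses=MAX_MISSES):
        self.max_misses = max_misses
        self.misses = 0
        self.healthy = True

    def on_heartbeat(self):
        self.misses = 0           # any pulse clears the strikes
        self.healthy = True

    def on_missed_interval(self):
        self.misses += 1          # one miss is a hiccup, not a failure
        if self.misses >= self.max_misses:
            self.healthy = False  # 3 strikes: mark unhealthy, fire the alert

tracker = StrikeTracker()
tracker.on_heartbeat()        # 30s: heartbeat received ✓
tracker.on_missed_interval()  # 40s: miss 1 — wait
tracker.on_missed_interval()  # 50s: miss 2 — wait
print(tracker.healthy)        # True — still waiting for confirmation
tracker.on_missed_interval()  # 60s: miss 3
print(tracker.healthy)        # False — mark unhealthy, alert
```

The key design choice is that the counter only tracks *consecutive* misses: a single successful heartbeat wipes the slate clean, which is exactly what keeps network blips from triggering restarts.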
This clicked late for me — the timeout isn't just "how long until we panic." It's a deliberate design decision balancing detection speed vs false alarm rate. Too short and you're restarting healthy services over network blips. Too long and a real failure goes unnoticed for minutes.
What a Heartbeat Can Carry
A basic heartbeat just says "I'm alive." But you can pack more into it:
{
  "service_id": "inference-worker-3",
  "timestamp": "2026-04-17T10:42:00Z",
  "status": "healthy",
  "cpu_usage": 43,
  "memory_usage": 61,
  "version": "1.2.4"
}
Now the receiver doesn't just know the service is alive — it knows if it's stressed, if it's running the right version, and exactly when it last checked in. The heartbeat becomes a mini health report on every pulse.
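On the receiving side, those extra fields turn into simple checks. A hedged sketch — the helper name `assess` and the 80% CPU threshold are my own illustrative choices, though the field names match the payload above:

```python
def assess(payload, cpu_limit=80, expected_version="1.2.4"):
    """Return a list of warnings derived from one heartbeat payload."""
    warnings = []
    if payload["status"] != "healthy":
        warnings.append("service reports unhealthy status")
    if payload["cpu_usage"] > cpu_limit:
        warnings.append(f"cpu at {payload['cpu_usage']}% exceeds {cpu_limit}%")
    if payload["version"] != expected_version:
        warnings.append(f"running {payload['version']}, expected {expected_version}")
    return warnings

pulse = {
    "service_id": "inference-worker-3",
    "timestamp": "2026-04-17T10:42:00Z",
    "status": "healthy",
    "cpu_usage": 43,
    "memory_usage": 61,
    "version": "1.2.4",
}
print(assess(pulse))  # [] — alive, unstressed, and on the right version
```

An empty list means "alive and well"; anything else gives the receiver a head start on diagnosing trouble before the service goes fully silent.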
Where You've Already Seen This
You interact with heartbeat systems constantly without realizing it:
- WhatsApp / Zoom — when your phone stops sending heartbeats, the app shows you as offline
- Online games — missed heartbeats = "lost connection to server"
- Kubernetes — liveness probes are literally heartbeats; fail them and the pod gets killed and replaced
- Load balancers — a server stops responding → traffic is rerouted to healthy instances automatically
Every system that stays reliably online has something like this running quietly in the background.
The Tricky Parts Nobody Mentions
Too frequent = wasteful. Every 500ms floods the network and drains mobile batteries. 5–30 seconds is the practical range for most services.
Too infrequent = risky. A 5-minute heartbeat interval means a crashed service could go undetected for up to 5 minutes. In a production pipeline, that's a lot of damage.
Split-brain problem. Sometimes the network between two services breaks, but both services are actually fine. Each thinks the other is dead. Both might try to take over each other's responsibilities. This is one of the hardest failure modes in distributed systems — and heartbeats alone don't solve it.
Key Takeaways
- A heartbeat is a regular "I'm alive" signal between services — one of the simplest and most universally used patterns in distributed systems
- The receiver tracks last-seen timestamps and fires recovery logic when the timeout is breached
- Wait for multiple missed heartbeats before acting — single misses are usually network noise
- Heartbeats can carry health data (CPU, memory, version) — turning a ping into a health report
- Timeout duration is a deliberate tradeoff: short = fast detection but more false alarms; long = fewer false alarms but slower response
Implementing this at AI Planet this week was one of those moments where the gap between "I've heard of this" and "I actually understand this" closed completely. The concept is simple. The nuances — timeout tuning, false alarm handling, split-brain edge cases — are where the real engineering judgment lives.
And that's exactly the kind of thing you only learn by building it.
Learning in public from Pune. Building at AI Planet.