
Sunday, August 10, 2025

🚀 How to Stop Kafka Lag: Root Causes, Best Practices, and Prevention Strategies

 

Why Kafka Lag Matters

Apache Kafka is the backbone for many high-scale systems — powering payments, order tracking, fraud detection, and event-driven microservices.

But when Kafka lag creeps in, your real-time system becomes near-real-time, which can lead to:

  • Delayed payments or settlement

  • Missed SLA agreements

  • Data processing backlogs

  • Increased infrastructure cost from retries

In financial or mission-critical domains, lag is not just a performance issue — it’s a business risk.


What is Kafka Lag?

Kafka lag is the difference between the latest message offset in a partition and the last committed offset by a consumer group.

Example:

  • Partition offset head: 1000

  • Last committed offset: 800

  • Lag = 200 messages

If your lag keeps growing instead of shrinking, you’re in trouble.
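To check this in practice, below is a rough sketch using Kafka's Java AdminClient to compute per-partition lag exactly as defined above (log-end offset minus committed offset). The bootstrap server and consumer group name are placeholders, and error handling is omitted:

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Last committed offsets for the consumer group (placeholder group name)
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("payments-consumer-group")
                         .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                    admin.listOffsets(request).all().get();

            // Lag = log-end offset - committed offset, per partition
            committed.forEach((tp, meta) -> {
                long lag = endOffsets.get(tp).offset() - meta.offset();
                System.out.println(tp + " lag=" + lag);
            });
        }
    }
}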


Root Causes of Kafka Lag

Through real-world experience with large-scale, payment-heavy systems, I’ve seen the same lag patterns appear:

  1. Slow Consumer Processing

    • Heavy DB calls or synchronous API calls inside the consumer loop.

  2. Insufficient Parallelism

    • Too few consumers for the number of partitions.

  3. Hot Partitions

    • Poor key distribution causing one partition to carry most traffic.

  4. Broker Bottlenecks

    • Disk or network saturation on Kafka brokers.

  5. Large Message Sizes

    • Serialization/deserialization overhead impacting poll rates.

  6. Consumer Group Rebalancing

    • Frequent membership changes causing pauses in consumption.


Best Practices to Prevent Kafka Lag

1. Optimize Consumer Throughput

  • Keep business logic light — push heavy processing to async workers.

  • Batch process records with max.poll.records.

  • Commit offsets frequently to avoid replay storms.
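A minimal sketch combining these three points is shown below: the poll loop stays lightweight, heavy enrichment is handed to a worker pool, and offsets are committed after every batch. The broker address, topic, and group name are placeholders:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LightweightConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-consumer-group");  // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);                // batch per poll
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);            // commit explicitly

        ExecutorService heavyWork = Executors.newFixedThreadPool(8);           // async workers

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));                           // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Keep the consumer thread light: only validate/route here,
                    // push DB lookups and partner API calls to the worker pool.
                    heavyWork.submit(() -> enrichAndStore(record.value()));
                }
                // Commit after each batch so a restart replays as little as possible.
                // Note: the async step must be durable or idempotent on its own.
                consumer.commitSync();
            }
        }
    }

    // Hypothetical downstream processing; stands in for DB/API-heavy work.
    static void enrichAndStore(String payload) { /* ... */ }
}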

2. Scale Consumers Effectively

  • Number of consumers should match or be less than partition count.

  • Use consumer group scaling during traffic peaks.

3. Fix Partition Skew

  • Review key hashing logic.

  • If hot partitions exist, consider re-keying or adding partitions.
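One way to spot skew before it bites is to replay a sample of real keys through the same hash Kafka's default partitioner applies to keyed records (murmur2 modulo partition count). A rough sketch with made-up keys follows; in practice you would feed it keys sampled from production traffic:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.utils.Utils;

public class KeySkewCheck {
    public static void main(String[] args) {
        int numPartitions = 12;
        // Hypothetical keys; a heavily repeated key lands on the same partition every time.
        List<String> keys = List.of("city-BLR", "city-BLR", "city-BLR", "city-DEL", "city-HYD");

        Map<Integer, Integer> counts = new HashMap<>();
        for (String key : keys) {
            byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
            // Same formula Kafka's default partitioner uses for keyed records.
            int partition = Utils.toPositive(Utils.murmur2(bytes)) % numPartitions;
            counts.merge(partition, 1, Integer::sum);
        }
        counts.forEach((p, c) -> System.out.println("partition " + p + " -> " + c + " records"));
    }
}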

4. Tune Consumer Configurations

Key configs to watch:

max.poll.records=500

max.poll.interval.ms=300000

fetch.min.bytes=50000

fetch.max.wait.ms=500

  • Tune based on throughput vs. latency trade-offs.
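In the Java consumer these map to ConsumerConfig constants; a small sketch is shown below (the values are starting points from the list above, not universal recommendations):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class TunedConsumerProps {
    static Properties tuned() {
        Properties props = new Properties();
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);        // records returned per poll()
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000); // max gap between polls before a rebalance
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 50000);       // wait for larger batches...
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);       // ...but no longer than 500 ms
        return props;
    }
}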

5. Monitor Lag Proactively

  • Use Prometheus JMX Exporter or Burrow for lag metrics.

  • Alert when lag exceeds business-defined thresholds.

6. Handle Third-Party Dependencies

  • For load tests, use mock gateways to avoid hitting real partner APIs.

  • Apply circuit breakers to isolate external failures.
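For the circuit-breaker point, a minimal sketch using Resilience4j is shown below; the gateway call, names, and thresholds are illustrative assumptions, not a prescribed setup:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

public class PartnerGatewayClient {
    private final CircuitBreaker breaker = CircuitBreaker.of("partnerGateway",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)                        // open after 50% of calls fail
                    .waitDurationInOpenState(Duration.ofSeconds(30)) // probe the partner again after 30s
                    .build());

    String charge(String payload) {
        // While the breaker is open, calls fail fast instead of piling up in the consumer.
        return breaker.executeSupplier(() -> callPartnerApi(payload));
    }

    // Hypothetical external call; replace with your real gateway client.
    String callPartnerApi(String payload) { return "OK"; }
}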


Case Study: Reducing Lag at Scale

While working at Rapido, our trip location tracking service faced ~2M message lag during evening peak hours.

Root cause: Consumers were enriching each message with DB lookups.

Solution:

  • Offloaded enrichment to a downstream async process.

  • Increased partitions from 6 → 18.

  • Tuned max.poll.records from 50 → 500.

    Result: Lag dropped from 2M to under 5K during peak.


Checklist for Kafka Lag Prevention

  • Keep consumer logic lightweight

  • Scale consumer groups with partitions

  • Fix partition key distribution

  • Tune consumer configurations

  • Batch process where possible

  • Monitor lag continuously

  • Mock external dependencies during load

  • Test with production-like data in staging


Final Thoughts

Kafka lag is inevitable under certain conditions — but chronic lag is a design flaw.

By combining a good partition strategy, optimized consumers, and proactive monitoring, you can maintain near-real-time processing even at massive scale.
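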


If you’re building a Kafka-heavy system, remember:

Lag prevention is a design decision, not a firefight.


Happy Learning :) 

Friday, May 6, 2022

Kafka - Core Concepts

Let's discuss the common terms used in Kafka and their roles in a distributed architecture.

Producer

  • An application that sends messages to Kafka
Message
  • A small to medium-sized piece of data
Consumer
  • An application that reads data from Kafka
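To make the producer and message terms concrete, here is a minimal Java producer sketch; the broker address, topic name, and record contents are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class HelloProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A "message" is simply a key/value record sent to a topic.
            producer.send(new ProducerRecord<>("orders", "order-1", "created")); // placeholder topic
        }
    }
}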

Cluster
  • A group of computers sharing workload for a common purpose
Topic
  • A topic is a unique name for a Kafka stream

Partition
  • Kafka topics are divided into several partitions. While the topic is a logical concept in Kafka, a partition is the smallest storage unit that holds a subset of records owned by a topic. Each partition is a single log file where records are written in an append-only fashion.



Offset
  • A sequence id given to messages as they arrive in a partition

Globally unique identifier of a message
  • Topic Name -> Partition Number -> Offset
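In the Java consumer API every record carries exactly this triple; a tiny illustrative helper (the class and method names here are made up):

import org.apache.kafka.clients.consumer.ConsumerRecord;

class RecordCoordinates {
    // The (topic, partition, offset) triple uniquely identifies a record in a cluster.
    static String coordinatesOf(ConsumerRecord<String, String> record) {
        return record.topic() + "-" + record.partition() + "-" + record.offset();
    }
}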
Consumer Group
  • A group of consumers acting as a single logical unit



Can multiple Kafka consumers read the same message from a partition?

  • It depends on the group ID. Suppose you have a topic with 12 partitions. If you have 2 Kafka consumers with the same group ID, each will read 6 partitions, meaning they read different sets of partitions and therefore different sets of messages. If you have 4 Kafka consumers with the same group ID, each will read 3 different partitions, and so on.
  • But when you set different group IDs, the situation changes. If you have two Kafka consumers with different group IDs, each will read all 12 partitions without any interference between them, meaning both consumers independently read the exact same set of messages. If you have four Kafka consumers with different group IDs, they will all read all partitions, and so on.

Within same group: NO

  • Two consumers (Consumer 1 and Consumer 2) within the same group (Group 1) CANNOT consume the same message from a partition (Partition 0).

Across different groups: YES

  • Two consumers in two different groups (Consumer 1 from Group 1 and Consumer 1 from Group 2) CAN consume the same message from a partition (Partition 0).
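The only difference between the two scenarios above is the group.id each consumer starts with; a minimal sketch (broker address, topic, and group names are placeholders):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupIdDemo {
    static KafkaConsumer<String, String> newConsumer(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("rides")); // placeholder topic with 12 partitions
        return consumer;
    }

    public static void main(String[] args) {
        // Same group.id: the 12 partitions are split between these two consumers.
        KafkaConsumer<String, String> a1 = newConsumer("group-1");
        KafkaConsumer<String, String> a2 = newConsumer("group-1");

        // Different group.id: this consumer independently reads all 12 partitions again.
        KafkaConsumer<String, String> b1 = newConsumer("group-2");

        // poll(...) on each consumer drives its group membership and fetching.
        a1.poll(Duration.ofMillis(100));
        a2.poll(Duration.ofMillis(100));
        b1.poll(Duration.ofMillis(100));
    }
}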
