CentralMesh.io

Kafka Fundamentals for Beginners

4.7 In-Sync Replicas and Acknowledgements

Ensuring data reliability with ISR and acknowledgement modes.



Overview

ISR and acknowledgement modes are crucial for data reliability in Kafka's distributed system.

What is ISR?

In-Sync Replicas (ISR) is the set of replicas that are fully caught up with the leader replica; it always includes the leader itself.

Purpose

  • Ensures data consistency
  • Provides durability
  • Enables fault tolerance
  • Maintains data safety during failures

Example

For a topic partition with replication factor 3:

  • ISR typically includes: Leader + 2 followers
  • All replicas fully synchronized
  • Redundancy built into the system

Scenario: Leader Failure

Initial State

Payment Topic (Replication Factor 3):

  • Broker 1: Partition 0 Leader
  • Broker 2: Partition 0 Replica 1
  • Broker 3: Partition 0 Replica 2
  • ISR: [Leader, Replica_1, Replica_2]

Producer sends data → Leader → Replicates to both replicas

When Leader Crashes

  1. Failure Detected: Kafka identifies leader is down
  2. Leader Election: Replica_1 elected as new leader
  3. ISR Updated: ISR = [Replica_1 (new leader), Replica_2]
  4. Traffic Redirected: Producer sends data to Replica_1
  5. Replication Continues: Replica_1 replicates to Replica_2

    Result: Seamless transition with minimal downtime
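The failover steps above can be sketched as a small simulation. This is a toy model in plain Python, not the Kafka API; the `Partition` class and broker names are illustrative:

```python
# Minimal model of Kafka's leader failover (illustrative, not the real broker logic).
class Partition:
    def __init__(self, replicas):
        self.isr = list(replicas)  # in-sync replicas; index 0 acts as the leader

    @property
    def leader(self):
        return self.isr[0]

    def broker_failed(self, broker):
        # Remove the failed broker from the ISR; if it was the leader,
        # the next in-sync replica is elected as the new leader.
        self.isr = [r for r in self.isr if r != broker]
        if not self.isr:
            raise RuntimeError("no in-sync replica left; partition is offline")
        return self.leader


p = Partition(["broker-1", "broker-2", "broker-3"])
print(p.leader)                     # broker-1
print(p.broker_failed("broker-1"))  # broker-2 elected as new leader
print(p.isr)                        # ['broker-2', 'broker-3']
```

Note that election only ever picks from the ISR: a replica that was lagging at crash time cannot become leader (without enabling unclean leader election), which is what protects acknowledged data.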

Handling Partition Lag

Scenario

Initial ISR: [Leader, Replica_1, Replica_2]

Why Replicas Fall Behind

Network Latency:

  • Congestion between brokers
  • Temporary slowdown in communication
  • Packet loss

Resource Contention:

  • High CPU usage
  • Disk I/O saturation
  • Memory pressure

Detection Mechanism

Kafka monitors two key metrics:

  1. Time-based: replica.lag.time.max.ms

    - A follower must have fetched up to the leader's log end within this timeout

    - Default: 30 seconds in Kafka 2.5+ (10 seconds in earlier versions)

  2. Offset-based: replica.lag.max.messages (older versions only)

    - Compared the replica's offset to the leader's offset

    - Removed in Kafka 0.9; modern brokers rely on the time-based check alone

ISR Update Process

  1. Lag Detected: Replica_2 falls behind
  2. ISR Updated: ISR = [Leader, Replica_1] (Replica_2 removed)
  3. Issue Resolved: Network/resource usage stabilizes
  4. Catch-Up: Replica_2 syncs with leader
  5. Rejoin ISR: Replica_2 added back to ISR

    This dynamic mechanism maintains resilience during temporary disruptions.
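The time-based check can be sketched as a pure-Python toy model. The function name, timestamps, and threshold are illustrative; this is not the broker implementation:

```python
# Toy model of time-based ISR lag detection (replica.lag.time.max.ms).
LAG_MAX_MS = 10_000  # illustrative threshold: 10 seconds

def update_isr(isr, last_caught_up_ms, now_ms, leader):
    """Drop followers that haven't caught up within LAG_MAX_MS; the leader always stays."""
    return [r for r in isr
            if r == leader or now_ms - last_caught_up_ms[r] <= LAG_MAX_MS]


isr = ["leader", "replica_1", "replica_2"]
# Last time each replica was fully caught up (ms timestamps, illustrative):
caught_up = {"leader": 0, "replica_1": 20_000, "replica_2": 5_000}

isr = update_isr(isr, caught_up, now_ms=25_000, leader="leader")
print(isr)  # ['leader', 'replica_1'] -- replica_2 lagged 20 s > 10 s and is removed
```

Once replica_2 catches up again, the broker adds it back, mirroring steps 4 and 5 above.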

Adding a New Replica

Scenario

Current State:

  • ISR: [Leader, Replica_1]
  • Missing: One replica due to network issue
  • Desired: Restore replication factor to 3

Process

  1. Add Broker: Introduce new broker to cluster
  2. Assign Replica: Replica_3 created on new broker
  3. Initial Sync: Replica_3 starts replicating from leader
  4. Catch-Up: Replicates all data to match leader
  5. Full Sync: Replica_3 fully synchronized
  6. ISR Update: ISR = [Leader, Replica_1, Replica_3]

    Result: Replication factor restored, redundancy complete
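Operationally, this is usually done with the stock kafka-reassign-partitions.sh tool and a reassignment file. A hedged sketch follows; the topic name "payments" and broker ids 1, 2, 4 are assumptions for illustration:

```json
{
  "version": 1,
  "partitions": [
    { "topic": "payments", "partition": 0, "replicas": [1, 2, 4] }
  ]
}
```

Saved as, say, reassign.json, this is applied with `kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file reassign.json --execute`; the new replica on broker 4 then goes through exactly the catch-up steps listed above before joining the ISR.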

Acknowledgement Modes

acks=0 (No Acknowledgement)

Behavior: Producer doesn't wait for broker confirmation

Performance: Fastest

Reliability: Lowest

Use Case: Non-critical logs, metrics where some loss is acceptable

Risk: Message loss if broker fails

acks=1 (Leader Acknowledgement)

Behavior: Producer waits for leader confirmation only

Performance: Moderate

Reliability: Moderate

Use Case: Balanced performance and reliability

Risk: Data loss if leader fails before replication

acks=all or -1 (All ISR Acknowledgement)

Behavior: Producer waits for all ISR members to confirm

Performance: Slowest

Reliability: Highest

Use Case: Critical data (financial transactions, user data)

Risk: Higher latency; acknowledged messages are not lost as long as at least one in-sync replica survives (pair with min.insync.replicas ≥ 2 for a real guarantee)
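The three modes can be summarized as a small decision function. This is a pure-Python model of the semantics, not the client library or wire protocol, and the parameter names are illustrative:

```python
# Toy model of when a produce request counts as successful under each acks mode.
def produce_succeeds(acks, leader_wrote, isr_acks, isr_size, min_insync=2):
    if acks == 0:
        return True                 # fire-and-forget: never waits, so "success" is immediate
    if acks == 1:
        return leader_wrote         # the leader's local write is enough
    if acks in ("all", -1):
        # Needs every current ISR member, and the ISR itself must satisfy
        # min.insync.replicas, otherwise the broker rejects the write.
        return isr_size >= min_insync and isr_acks == isr_size
    raise ValueError("unknown acks mode")


print(produce_succeeds(0, leader_wrote=False, isr_acks=0, isr_size=3))    # True (loss risk!)
print(produce_succeeds(1, leader_wrote=True, isr_acks=1, isr_size=3))     # True
print(produce_succeeds("all", leader_wrote=True, isr_acks=2, isr_size=3)) # False: one ISR member missing
```

The first line is the acks=0 risk in miniature: the producer reports success even though nothing was durably written.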

Configuration

Producer Configuration

properties
acks=all

Note: min.insync.replicas is a broker- or topic-level setting (see Broker Configuration below), not a producer property.

Broker Configuration

properties
replica.lag.time.max.ms=10000
min.insync.replicas=2
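The interplay between these two settings determines fault tolerance: with acks=all, the broker rejects writes (NotEnoughReplicasException) once the ISR shrinks below min.insync.replicas. A quick arithmetic check, as a pure-Python sketch:

```python
# With acks=all, writes are accepted only while |ISR| >= min.insync.replicas.
def writes_accepted(isr_size, min_insync):
    return isr_size >= min_insync

# replication.factor=3, min.insync.replicas=2:
print(writes_accepted(3, 2))  # True  - all replicas in sync
print(writes_accepted(2, 2))  # True  - tolerates one broker down
print(writes_accepted(1, 2))  # False - producer receives NotEnoughReplicasException
```

This is why replication factor 3 with min.insync.replicas=2 is a common production baseline: it survives one broker failure without losing either availability or the acks=all durability guarantee.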

Best Practices

ISR Management

  • Monitor ISR size
  • Alert on ISR shrinkage
  • Investigate lag causes
  • Maintain adequate replication

Acknowledgement Strategy

  • Use acks=all for critical data
  • Set min.insync.replicas ≥ 2
  • Balance latency vs reliability
  • Monitor producer metrics

Replication

  • Replication factor ≥ 3 for production
  • Spread replicas across availability zones
  • Monitor replica lag
  • Plan for broker failures

Monitoring

Key Metrics

  • ISR size per partition
  • Replica lag (time and offset)
  • Under-replicated partitions
  • Producer acknowledgement latency
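Under-replicated partitions can be spot-checked with the stock Kafka CLI; the bootstrap address is an assumption for your cluster:

```shell
# List partitions whose ISR is smaller than the replication factor.
kafka-topics.sh --describe \
  --under-replicated-partitions \
  --bootstrap-server localhost:9092
```

An empty result is the healthy state; any output here is a sign of the lag scenarios described above.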

Alerts

  • ISR shrinkage
  • Replica lag exceeding threshold
  • Under-replicated partitions
  • Producer errors

Summary

ISR and acknowledgements provide:

  • Data durability: Through replication
  • Fault tolerance: Automatic failover
  • Flexibility: Choose reliability vs performance
  • Resilience: Dynamic ISR management

Understanding these concepts is essential for building reliable Kafka-based systems that meet your data consistency and availability requirements.