
Apache Kafka® is a distributed event streaming platform designed for high throughput and fault tolerance. At the heart of its durability and scalability lies the concept of partition replication—a mechanism that ensures data remains available and consistent even in the face of broker failures. In this article, we’ll explore how partition replication works in Kafka, why it's critical for maintaining system reliability, and what can go wrong if it's not configured properly. Whether you're new to Kafka or looking to deepen your understanding, this guide will help clarify the role of replication in building robust data pipelines.
When you create a new topic, the most important configuration to consider is the replication factor to use. Replication factor in Apache Kafka refers to the number of copies (replicas) of each partition that are maintained across different brokers in a cluster. If you create a topic with the replication factor of three, Kafka will create three replicas - one leader and two followers - as can be seen in the picture below. In this example we only show one partition for the sake of simplicity, but typically your topics would have more than one partition.
Another important configuration to consider during topic creation is the min.insync.replicas setting. This is a setting in Apache Kafka that defines the minimum number of replicas that must be in-sync for a producer to successfully write a message. If the number of in-sync replicas drops below this value, the producer will receive an error (NotEnoughReplicasException or NotEnoughReplicasAfterAppendException), and the write will be rejected (if the producer’s acks setting is 'all' or -1).
Typically the value of min.insync.replicas is set to replication factor - 1, i.e. if replication factor is 3 then min.insync.replicas should be 2.
A partition is unavailable if it has no leader. This can be due to broker failures, network issues, or replica synchronization problems. Usually a new leader is elected quickly, but in some cases the election process can take a long time or cannot be completed at all (for example if there are no in-sync replicas). When a partition is unavailable, Kafka cannot handle reads or writes for that partition, is it is imperative that the issue gets resolved promptly.
In the picture below, broker 1 has gone offline so the partition has no leader. If broker 2 or 3 cannot be elected to be the new leader, the partition will remain unavailable until the issue is resolved.
An under-replicated partition in Kafka is a partition where not all replicas are in sync with the leader — meaning the number of in-sync replicas (ISR) is less than the total number of assigned replicas. An under-replicated partition occurs when some follower(s) fall behind or go offline.
The picture below shows broker 3 going offline, so the partition is under-replicated as one of the replicas is on the offline broker. Kafka will serve reads and writes for the partition, as there are two replicas that are still in sync and the min.insync.replica setting is set to two.
An "under minimum ISR partition" is a Kafka partition where the number of in-sync replicas (ISR) has dropped below the configured min.insync.replicas setting. This means that Kafka will reject writes to this partition if acks=all is used in the producer. Below you can find an explanation for the acks producer setting.
The picture below shows brokers 2 and 3 going offline, so the partition now has just one in-sync replica (the leader). In other words it has less than min.insync.replica in sync partitions, so the writes will be rejected. Reads may still work so the condition might be harder to detect than an unavailable partition. Fortunately you can create Alerts for this metric in gradient fox to make sure you will be notified immediately when this happens.