Kafka Architecture: Brokers, Topics, Partitions, and Replicas
The Cluster: Brokers and the Controller
A Kafka cluster is a group of servers, each called a broker. Brokers store data and serve producer and consumer requests. One broker in the cluster acts as the controller — it manages cluster metadata, tracks broker joins and departures, and assigns (and reassigns) partition leadership when the cluster changes.
In KRaft mode (Kafka 3.3+, the default from Kafka 4.0), the controller is built into Kafka itself — no ZooKeeper needed.
flowchart TB
subgraph Cluster["Kafka Cluster (KRaft mode)"]
direction TB
B1["Broker 1\n⭐ Controller\n(manages metadata)"]
B2["Broker 2\n(data broker)"]
B3["Broker 3\n(data broker)"]
B1 <-->|Raft consensus| B2
B1 <-->|Raft consensus| B3
B2 <-->|Raft consensus| B3
end
Producer["Producer"] -->|send records| B1
Producer -->|send records| B2
Consumer["Consumer"] -->|fetch records| B2
Consumer -->|fetch records| B3
Each broker is identified by a numeric node.id (e.g. 1, 2, 3). Producers and consumers bootstrap by connecting to any broker, fetch the cluster metadata from it, and then send each request directly to the broker that leads the relevant partition; the client library handles this routing automatically.
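You can see the broker IDs and the current controller from any client via the Admin API. A minimal sketch, assuming a locally reachable broker at localhost:9092 (a placeholder address):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

import java.util.Properties;

public class DescribeCluster {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Any reachable broker works as a bootstrap server (placeholder address)
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            for (Node node : cluster.nodes().get()) {
                System.out.println("Broker node.id=" + node.id()
                        + " at " + node.host() + ":" + node.port());
            }
            // Controller as reported by the broker that answered the request
            System.out.println("Controller node.id=" + cluster.controller().get().id());
        }
    }
}
```

Whichever bootstrap broker answers, the response lists every broker in the cluster, so the address you start with does not need to be the controller.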
Topics: Named Event Streams
A topic is a named, append-only log. Think of it as a category or feed: orders, payments, inventory-updates. Producers write to topics; consumers read from them.
flowchart LR
OrderSvc["Order Service\n(Producer)"]
InvSvc["Inventory Service\n(Consumer)"]
PaySvc["Payment Service\n(Consumer)"]
NotifSvc["Notification Service\n(Consumer)"]
OrderSvc -->|publish| orders[/"Topic: orders"/]
orders --> InvSvc
orders --> PaySvc
orders --> NotifSvc
Topics are not tied to a single broker. Their data is spread across multiple brokers via partitions.
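Topics are usually created with the kafka-topics.sh CLI or the Admin API. A minimal sketch of the latter, assuming the same placeholder broker address and this article's orders topic (partition count and replication factor are explained in the next sections):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // 3 partitions spread across brokers, each partition replicated to 3 brokers
            NewTopic orders = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```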
Partitions: The Unit of Parallelism
Every topic is divided into partitions. A partition is an ordered, immutable sequence of records. Each record in a partition has a unique sequential number called an offset.
flowchart LR
subgraph P0["Partition 0 — Broker 1 (Leader)"]
direction LR
R0["offset 0\ncust-1\nOrderPlaced"] --> R1["offset 1\ncust-3\nOrderPlaced"] --> R2["offset 2\ncust-1\nOrderCancelled"] --> R3["offset 3\n..."] --> HEAD0["← HEAD"]
end
subgraph P1["Partition 1 — Broker 2 (Leader)"]
direction LR
S0["offset 0\ncust-2"] --> S1["offset 1\ncust-4"] --> S2["offset 2\ncust-2"] --> S3["..."] --> HEAD1["← HEAD"]
end
subgraph P2["Partition 2 — Broker 3 (Leader)"]
direction LR
T0["offset 0\ncust-5"] --> T1["offset 1\ncust-6"] --> T2["offset 2\n..."] --> HEAD2["← HEAD"]
end
Key properties of partitions:
- Ordering is guaranteed within a partition — offset 3 always comes after offset 2 on Partition 0
- Ordering is NOT guaranteed across partitions — a record on Partition 0 and a record on Partition 1 have no relative ordering guarantee
- Each partition has exactly one leader broker at a time; all writes for that partition go through the leader (replica copies on other brokers are covered under Replicas below)
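Offsets are easiest to see from the consumer side: every record reports the partition it came from and its offset within that partition. A minimal sketch, assuming the placeholder broker address, the orders topic, and an example group id:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OffsetPrinter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-printer");          // example group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    // Offsets increase monotonically within a partition, not across partitions
                    System.out.printf("partition=%d offset=%d key=%s%n",
                            r.partition(), r.offset(), r.key());
                }
            }
        }
    }
}
```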
How Records Are Routed to Partitions
flowchart TD
Record["Record\nkey: cust-42\nvalue: OrderPlaced"]
HasKey{Has key?}
HashKey["hash(key) % numPartitions\n→ always same partition\nfor same key"]
RoundRobin["Round-robin across partitions\n(or sticky batch)"]
P0["Partition 0"]
P1["Partition 1"]
P2["Partition 2"]
Record --> HasKey
HasKey -->|Yes| HashKey
HasKey -->|No| RoundRobin
HashKey -->|"cust-42 → P1"| P1
RoundRobin --> P0
RoundRobin --> P2
Using a key (e.g. customerId or orderId) ensures all events for the same entity land on the same partition — preserving per-entity ordering. This is the primary reason to use message keys.
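A minimal producer sketch of keyed writes, assuming the same placeholder broker address and the orders topic: every send uses the key cust-42, so each callback reports the same partition (the default partitioner hashes the key bytes with murmur2).

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition, so these three events stay in order for cust-42
            for (String event : new String[]{"OrderPlaced", "OrderPaid", "OrderShipped"}) {
                ProducerRecord<String, String> record = new ProducerRecord<>("orders", "cust-42", event);
                producer.send(record, (RecordMetadata meta, Exception e) -> {
                    if (e == null) {
                        System.out.printf("key=cust-42 -> partition=%d offset=%d%n",
                                meta.partition(), meta.offset());
                    }
                });
            }
            producer.flush();
        }
    }
}
```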
Replicas: Fault Tolerance
Every partition can have multiple replicas — copies stored on different brokers. One replica is the leader (it handles all writes and, by default, all reads); the others are followers, which continuously replicate from the leader.
flowchart LR
subgraph B1["Broker 1"]
P0L["Partition 0\n★ Leader"]
P1F["Partition 1\n Follower"]
P2F["Partition 2\n Follower"]
end
subgraph B2["Broker 2"]
P0F["Partition 0\n Follower"]
P1L["Partition 1\n★ Leader"]
P2F2["Partition 2\n Follower"]
end
subgraph B3["Broker 3"]
P0F2["Partition 0\n Follower"]
P1F2["Partition 1\n Follower"]
P2L["Partition 2\n★ Leader"]
end
Producer -->|write| P0L
Producer -->|write| P1L
P0L -->|replicate| P0F
P0L -->|replicate| P0F2
P1L -->|replicate| P1F
P1L -->|replicate| P1F2
With replication.factor=3, each partition has 3 copies spread across 3 brokers. The cluster can survive the loss of any 2 brokers and still serve every partition: a surviving replica is promoted to leader for each affected partition.
In-Sync Replicas (ISR)
An In-Sync Replica (ISR) is a follower that is fully caught up with the leader. The leader tracks the ISR set. A write is acknowledged to the producer only after all ISR members have written the record (when acks=all).
sequenceDiagram
participant Producer
participant Leader as Partition 0 Leader\n(Broker 1)
participant F1 as Follower\n(Broker 2, ISR)
participant F2 as Follower\n(Broker 3, ISR)
Producer->>Leader: ProduceRequest (acks=all)
Leader->>Leader: Write to local log
Leader->>F1: Replicate
Leader->>F2: Replicate
F1-->>Leader: Ack
F2-->>Leader: Ack
Leader-->>Producer: ProduceResponse (offset=42)
Note over Leader,F2: Write acknowledged only after\nall ISR members confirm
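On the producer side this behaviour is opted into with acks=all. It is usually paired with the topic-level min.insync.replicas setting (commonly 2 with replication.factor=3), which defines how many ISR members must have the record before an acks=all write succeeds. A minimal config sketch with illustrative values:

```java
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DurableProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Wait until every in-sync replica has the record before the send is acknowledged
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Idempotence avoids duplicate records when retries happen during a failover
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        return props;
    }
}
```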
What Happens When a Broker Fails
When a broker goes down, the controller detects the failure (via missed heartbeats) and elects a new leader for each partition that lost its leader.
sequenceDiagram
participant Controller
participant B2 as Broker 2\n(was follower, now new leader)
participant B3 as Broker 3\n(follower)
participant Producer
participant Consumer
Note over Controller: Broker 1 (old leader) stops heartbeating
Controller->>Controller: Detect broker failure
Controller->>B2: Metadata update: YOU are new leader for Partition 0
Controller->>B3: Metadata update: new leader and ISR for Partition 0
B2->>B2: Become Partition 0 leader
Producer->>B2: ProduceRequest (reconnects automatically)
Consumer->>B2: FetchRequest (reconnects automatically)
Note over Producer,Consumer: Failover is transparent\n(seconds, not minutes)
Producers and consumers fetch updated metadata from any available broker and reconnect to the new leader automatically. The failover takes seconds.
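How patiently a producer rides out a failover is governed by its retry and timeout settings. A minimal sketch of the settings involved, with illustrative values (serializers and other required configs omitted for brevity):

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class FailoverAwareProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        // Placeholder broker addresses; listing several lets the client bootstrap even if one is down
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        // Keep retrying transient "not leader" errors until the new leader is discovered
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // Overall deadline for a single record, including all retries (2 minutes here)
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        // Also refresh cluster metadata proactively, even without errors (5 minutes here)
        props.put(ProducerConfig.METADATA_MAX_AGE_MS_CONFIG, 300_000);
        return props;
    }
}
```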
Log Retention: Why Kafka Is Not a Queue
Kafka retains records based on two configurable policies; whichever limit is reached first triggers deletion:
flowchart LR
subgraph Policy["Retention Policy (per topic)"]
Time["Time-based\nretention.ms\ne.g. 7 days"]
Size["Size-based\nretention.bytes\ne.g. 10 GB"]
end
subgraph Partition["Partition Log"]
S1["Segment 1\n(old — eligible\nfor deletion)"]
S2["Segment 2"]
S3["Segment 3\n(active — being\nwritten to)"]
end
Time -->|triggers| S1
Size -->|triggers| S1
S1 -->|deleted by\ncleaner thread| Deleted["🗑 Deleted"]
Kafka stores each partition's log as a sequence of segments (files on disk). Closed segments are deleted once a retention limit is exceeded; the active segment, which is still being written to, is never deleted.
Log compaction is an alternative to time/size retention: instead of deleting old segments wholesale, Kafka retains the latest record for each key. This is used for topics that represent current state (e.g. user profile updates) — consumers can always reconstruct the current state by reading the compacted topic.
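Both retention styles are per-topic configuration. A minimal sketch creating a delete-retention topic (matching the 7-day and 10 GB figures above) and a compacted topic via the Admin API; topic names and sizes are examples:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RetentionExamples {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Delete-retention topic: 7 days or 10 GB per partition, whichever is hit first
            NewTopic orders = new NewTopic("orders", 6, (short) 3).configs(Map.of(
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE,
                    TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000),
                    TopicConfig.RETENTION_BYTES_CONFIG, String.valueOf(10L * 1024 * 1024 * 1024)));

            // Compacted topic: keeps at least the latest record per key (current-state topics)
            NewTopic profiles = new NewTopic("user-profiles", 6, (short) 3).configs(Map.of(
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));

            admin.createTopics(List.of(orders, profiles)).all().get();
        }
    }
}
```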
The Record Structure
Every Kafka record (event) has:
classDiagram
class KafkaRecord {
+String topic
+int partition
+long offset
+long timestamp
+byte[] key
+byte[] value
+Headers headers
}
class Headers {
+Header[] headerArray
}
class Header {
+String key
+byte[] value
}
KafkaRecord "1" --> "1" Headers
Headers "1" --> "*" Header
- key: optional bytes used for partition routing and log compaction
- value: the payload (your event data, serialized as bytes)
- headers: optional metadata key-value pairs (trace IDs, schema version, source service)
- timestamp: either producer-set (CreateTime) or broker-set (LogAppendTime)
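A minimal sketch of assembling such a record on the producer side, with example header values; the timestamp passed here is the CreateTime, which the broker overwrites if the topic is configured for LogAppendTime:

```java
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.internals.RecordHeader;

import java.nio.charset.StandardCharsets;

public class RecordWithHeaders {
    public static ProducerRecord<String, String> build() {
        // topic, partition (null = let the partitioner decide), timestamp, key, value
        ProducerRecord<String, String> record = new ProducerRecord<>(
                "orders", null, System.currentTimeMillis(), "cust-42", "{\"type\":\"OrderPlaced\"}");

        // Headers carry metadata (trace IDs, schema versions, ...) without touching the payload
        record.headers()
              .add("trace-id", "abc-123".getBytes(StandardCharsets.UTF_8))
              .add(new RecordHeader("schema-version", "2".getBytes(StandardCharsets.UTF_8)));
        return record;
    }
}
```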
Partition Count and Consumer Parallelism
The number of partitions directly controls the maximum parallelism for consumers:
flowchart TB
subgraph Topic["Topic: orders — 3 partitions"]
P0["Partition 0"]
P1["Partition 1"]
P2["Partition 2"]
end
subgraph CG["Consumer Group: inventory-service"]
C1["Consumer 1\n(thread 1)"]
C2["Consumer 2\n(thread 2)"]
C3["Consumer 3\n(thread 3)"]
end
P0 --> C1
P1 --> C2
P2 --> C3
Note1["3 partitions = 3 consumers max\nAdding a 4th consumer → idle\n(no partition to assign)"]
Rule: maximum consumer parallelism = number of partitions. Adding more consumer instances than partitions results in idle consumers. Scale partitions first, then consumers.
Choosing partition count:
- Too few: limits throughput and consumer parallelism
- Too many: more overhead (open files, replication traffic, rebalance time)
- Practical starting point: max(desired parallelism, target throughput / single-partition throughput) (a worked example follows this list)
- Common values: 6, 12, 24 for business topics; larger for high-volume ingestion
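For example, with hypothetical numbers: if a single partition sustains roughly 10 MB/s on your hardware and you target 40 MB/s, throughput alone calls for 4 partitions; if you also want 6 consumer instances working in parallel, the formula gives max(6, 40 / 10) = 6, and rounding up to a value like 12 leaves headroom to add consumers later without repartitioning.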
Key Takeaways
- A Kafka cluster is a group of brokers; one broker acts as controller (via Raft in KRaft mode)
- Topics are append-only logs split into partitions — partitions are the unit of parallelism and ordering
- Records with the same key always land on the same partition, guaranteeing per-key ordering
- Each partition has a leader (one broker) and zero or more followers (replicas on other brokers)
- replication.factor=3 means each partition has 3 copies — the cluster survives 2 broker failures
- ISR (In-Sync Replicas) defines which followers are eligible to become leader
- Kafka retains data by time or size — consumers can replay from any offset within the retention window
- Maximum consumer parallelism = number of partitions
Next: Consumer Groups, Offsets, and the __consumer_offsets Topic — how multiple consumers share a topic, how offsets are committed, and what happens on restart.