Kafka Architecture: Brokers, Topics, Partitions, and Replicas

The Cluster: Brokers and the Controller

A Kafka cluster is a group of servers, each called a broker. Brokers store data and serve producer and consumer requests. One broker in the cluster acts as the controller: it manages partition leadership, tracks brokers joining and leaving, and coordinates partition reassignments across the cluster.

In KRaft mode (Kafka 3.3+, the default from Kafka 4.0), the controller is built into Kafka itself — no ZooKeeper needed.

flowchart TB
    subgraph Cluster["Kafka Cluster (KRaft mode)"]
        direction TB
        B1["Broker 1\n⭐ Controller\n(manages metadata)"]
        B2["Broker 2\n(data broker)"]
        B3["Broker 3\n(data broker)"]
        B1 <-->|Raft consensus| B2
        B1 <-->|Raft consensus| B3
        B2 <-->|Raft consensus| B3
    end

    Producer["Producer"] -->|send records| B1
    Producer -->|send records| B2
    Consumer["Consumer"] -->|fetch records| B2
    Consumer -->|fetch records| B3

Each broker is identified by a numeric node.id (e.g. 1, 2, 3). Producers and consumers bootstrap against any broker, fetch the cluster metadata from it, and then send each request directly to the broker that leads the relevant partition; no manual routing is needed.
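
A minimal sketch of that bootstrap step, using the Java AdminClient to print the cluster's brokers and the current controller. The localhost:9092 address and the class name are placeholders, not part of any standard setup.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import java.util.Properties;

public class DescribeClusterDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Any single reachable broker is enough to bootstrap; the client
        // fetches metadata about every broker and partition from it.
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            System.out.println("Controller: " + cluster.controller().get());
            cluster.nodes().get().forEach(node ->
                    System.out.println("Broker node.id=" + node.id()
                            + " at " + node.host() + ":" + node.port()));
        }
    }
}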


Topics: Named Event Streams

A topic is a named, append-only log. Think of it as a category or feed: orders, payments, inventory-updates. Producers write to topics; consumers read from them.

flowchart LR
    OrderSvc["Order Service\n(Producer)"]
    InvSvc["Inventory Service\n(Consumer)"]
    PaySvc["Payment Service\n(Consumer)"]
    NotifSvc["Notification Service\n(Consumer)"]

    OrderSvc -->|publish| orders[/"Topic: orders"/]
    orders --> InvSvc
    orders --> PaySvc
    orders --> NotifSvc

Topics are not tied to a single broker. Their data is spread across multiple brokers via partitions.
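As a concrete sketch, the topic in the diagram above could be created programmatically with the AdminClient; the topic name, partition count, and replication factor here are illustrative, not recommendations.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // 3 partitions spread across the brokers, each kept on 3 brokers.
            NewTopic orders = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}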


Partitions: The Unit of Parallelism

Every topic is divided into partitions. A partition is an ordered, immutable sequence of records. Each record in a partition has a unique sequential number called an offset.

flowchart LR
    subgraph P0["Partition 0 — Broker 1 (Leader)"]
        direction LR
        R0["offset 0\ncust-1\nOrderPlaced"] --> R1["offset 1\ncust-3\nOrderPlaced"] --> R2["offset 2\ncust-1\nOrderCancelled"] --> R3["offset 3\n..."] --> HEAD0["← HEAD"]
    end
    subgraph P1["Partition 1 — Broker 2 (Leader)"]
        direction LR
        S0["offset 0\ncust-2"] --> S1["offset 1\ncust-4"] --> S2["offset 2\ncust-2"] --> S3["..."] --> HEAD1["← HEAD"]
    end
    subgraph P2["Partition 2 — Broker 3 (Leader)"]
        direction LR
        T0["offset 0\ncust-5"] --> T1["offset 1\ncust-6"] --> T2["offset 2\n..."] --> HEAD2["← HEAD"]
    end

Key properties of partitions:

  • Ordering is guaranteed within a partition — offset 3 always comes after offset 2 on Partition 0
  • Ordering is NOT guaranteed across partitions — a record on Partition 0 and a record on Partition 1 have no relative ordering guarantee
  • A partition lives on exactly one broker at a time (the leader)
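
A small reader sketch that makes these properties visible: it attaches to a single partition and prints offsets, which only ever increase within that partition. The topic name and broker address are assumptions carried over from the examples above.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ReadOnePartition {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Attach directly to Partition 0 of "orders" and read it from the start.
            TopicPartition p0 = new TopicPartition("orders", 0);
            consumer.assign(List.of(p0));
            consumer.seekToBeginning(List.of(p0));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Within this partition, offsets are strictly increasing.
                System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        record.partition(), record.offset(), record.key(), record.value());
            }
        }
    }
}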

How Records Are Routed to Partitions

flowchart TD
    Record["Record\nkey: cust-42\nvalue: OrderPlaced"]
    HasKey{Has key?}
    HashKey["hash(key) % numPartitions\n→ always same partition\nfor same key"]
    RoundRobin["Round-robin across partitions\n(or sticky batch)"]
    P0["Partition 0"]
    P1["Partition 1"]
    P2["Partition 2"]

    Record --> HasKey
    HasKey -->|Yes| HashKey
    HasKey -->|No| RoundRobin
    HashKey -->|"cust-42 → P1"| P1
    RoundRobin --> P0
    RoundRobin --> P2

Using a key (e.g. customerId or orderId) ensures all events for the same entity land on the same partition — preserving per-entity ordering. This is the primary reason to use message keys.
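
A producer sketch along those lines: every record carries the customer id as its key, so the default partitioner in the Java client hashes it (murmur2) and all of cust-42's events stay in order on one partition. Topic name, key, and values are illustrative.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class KeyedProducerDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                // Same key => same partition => per-customer ordering preserved.
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("orders", "cust-42", "OrderEvent-" + i);
                RecordMetadata meta = producer.send(record).get();
                System.out.printf("key=cust-42 -> partition=%d offset=%d%n",
                        meta.partition(), meta.offset());
            }
        }
    }
}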


Replicas: Fault Tolerance

Every partition can have multiple replicas: copies stored on different brokers. One replica is the leader (by default it handles all reads and writes); the others are followers, which continuously replicate from the leader.

flowchart LR
    subgraph B1["Broker 1"]
        P0L["Partition 0\n★ Leader"]
        P1F["Partition 1\n  Follower"]
        P2F["Partition 2\n  Follower"]
    end
    subgraph B2["Broker 2"]
        P0F["Partition 0\n  Follower"]
        P1L["Partition 1\n★ Leader"]
        P2F2["Partition 2\n  Follower"]
    end
    subgraph B3["Broker 3"]
        P0F2["Partition 0\n  Follower"]
        P1F2["Partition 1\n  Follower"]
        P2L["Partition 2\n★ Leader"]
    end

    Producer -->|write| P0L
    Producer -->|write| P1L
    P0L -->|replicate| P0F
    P0L -->|replicate| P0F2
    P1L -->|replicate| P1F
    P1L -->|replicate| P1F2

With replication.factor=3, each partition has 3 copies spread across 3 brokers. The data survives the loss of up to 2 brokers, since a surviving replica can be elected leader; whether the partition also stays writable depends on min.insync.replicas (with the common setting of 2 and acks=all, writes are accepted through only 1 broker failure).
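
You can inspect this layout with the AdminClient: for each partition it reports the leader, the full replica list, and the current ISR. A rough sketch, assuming the orders topic from the earlier examples.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;
import java.util.List;
import java.util.Properties;

public class ShowReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                    .allTopicNames().get().get("orders");
            // One line per partition: which broker leads it, where the copies
            // live, and which copies are currently in sync.
            desc.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}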

In-Sync Replicas (ISR)

An In-Sync Replica (ISR) is a follower that is fully caught up with the leader. The leader tracks the ISR set. A write is acknowledged to the producer only after all ISR members have written the record (when acks=all).

sequenceDiagram
    participant Producer
    participant Leader as Partition 0 Leader\n(Broker 1)
    participant F1 as Follower\n(Broker 2, ISR)
    participant F2 as Follower\n(Broker 3, ISR)

    Producer->>Leader: ProduceRequest (acks=all)
    Leader->>Leader: Write to local log
    Leader->>F1: Replicate
    Leader->>F2: Replicate
    F1-->>Leader: Ack
    F2-->>Leader: Ack
    Leader-->>Producer: ProduceResponse (offset=42)

    Note over Leader,F2: Write acknowledged only after\nall ISR members confirm
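
On the producer side this behaviour is governed by a handful of settings. The sketch below shows a durability-oriented configuration that mirrors the diagram above; the values are not a one-size-fits-all recommendation, and min.insync.replicas is a topic/broker setting rather than a producer one.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Wait for every in-sync replica to confirm the write.
        props.put("acks", "all");
        // Retry transient errors (e.g. a leader change) without creating duplicates.
        props.put("enable.idempotence", "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "cust-42", "OrderPlaced")).get();
        }
    }
}

Pairing acks=all with a topic-level min.insync.replicas=2 is the usual way to guarantee that an acknowledged write exists on at least two brokers.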

What Happens When a Broker Fails

When a broker goes down, the controller detects the failure (via missed heartbeats) and, for each partition whose leader was on that broker, elects a new leader from the remaining in-sync replicas.

sequenceDiagram
    participant Controller
    participant B2 as Broker 2\n(was follower, now new leader)
    participant B3 as Broker 3\n(follower)
    participant Producer
    participant Consumer

    Note over Controller: Broker 1 (old leader) stops heartbeating
    Controller->>Controller: Detect broker failure
    Controller->>B2: Metadata update: you are the new leader for Partition 0
    Controller->>B3: Metadata update: new leader for Partition 0 is Broker 2
    B2->>B2: Become Partition 0 leader
    Producer->>B2: ProduceRequest (reconnects automatically)
    Consumer->>B2: FetchRequest (reconnects automatically)
    Note over Producer,Consumer: Failover is transparent\n(seconds, not minutes)

Producers and consumers fetch updated metadata from any available broker and reconnect to the new leader automatically. The failover takes seconds.


Log Retention: Why Kafka Is Not a Queue

Kafka retains records according to two configurable policies; whichever limit is reached first triggers deletion:

flowchart LR
    subgraph Policy["Retention Policy (per topic)"]
        Time["Time-based\nretention.ms\ne.g. 7 days"]
        Size["Size-based\nretention.bytes\ne.g. 10 GB"]
    end

    subgraph Partition["Partition Log"]
        S1["Segment 1\n(old — eligible\nfor deletion)"]
        S2["Segment 2"]
        S3["Segment 3\n(active — being\nwritten to)"]
    end

    Time -->|triggers| S1
    Size -->|triggers| S1
    S1 -->|deleted by\ncleaner thread| Deleted["🗑 Deleted"]

Kafka stores each partition's data in segments (files on disk). Old segments are deleted once a retention limit is exceeded. The active segment is the one still being written to and is never deleted.

Log compaction is an alternative to time/size retention: instead of deleting old segments wholesale, Kafka retains the latest record for each key. This is used for topics that represent current state (e.g. user profile updates) — consumers can always reconstruct the current state by reading the compacted topic.
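
Both policies are ordinary topic configs. A sketch that sets them at topic-creation time; the values and the second, compacted topic are examples only.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RetentionConfigDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Delete-based retention: keep ~7 days OR ~10 GB per partition,
            // whichever limit is hit first.
            NewTopic orders = new NewTopic("orders", 3, (short) 3)
                    .configs(Map.of(
                            "cleanup.policy", "delete",
                            "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000),
                            "retention.bytes", String.valueOf(10L * 1024 * 1024 * 1024)));

            // Compacted topic: keep at least the latest record per key.
            NewTopic profiles = new NewTopic("user-profiles", 3, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));

            admin.createTopics(List.of(orders, profiles)).all().get();
        }
    }
}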


The Record Structure

Every Kafka record (event) has:

classDiagram
    class KafkaRecord {
        +String topic
        +int partition
        +long offset
        +long timestamp
        +byte[] key
        +byte[] value
        +Headers headers
    }

    class Headers {
        +Header[] headerArray
    }

    class Header {
        +String key
        +byte[] value
    }

    KafkaRecord "1" --> "1" Headers
    Headers "1" --> "*" Header

  • key: optional bytes used for partition routing and log compaction
  • value: the payload (your event data, serialized as bytes)
  • headers: optional metadata key-value pairs (trace IDs, schema version, source service)
  • timestamp: either producer-set (CreateTime) or broker-set (LogAppendTime)
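
A producer-side sketch that fills in these fields; the header names and values are invented for illustration, and the timestamp is left to the client/broker defaults.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class RecordWithHeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "cust-42", "{\"type\":\"OrderPlaced\"}");
            // Headers carry metadata alongside the payload without touching it.
            record.headers()
                  .add("trace-id", "abc-123".getBytes(StandardCharsets.UTF_8))
                  .add("schema-version", "2".getBytes(StandardCharsets.UTF_8));
            producer.send(record).get();
        }
    }
}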

Partition Count and Consumer Parallelism

The number of partitions directly controls the maximum parallelism for consumers:

flowchart TB
    subgraph Topic["Topic: orders — 3 partitions"]
        P0["Partition 0"]
        P1["Partition 1"]
        P2["Partition 2"]
    end

    subgraph CG["Consumer Group: inventory-service"]
        C1["Consumer 1\n(thread 1)"]
        C2["Consumer 2\n(thread 2)"]
        C3["Consumer 3\n(thread 3)"]
    end

    P0 --> C1
    P1 --> C2
    P2 --> C3

    Note1["3 partitions = 3 consumers max\nAdding a 4th consumer → idle\n(no partition to assign)"]

Rule: maximum consumer parallelism = number of partitions. Adding more consumer instances than partitions results in idle consumers. Scale partitions first, then consumers.
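
A consumer-group sketch to go with the diagram: each process runs this same code with the same group.id, and Kafka assigns each instance a disjoint subset of the partitions. With 3 partitions, a 4th identical instance would receive nothing. Group id, topic, and address are placeholders.

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class InventoryConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Every instance sharing this group.id splits the topic's partitions.
        props.put("group.id", "inventory-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("assigned partition=%d offset=%d%n",
                        r.partition(), r.offset()));
            }
        }
    }
}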

Choosing partition count:

  • Too few: limits throughput and consumer parallelism
  • Too many: more overhead (open files, replication traffic, rebalance time)
  • Practical starting point: max(desired parallelism, target throughput / single-partition throughput); see the worked sketch after this list
  • Common values: 6, 12, 24 for business topics; larger for high-volume ingestion
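
A tiny worked example of that starting-point formula; all the numbers are hypothetical.

public class PartitionCountEstimate {
    public static void main(String[] args) {
        // Hypothetical inputs, for illustration only.
        int desiredParallelism = 6;        // consumer instances we plan to run
        double targetMBps = 50.0;          // expected produce rate, MB/s
        double perPartitionMBps = 10.0;    // measured single-partition throughput, MB/s

        int partitions = (int) Math.max(desiredParallelism,
                Math.ceil(targetMBps / perPartitionMBps));
        System.out.println("Starting point: " + partitions + " partitions"); // prints 6
    }
}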

Key Takeaways

  • A Kafka cluster is a group of brokers; one broker acts as controller (via Raft in KRaft mode)
  • Topics are append-only logs split into partitions — partitions are the unit of parallelism and ordering
  • Records with the same key always land on the same partition, guaranteeing per-key ordering
  • Each partition has a leader (one broker) and zero or more followers (replicas on other brokers)
  • replication.factor=3 means each partition has 3 copies; the data survives up to 2 broker failures (write availability also depends on min.insync.replicas)
  • ISR (In-Sync Replicas) defines which followers are eligible to become leader
  • Kafka retains data by time or size — consumers can replay from any offset within the retention window
  • Maximum consumer parallelism = number of partitions

Next: Consumer Groups, Offsets, and the __consumer_offsets Topic — how multiple consumers share a topic, how offsets are committed, and what happens on restart.