Data Systems
    Updated May 2026

    Apache Kafka Monitoring

    Monitor Apache Kafka UnderReplicatedPartitions, OfflinePartitionsCount, ActiveControllerCount, consumer group lag, ISR shrinks/expands, and tiered-storage segment metrics in real time — via JMX or the OpenTelemetry Kafka receiver.

    Why monitor Apache Kafka?

    Kafka is the event-streaming backbone for CDC pipelines, microservice eventing, and high-throughput data infrastructure. Under-replicated partitions, offline partitions, consumer-group lag, and controller failures cause data loss or cascading failures in downstream microservices, and they stay invisible without dedicated monitoring. Catching them at one-minute resolution is the difference between a clean recovery and a multi-day outage.

    Auto-discovery via Xitogent — broker IDs, topics, partitions, consumer groups detected automatically
    Native JMX MBean reading (kafka.server / kafka.controller / kafka.network namespaces)
    Consumer group lag tracking via `__consumer_offsets` (no per-consumer instrumentation needed)
    `UnderReplicatedPartitions`, `OfflinePartitionsCount`, `UncleanLeaderElectionsPerSec` cluster health
    Per-topic `MessagesInPerSec`, `BytesInPerSec`, `BytesOutPerSec` throughput
    Request latency tracking (Produce, Fetch, Metadata) via RequestMetrics
    KRaft mode (Kafka 3.3+ / mandatory in 4.0) controller-quorum metrics
    Tiered storage segment metrics (Kafka 3.8+ with KIP-405)
    Customizable alert thresholds for every metric
    1-minute metric collection intervals out of the box
    What is Kafka monitoring?

    Kafka monitoring, explained

    Kafka monitoring catches under-replicated partitions, offline partitions, consumer-group lag spikes, ISR shrinks, controller failures, and disk pressure before they cause data loss, downstream microservice failures, or full broker outages. For CDC pipelines, event-sourcing systems, microservice eventing, and any production Kafka cluster, per-broker + per-consumer-group visibility is what separates a 60-second alert on a lagging consumer from finding a 50-million-message backlog at end-of-day. Xitoring auto-discovers your brokers, reads JMX MBeans + consumer offsets, and routes alerts to Slack, PagerDuty, Telegram, or your existing on-call.

    Metrics

    What we monitor

    UnderReplicatedPartitions

    Partitions where ISR (in-sync replicas) < replication factor. Any sustained non-zero value = a broker is lagging, restarting, or dead. The single most important Kafka cluster-health alert.

    UnderMinIsrPartitionCount

    Partitions with ISR below `min.insync.replicas`. When this is non-zero, producers with `acks=all` start failing — the cluster has lost durability guarantees for those partitions.
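The two definitions above can be sketched in a few lines. This is an illustrative sketch only: the `PartitionState` record and the sample data are hypothetical, and in practice the replica and ISR lists would come from `kafka-topics.sh --describe` or an admin client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    topic: str
    partition: int
    replicas: tuple[int, ...]  # broker IDs assigned to the partition
    isr: tuple[int, ...]       # broker IDs currently in sync


def under_replicated(partitions):
    """Partitions whose ISR is smaller than the assigned replica set."""
    return [p for p in partitions if len(p.isr) < len(p.replicas)]


def under_min_isr(partitions, min_insync_replicas):
    """Partitions whose ISR has dropped below min.insync.replicas."""
    return [p for p in partitions if len(p.isr) < min_insync_replicas]


# Hypothetical snapshot of one topic with replication factor 3.
partitions = [
    PartitionState("orders", 0, (1, 2, 3), (1, 2, 3)),
    PartitionState("orders", 1, (1, 2, 3), (1, 2)),
    PartitionState("orders", 2, (1, 2, 3), (3,)),
]
urp = under_replicated(partitions)
umi = under_min_isr(partitions, min_insync_replicas=2)
print([p.partition for p in urp], [p.partition for p in umi])
```

Partition 1 is under-replicated but still writable with `acks=all`; partition 2 is both under-replicated and below the minimum ISR, which is the case that makes producers fail.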

    OfflinePartitionsCount

    Partitions with no live leader. Any non-zero value = data is unavailable for read or write. Critical alert; usually means controller failure or rapid broker churn.

    ActiveControllerCount

    Should be exactly 1 across the cluster. 0 = no controller (controller election in progress or failed); 2+ = split-brain (catastrophic). Sum the metric across all brokers.
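A minimal sketch of the "sum across all brokers" rule, assuming you already collect one `ActiveControllerCount` value per broker. The function name and the status labels are made up for illustration.

```python
def controller_state(active_controller_count_by_broker):
    """Classify cluster controller health from per-broker gauge values."""
    total = sum(active_controller_count_by_broker.values())
    if total == 1:
        return "ok"
    # Zero means no active controller; more than one means split-brain.
    return "no-controller" if total == 0 else "split-brain"


# Hypothetical readings keyed by broker ID.
print(controller_state({1: 0, 2: 1, 3: 0}))
print(controller_state({1: 0, 2: 0, 3: 0}))
print(controller_state({1: 1, 2: 1, 3: 0}))
```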

    IsrShrinksPerSec / IsrExpandsPerSec

    Rate of ISR membership changes. Healthy steady state is near zero. Frequent shrink/expand = network instability between brokers or one broker repeatedly falling behind.
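If you sample the cumulative `Count` attribute of these meters rather than a precomputed rate, a per-second rate can be derived from two samples. The sketch below assumes timestamps in seconds and treats a decreasing counter as a broker restart; both choices are assumptions, not a description of how Xitogent computes the value.

```python
def per_second_rate(prev_count, prev_ts, curr_count, curr_ts):
    """Derive an events-per-second rate from two cumulative counter samples."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_count - prev_count
    if delta < 0:
        # Counter went backwards: assume the broker restarted and the
        # counter began again from zero.
        delta = curr_count
    return delta / elapsed


# Six ISR shrinks over a 60-second collection interval.
print(per_second_rate(120, 0.0, 126, 60.0))
```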

    Consumer Group Lag (records_lag_max)

    Maximum lag, in records, across all partitions a consumer group reads. The headline metric for consumer health: alert when lag exceeds the backlog your downstream processing-time budget can absorb.
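As a rough sketch of the arithmetic, lag per partition is the log-end offset minus the group's committed offset, and the group-level figure is the maximum of those. The function name and sample offsets below are hypothetical.

```python
def consumer_group_lag(log_end_offsets, committed_offsets):
    """Return (per-partition lag, max lag) for one consumer group.

    Both arguments map (topic, partition) -> offset. Partitions without a
    committed offset are skipped, since their lag is undefined here.
    """
    lag = {
        tp: max(log_end_offsets[tp] - committed, 0)
        for tp, committed in committed_offsets.items()
        if tp in log_end_offsets
    }
    return lag, max(lag.values(), default=0)


log_end = {("orders", 0): 1250, ("orders", 1): 2400, ("orders", 2): 310}
committed = {("orders", 0): 1200, ("orders", 1): 900}
lag, max_lag = consumer_group_lag(log_end, committed)
print(lag, max_lag)
```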

    UncleanLeaderElectionsPerSec

    Rate of out-of-sync replicas being promoted to leader (with data loss). Should be 0 — non-zero means `unclean.leader.election.enable=true` AND a real failure event occurred.

    MessagesInPerSec / BytesIn / BytesOut

    Per-broker and per-topic throughput. Sudden drops with stable producer count = ingest issue; sudden spikes = retry storm or runaway producer.
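One simple way to flag sudden drops and spikes is to compare the current rate with a recent baseline. The sketch below uses the mean of a short history and two ratio thresholds; the threshold values (0.5 and 2.0) and the labels are arbitrary assumptions to tune per cluster.

```python
from statistics import mean


def classify_throughput(history, current, drop_ratio=0.5, spike_ratio=2.0):
    """Compare the current rate with the mean of recent samples."""
    if not history or mean(history) == 0:
        return "no-baseline"
    ratio = current / mean(history)
    if ratio <= drop_ratio:
        return "drop"
    if ratio >= spike_ratio:
        return "spike"
    return "normal"


# Hypothetical MessagesInPerSec samples for one topic.
recent = [1000, 1100, 900, 1000]
print(classify_throughput(recent, 950))
print(classify_throughput(recent, 300))
print(classify_throughput(recent, 2600))
```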

    Request Latency (p99)

    p99 of Produce, Fetch, Metadata request handler time from `kafka.network:type=RequestMetrics`. Catches broker overload before it causes timeouts at clients.

    LeaderCount per Broker

    Partition leaders per broker. Uneven distribution (one broker holding 60%+ of leaders) = imbalanced cluster, fix with `kafka-reassign-partitions.sh`.
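The imbalance check can be sketched as a share calculation over the partition-to-leader mapping. The 60% cutoff mirrors the figure above; the function names and sample mapping are hypothetical.

```python
from collections import Counter


def leader_share(partition_leaders):
    """Fraction of all partition leaders held by each broker."""
    counts = Counter(partition_leaders.values())
    total = sum(counts.values())
    return {broker: n / total for broker, n in counts.items()}


def overloaded_brokers(partition_leaders, max_share=0.6):
    """Brokers leading at least `max_share` of all partitions."""
    shares = leader_share(partition_leaders)
    return sorted(b for b, share in shares.items() if share >= max_share)


# Hypothetical mapping of (topic, partition) -> leader broker ID.
leaders = {
    ("orders", 0): 1, ("orders", 1): 1, ("orders", 2): 1,
    ("audit", 0): 1, ("audit", 1): 2, ("audit", 2): 3,
}
print(overloaded_brokers(leaders))
```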

    Log Size per Topic

    Aggregate on-disk log size per topic from `kafka.log:type=Log,name=Size`. Drives disk-space alerts and informs tiered-storage policies in Kafka 3.8+.
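Since the size gauge is reported per partition, the per-topic figure is a sum over partitions. A minimal sketch, assuming the sizes have already been read into a dictionary; the budget value and helper names are illustrative.

```python
from collections import defaultdict

GIB = 1024 ** 3


def log_size_by_topic(partition_sizes):
    """Sum per-partition log sizes (bytes) into per-topic totals."""
    totals = defaultdict(int)
    for (topic, _partition), size_bytes in partition_sizes.items():
        totals[topic] += size_bytes
    return dict(totals)


def topics_over_budget(totals, budget_bytes):
    """Topics whose total on-disk size exceeds the given budget."""
    return sorted(t for t, size in totals.items() if size > budget_bytes)


# Hypothetical readings keyed by (topic, partition).
sizes = {("orders", 0): 4 * GIB, ("orders", 1): 3 * GIB, ("audit", 0): 1 * GIB}
totals = log_size_by_topic(sizes)
print(topics_over_budget(totals, budget_bytes=5 * GIB))
```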

    RemoteLogManager (tiered storage)

    Kafka 3.8+ tiered storage metrics: bytes uploaded to remote tier, segments in remote vs local, fetch latency from remote. Catches S3 connectivity / IAM issues breaking tiered fetches.

    Triggers & Alerts

    Configurable alert triggers

    Set up custom triggers in your dashboard to get notified the moment Apache Kafka metrics cross your defined thresholds.


    Consumer Lag

    critical

    Fires when a consumer group falls behind.

    Under-Replicated Partitions

    critical

    Alerts on replication issues.

    Broker Down

    critical

    Triggers when a broker leaves the cluster.

    Disk Usage

    warning

    Fires when broker disk is filling up.

    01

    Importance of Kafka Monitoring

    Large Kafka deployments process trillions of messages a day. Consumer lag, broker failures, and partition imbalance can cause data pipeline failures.

    • Detect consumer lag before data loss
    • Monitor ISR for data durability
    • Track broker health across clusters
    • Ensure partition balance
    02

    Why Choose Xitoring

    Enterprise-grade Kafka monitoring.

    • Zero-config setup
    • Global nodes
    • Unified dashboard
    • Multi-channel alerts
    • Historical retention
    Use cases

    Common Kafka monitoring scenarios

    Where Kafka typically runs today — and what could go wrong if no one's watching.

    The messaging backbone connecting your apps

    When Kafka carries the messages that move data between your apps, any slowdown means one app is quietly falling behind — and the consequences (delayed updates, stale data, broken workflows) only show up later. We catch the lag the moment it begins so it never becomes a customer-visible problem.

    Kafka running inside Kubernetes

    When Kafka runs in Kubernetes, the platform constantly moves it around — and a routine restart can briefly weaken the safety net that keeps your data protected. We watch every restart and rebalance so a normal update can't quietly leave the system one failure away from data loss.

    Self-managed Kafka for high-volume data

    Companies running their own Kafka at scale need it to be rock-solid — it's usually carrying the most valuable data they have. We watch the signals that keep it healthy so the team can focus on building products instead of firefighting the messaging layer.

    Before you start

    Prerequisites for Apache Kafka

    Make sure you've got these in place — most installs are a 60-second job once they are.

    • Kafka 3.x or 4.x brokers with JMX enabled (commonly on port 9999; Kafka opens no JMX port unless you configure one), or Confluent Platform / Redpanda via the shared Kafka API
    • Network reachability from Xitogent to each broker's JMX port
    • JMX authentication credentials if security is configured (or anonymous JMX on the internal monitoring network)
    Setup Guide

    Get started in minutes

    1

    Install Xitogent on each broker

    Install the lightweight Xitogent monitoring agent on every Kafka broker you want to monitor.

    curl -s https://xitoring.com/install.sh | sudo bash -s -- --key=YOUR_API_KEY
    2

    Enable JMX on each broker

    Kafka exposes broker metrics through JMX. Set `JMX_PORT` (typically 9999) on each broker to open a JMX listener, use `KAFKA_JMX_OPTS` for authentication and SSL options, restart the broker, and confirm the agent host can connect to the JMX port.

    sudo xitogent integrate
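To confirm reachability from the agent host, a plain TCP connect is a reasonable first check. This sketch only proves the port accepts connections; it does not perform a JMX handshake, and JMX over RMI may also need a second port open. The host name in the usage comment is a placeholder.

```python
import socket


def jmx_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example usage (placeholder host name):
#   jmx_port_reachable("broker-1.example.internal", 9999)
```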
    3

    Enable the Kafka integration

    Use the Xitoring dashboard or CLI to enable the Kafka integration. Xitogent auto-discovers broker IDs, topics, and consumer groups across the cluster.

    4

    Configure alert thresholds (optional)

    Set custom thresholds for Consumer Lag, Under-Replicated Partitions, or Broker Down events to catch replication issues and back-pressure before consumers fall behind.

    5

    Verify it's working

    Run this command on the server to confirm Xitogent picked up the integration. Fresh metrics will start streaming to your dashboard within ~30 seconds.

    sudo xitogent status

    Frequently asked questions

    What is Kafka monitoring?
    Kafka monitoring is the continuous collection of broker, controller, network, and consumer-group performance data from JMX MBeans — `UnderReplicatedPartitions`, `OfflinePartitionsCount`, `ActiveControllerCount`, `ISR shrinks/expands`, per-topic throughput, request latency, consumer lag — combined with alerting on threshold breaches. It's how teams catch broker failures, replication issues, and consumer slowdowns before they cause data loss or downstream microservice cascades.
    How do I monitor Kafka consumer lag?
    Two approaches: (1) `kafka-consumer-groups.sh --bootstrap-server <host:port> --describe --group <group>` returns per-partition `LAG` (log-end offset minus the group's committed offset); (2) read directly from the `__consumer_offsets` topic. Track `records_lag_max` across partitions per consumer group. Alert when lag exceeds your processing-time SLO, which varies by use case: it could be 10K messages for a CDC pipeline or 100 for a real-time enrichment service.
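For the first approach, the tabular output can be reduced to a maximum lag per group. The sketch below locates columns by header name instead of by position because the column layout has changed between Kafka versions; the sample text is illustrative, not captured from a real cluster.

```python
def max_lag_by_group(describe_output):
    """Return {group: max LAG} from `--describe` style tabular text."""
    group_col = lag_col = None
    result = {}
    for line in describe_output.splitlines():
        cols = line.split()
        if "GROUP" in cols and "LAG" in cols:
            # Header row: remember which columns hold the group and lag.
            group_col, lag_col = cols.index("GROUP"), cols.index("LAG")
            continue
        if lag_col is None or len(cols) <= max(group_col, lag_col):
            continue
        if cols[lag_col].isdigit():  # "-" means no committed offset yet
            group = cols[group_col]
            result[group] = max(result.get(group, 0), int(cols[lag_col]))
    return result


sample = """
GROUP    TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID  HOST       CLIENT-ID
billing  orders  0          1200            1250            50    c1-abc       /10.0.0.5  c1
billing  orders  1          900             2400            1500  c2-def       /10.0.0.6  c2
billing  orders  2          -               310             -     -            -          -
"""
print(max_lag_by_group(sample))
```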
    What are under-replicated partitions and how do I fix them?
    Under-replicated partitions have fewer ISRs (in-sync replicas) than the replication factor — usually because a broker is dead, lagging, or restarting. Run `kafka-topics.sh --describe --under-replicated-partitions` to list them. Fix the underlying broker, then ISR catches up automatically. If a broker is permanently lost, use `kafka-reassign-partitions.sh` to move replicas to surviving brokers. Sustained UnderReplicatedPartitions > 0 = page on-call.
    How do I monitor Kafka broker JMX metrics with Prometheus?
    Standard pattern: run the Prometheus JMX Exporter as a Java agent in each broker JVM (`-javaagent:jmx_prometheus_javaagent.jar=8080:kafka-config.yml`); Prometheus then scrapes the exporter's `/metrics` endpoint. The `kafka-config.yml` file defines which MBeans are exported and how they map to metric names. Xitogent reads JMX directly without the exporter, but is compatible with environments that already run the exporter for Grafana dashboards.
    What is KRaft mode and how does monitoring change without ZooKeeper?
    KRaft (Kafka Raft) replaces ZooKeeper with an internal Raft-based controller quorum (production-ready since Kafka 3.3, the only option in 4.0). Monitoring changes: there are no ZooKeeper ensemble metrics; instead, watch the controller quorum (typically 3 or 5 nodes run as controllers), the `kafka.controller:type=KafkaController` MBeans, and `kafka.server:type=raft-metrics`. `ActiveControllerCount` semantics are the same (there should be exactly 1 active controller).
    How do I detect Kafka offline partitions?
    `kafka.controller:type=KafkaController,name=OfflinePartitionsCount` — should be 0. Any non-zero value = data unavailable for both read and write. Combined with `UncleanLeaderElectionsPerSec > 0` = the cluster lost data by promoting an out-of-sync replica. List affected partitions via `kafka-topics.sh --describe --unavailable-partitions`.
    How do I monitor a Kafka cluster on Kubernetes (Strimzi)?
    Strimzi exposes JMX on each broker at port `9999` once `jmxOptions` is set in the `Kafka` custom resource; it is not enabled by default. Install Xitogent on a K8s node with network access to broker pods, or deploy it as a DaemonSet that connects to each pod's JMX endpoint via Service DNS. Watch operator-driven rolling-update events (visible as ISR shrinks) and bound the alert window so routine updates don't page on-call. Per-pod disk usage against the tiered-storage threshold matters here too.
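One way to bound the alert window is to page only when a metric stays above its threshold for several consecutive samples. This is a sketch under the assumption of one sample per minute; the function name, the five-sample window, and the sample series are illustrative.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if the latest `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(v > threshold for v in recent)


# Hypothetical one-minute UnderReplicatedPartitions samples.
rolling_update = [0, 4, 6, 0, 0]   # brief dip while pods restart
broker_outage = [3, 3, 5, 5, 5]    # stays non-zero for five minutes
print(sustained_breach(rolling_update, threshold=0, min_consecutive=5))
print(sustained_breach(broker_outage, threshold=0, min_consecutive=5))
```

The window length trades alert latency against noise: it should be longer than a typical pod restart but shorter than the time you are willing to run with reduced redundancy.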
    Kafka vs Redpanda monitoring — what's different?
    Redpanda is wire-compatible with the Kafka API and exposes equivalent metrics, but via Prometheus directly (no JMX — Redpanda is C++, not JVM). Same metric semantics (UnderReplicatedPartitions, ConsumerLag, etc.) but different transport. Xitogent works with both — JMX path for Apache Kafka / Confluent / Strimzi, Prometheus path for Redpanda. The cluster-health alert set is identical.
    What Kafka versions are supported?
    Kafka 3.x (tiered storage via KIP-405 reached GA in 3.9, after early access from 3.6), Kafka 4.0 (KRaft mandatory, no ZooKeeper; KIP-848 next-gen consumer rebalance protocol with server-side coordinator), and Kafka 4.1. Older 2.x works but lacks KRaft and tiered storage. Confluent Platform 7.x / 8.x and Redpanda are compatible via the shared Kafka API.

    Start monitoring Apache Kafka today

    Set up in under 60 seconds. No credit card required. Full metrics from day one.
