Data Systems

Updated May 2026

Apache Kafka Monitoring

Monitor Apache Kafka UnderReplicatedPartitions, OfflinePartitionsCount, ActiveControllerCount, consumer group lag, ISR shrinks/expands, and tiered-storage segment metrics in real time — via JMX or the OpenTelemetry Kafka receiver.

Why monitor Apache Kafka?

Kafka is the event-streaming backbone for CDC pipelines, microservice eventing, and high-throughput data infrastructure. Under-replicated partitions, offline partitions, consumer-group lag, and controller failures cause data loss or downstream microservice cascades — and they're invisible without dedicated monitoring. Catching them at minute resolution is the difference between a clean recovery and a multi-day outage.

Auto-discovery via Xitogent — broker IDs, topics, partitions, consumer groups detected automatically

Native JMX MBean reading (kafka.server / kafka.controller / kafka.network namespaces)

Consumer group lag tracking via `__consumer_offsets` (no per-consumer instrumentation needed)

`UnderReplicatedPartitions`, `OfflinePartitionsCount`, `UncleanLeaderElectionsPerSec` cluster health

Per-topic `MessagesInPerSec`, `BytesInPerSec`, `BytesOutPerSec` throughput

Request latency tracking (Produce, Fetch, Metadata) via RequestMetrics

KRaft mode (Kafka 3.3+ / mandatory in 4.0) controller-quorum metrics

Tiered storage segment metrics (KIP-405; early access in Kafka 3.6, production-ready in 3.9)

Customizable alert thresholds for every metric

1-minute metric collection intervals out of the box

What is Kafka monitoring?

Kafka monitoring, explained

Kafka monitoring catches under-replicated partitions, offline partitions, consumer-group lag spikes, ISR shrinks, controller failures, and disk pressure before they cause data loss, downstream microservice failures, or full broker outages. For CDC pipelines, event-sourcing systems, microservice eventing, and any production Kafka cluster, per-broker + per-consumer-group visibility is what separates a 60-second alert on a lagging consumer from finding a 50-million-message backlog at end-of-day. Xitoring auto-discovers your brokers, reads JMX MBeans + consumer offsets, and routes alerts to Slack, PagerDuty, Telegram, or your existing on-call.

Metrics

What we monitor

UnderReplicatedPartitions

Partitions where ISR (in-sync replicas) < replication factor. Any sustained non-zero value = a broker is lagging, restarting, or dead. The single most important Kafka cluster-health alert.

UnderMinIsrPartitionCount

Partitions with ISR below `min.insync.replicas`. When this is non-zero, producers with `acks=all` start failing — the cluster has lost durability guarantees for those partitions.
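As a rough illustration of how the two ISR checks differ, the sketch below classifies partitions from their replica and ISR lists. The `PartitionState` type, the sample data, and the `min_insync_replicas=2` default are assumptions made for the example, not Xitogent or Kafka APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition's replication state."""
    topic: str
    partition: int
    replicas: tuple      # broker IDs assigned to the partition
    isr: tuple           # broker IDs currently in sync
    min_insync_replicas: int = 2  # assumed topic-level setting


def is_under_replicated(p: PartitionState) -> bool:
    # Fewer in-sync replicas than the replication factor.
    return len(p.isr) < len(p.replicas)


def is_under_min_isr(p: PartitionState) -> bool:
    # ISR has dropped below min.insync.replicas: acks=all writes get rejected.
    return len(p.isr) < p.min_insync_replicas


partitions = [
    PartitionState("orders", 0, (1, 2, 3), (1, 2, 3)),  # healthy
    PartitionState("orders", 1, (1, 2, 3), (1, 2)),     # one replica behind
    PartitionState("orders", 2, (1, 2, 3), (3,)),       # durability lost
]

under_replicated = [p.partition for p in partitions if is_under_replicated(p)]
under_min_isr = [p.partition for p in partitions if is_under_min_isr(p)]
print(under_replicated)  # [1, 2]
print(under_min_isr)     # [2]
```

Partition 1 is under-replicated but still accepts `acks=all` writes; partition 2 fails both checks, which is why the second metric is the stricter alert.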

OfflinePartitionsCount

Partitions with no live leader. Any non-zero value = data is unavailable for read or write. Critical alert; usually means controller failure or rapid broker churn.

ActiveControllerCount

Should be exactly 1 across the cluster. 0 = no controller (controller election in progress or failed); 2+ = split-brain (catastrophic). Sum the metric across all brokers.
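The "sum across all brokers" rule can be sketched in a few lines. The function name, the status labels, and the shape of the input (a broker ID to gauge value mapping) are illustrative assumptions.

```python
def controller_status(active_controller_counts: dict) -> str:
    """Classify cluster state from per-broker ActiveControllerCount values.

    The input maps a broker ID to the gauge value that broker reports
    (0 or 1). The returned labels are made up for this sketch.
    """
    total = sum(active_controller_counts.values())
    if total == 0:
        return "no-controller"   # election in progress or failed
    if total == 1:
        return "ok"
    return "split-brain"         # more than one broker claims the role


print(controller_status({1: 0, 2: 1, 3: 0}))  # ok
print(controller_status({1: 0, 2: 0, 3: 0}))  # no-controller
print(controller_status({1: 1, 2: 1, 3: 0}))  # split-brain
```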

IsrShrinksPerSec / IsrExpandsPerSec

Rate of ISR membership changes. Healthy steady state is near zero. Frequent shrink/expand = network instability between brokers or one broker repeatedly falling behind.

Consumer Group Lag (records_lag_max)

Max lag across all partitions a consumer group is reading. The headline metric for consumer health — alert when lag > your processing-time budget for downstream effects.
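For intuition, group lag reduces to simple offset arithmetic: per partition, the log end offset minus the group's committed offset, then the maximum across partitions. The sketch below assumes offsets have already been fetched into plain dictionaries keyed by `(topic, partition)`; the function names are hypothetical.

```python
def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    # Clamp at zero: offsets are sampled at slightly different times,
    # so a tiny negative difference is possible.
    return max(log_end_offset - committed_offset, 0)


def group_max_lag(end_offsets: dict, committed: dict) -> int:
    """Max lag over partitions that have both an end offset and a commit."""
    lags = [
        partition_lag(end_offsets[tp], offset)
        for tp, offset in committed.items()
        if tp in end_offsets
    ]
    return max(lags, default=0)


end_offsets = {("orders", 0): 1520, ("orders", 1): 1400, ("orders", 2): 300}
committed = {("orders", 0): 1500, ("orders", 1): 900}  # partition 2: no commit yet

print(group_max_lag(end_offsets, committed))  # 500
```

Partitions without a committed offset are skipped here; whether to treat them as fully lagging is a policy choice that depends on the consumer's reset behavior.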

UncleanLeaderElectionsPerSec

Rate of out-of-sync replicas being promoted to leader (with data loss). Should be 0 — non-zero means `unclean.leader.election.enable=true` AND a real failure event occurred.

MessagesInPerSec / BytesIn / BytesOut

Per-broker and per-topic throughput. Sudden drops with stable producer count = ingest issue; sudden spikes = retry storm or runaway producer.

Request Latency (p99)

p99 of Produce, Fetch, Metadata request handler time from `kafka.network:type=RequestMetrics`. Catches broker overload before it causes timeouts at clients.

LeaderCount per Broker

Partition leaders per broker. Uneven distribution (one broker holding 60%+ of leaders) = imbalanced cluster; fix with `kafka-reassign-partitions.sh`.
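One way to express the 60% rule of thumb is to compute each broker's share of partition leaders. This is a minimal sketch; the input mapping, the function names, and the default threshold are assumptions for illustration.

```python
from collections import Counter


def leader_shares(partition_leaders: dict) -> dict:
    """Fraction of all partition leaders held by each broker."""
    counts = Counter(partition_leaders.values())
    total = sum(counts.values())
    return {broker: n / total for broker, n in counts.items()}


def overloaded_brokers(partition_leaders: dict, threshold: float = 0.6) -> list:
    """Brokers whose leader share is at or above the threshold."""
    shares = leader_shares(partition_leaders)
    return sorted(b for b, share in shares.items() if share >= threshold)


# (topic, partition) -> leader broker ID; sample data only.
leaders = {
    ("orders", 0): 1,
    ("orders", 1): 1,
    ("orders", 2): 1,
    ("orders", 3): 1,
    ("orders", 4): 2,
}

print(leader_shares(leaders))       # {1: 0.8, 2: 0.2}
print(overloaded_brokers(leaders))  # [1]
```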

Log Size per Topic

Aggregate on-disk log size per topic from `kafka.log:type=Log,name=Size`. Drives disk-space alerts and informs tiered-storage policies in Kafka 3.9+.

RemoteLogManager (tiered storage)

Kafka 3.9+ tiered storage metrics: bytes uploaded to remote tier, segments in remote vs local, fetch latency from remote. Catches S3 connectivity / IAM issues breaking tiered fetches.

Triggers & Alerts

Configurable alert triggers

Set up custom triggers in your dashboard to get notified the moment Apache Kafka metrics cross your defined thresholds.

Apache Kafka monitoring trigger configuration dashboard

Consumer Lag

critical

Fires when a consumer group's lag exceeds your threshold.

Under-Replicated Partitions

critical

Fires when partitions have fewer in-sync replicas than their replication factor.

Broker Down

critical

Triggers when a broker leaves the cluster.

Disk Usage

warning

Fires when broker disk is filling up.

Importance of Kafka Monitoring

The largest Kafka deployments process trillions of messages daily. At any scale, consumer lag, broker failures, and partition imbalance can break data pipelines.

Detect consumer lag before data loss
Monitor ISR for data durability
Track broker health across clusters
Ensure partition balance

Why Choose Xitoring

Enterprise-grade Kafka monitoring.

Zero-config setup
Global nodes
Unified dashboard
Multi-channel alerts
Historical retention

Use cases

Common Kafka monitoring scenarios

Where Kafka typically runs today — and what could go wrong if no one's watching.

The messaging backbone connecting your apps

When Kafka carries the messages that move data between your apps, any slowdown means one app is quietly falling behind — and the consequences (delayed updates, stale data, broken workflows) only show up later. We catch the lag the moment it begins so it never becomes a customer-visible problem.

Kafka running inside Kubernetes

When Kafka runs in Kubernetes, the platform constantly moves it around — and a routine restart can briefly weaken the safety net that keeps your data protected. We watch every restart and rebalance so a normal update can't quietly leave the system one failure away from data loss.

Self-managed Kafka for high-volume data

Companies running their own Kafka at scale need it to be rock-solid — it's usually carrying the most valuable data they have. We watch the signals that keep it healthy so the team can focus on building products instead of firefighting the messaging layer.

Before you start

Prerequisites for Apache Kafka

Make sure you've got these in place — most installs are a 60-second job once they are.

Kafka 3.x or 4.x brokers with JMX enabled (commonly port 9999; Kafka opens no JMX port unless you configure one), or Confluent Platform / Redpanda via shared API
Network reachability from Xitogent to each broker's JMX port
JMX authentication credentials if security is configured (or anonymous JMX on the internal monitoring network)

Setup Guide

Get started in minutes

Install Xitogent on each broker

Install the lightweight Xitogent monitoring agent on every Kafka broker you want to monitor.

curl -s https://xitoring.com/install.sh | sudo bash -s -- --key=YOUR_API_KEY

Enable JMX on each broker

Kafka exposes broker metrics through JMX. Set `KAFKA_JMX_OPTS` to enable a JMX listener (typically port 9999) on each broker, restart the broker so the JVM picks up the new options, and confirm the agent host can connect to the JMX port.

sudo xitogent integrate

Enable the Kafka integration

Use the Xitoring dashboard or CLI to enable the Kafka integration. Xitogent auto-discovers broker IDs, topics, and consumer groups across the cluster.

Configure alert thresholds (optional)

Set custom thresholds for Consumer Lag, Under-Replicated Partitions, or Broker Down events to catch replication issues and back-pressure before consumers fall behind.

Verify it's working

Run this command on the server to confirm Xitogent picked up the integration. Fresh metrics will start streaming to your dashboard within ~30 seconds.

sudo xitogent status

Compare

Considering alternatives?

See how Xitoring stacks up against the alternatives for Apache Kafka monitoring — flat pricing, deeper integrations, and one agent that covers your whole stack.

Xitoring vs Datadog

Pay-per-host pricing gets expensive fast at scale. See where Xitoring delivers the same coverage on a flat plan.

Xitoring vs New Relic

Full-stack observability without the enterprise tiers, ingestion fees, or seat-based licensing.

Xitoring vs Grafana Cloud

One tool with one price instead of stitching Prometheus, Loki, and Grafana into a stack you also have to monitor.

Frequently asked questions

What is Kafka monitoring?

Kafka monitoring is the continuous collection of broker, controller, network, and consumer-group performance data from JMX MBeans (`UnderReplicatedPartitions`, `OfflinePartitionsCount`, `ActiveControllerCount`, `IsrShrinksPerSec` / `IsrExpandsPerSec`, per-topic throughput, request latency, consumer lag), combined with alerting on threshold breaches. It's how teams catch broker failures, replication issues, and consumer slowdowns before they cause data loss or downstream microservice cascades.

How do I monitor Kafka consumer lag?

Two approaches: (1) `kafka-consumer-groups.sh --bootstrap-server <host:port> --describe --group <group>` returns per-partition `LAG` (log end offset minus the group's committed offset); (2) read committed offsets directly from the `__consumer_offsets` topic. Track `records_lag_max` across partitions per consumer group. Alert when lag exceeds your processing-time SLO (this varies by use case: it could be 10K messages for a CDC pipeline, 100 for a real-time enrichment service).
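If you script the first approach yourself, the CLI output can be reduced to a single max-lag number. The sketch below assumes the usual column layout of the `--describe` output (a header row containing `TOPIC` and `LAG`, whitespace-separated values, `-` where no offset is committed); the sample text and the function name are illustrative, so check them against what your Kafka version actually prints.

```python
def max_lag_from_describe(output: str):
    """Return the largest LAG value in `--describe` output, or None."""
    lag_idx = None
    lags = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Locate the LAG column from the header row rather than hard-coding it.
        if "TOPIC" in fields and "LAG" in fields:
            lag_idx = fields.index("LAG")
            continue
        if lag_idx is None or len(fields) <= lag_idx:
            continue
        # Rows without a committed offset show "-" and are skipped.
        if fields[lag_idx].isdigit():
            lags.append(int(fields[lag_idx]))
    return max(lags, default=None)


# Sample text written for this example, not captured from a real cluster.
sample = """
GROUP    TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST       CLIENT-ID
billing  orders  0          1500            1520            20   c-1          /10.0.0.5  app-1
billing  orders  1          900             1400            500  c-1          /10.0.0.5  app-1
billing  orders  2          -               300             -    -            -          -
"""

print(max_lag_from_describe(sample))  # 500
```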

What are under-replicated partitions and how do I fix them?

Under-replicated partitions have fewer ISRs (in-sync replicas) than the replication factor — usually because a broker is dead, lagging, or restarting. Run `kafka-topics.sh --describe --under-replicated-partitions` to list them. Fix the underlying broker, then ISR catches up automatically. If a broker is permanently lost, use `kafka-reassign-partitions.sh` to move replicas to surviving brokers. Sustained UnderReplicatedPartitions > 0 = page on-call.
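"Sustained" is the key word: a single non-zero sample during a rolling restart should not page anyone. One simple way to encode that is to require several consecutive non-zero samples. The function name and the five-sample window (five minutes at 1-minute collection) are assumptions for this sketch, not a Xitoring default.

```python
def sustained_nonzero(samples: list, min_consecutive: int = 5) -> bool:
    """True if the most recent `min_consecutive` samples are all above zero.

    `samples` holds UnderReplicatedPartitions readings, oldest first.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < min_consecutive:
        return False  # not enough history to call it sustained
    return all(value > 0 for value in samples[-min_consecutive:])


print(sustained_nonzero([0, 2, 2, 2, 2, 2]))  # True: five non-zero minutes in a row
print(sustained_nonzero([2, 2, 0, 2, 2]))     # False: recovered briefly mid-window
print(sustained_nonzero([3, 3]))              # False: too little history
```

The same pattern works for bounding ISR-shrink alerts during operator-driven rolling updates: widen the window until routine restarts no longer trip it.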

How do I monitor Kafka broker JMX metrics with Prometheus?

Standard pattern: deploy the Prometheus JMX exporter as a Java agent in each broker JVM (`-javaagent:jmx_prometheus_javaagent.jar=8080:kafka-config.yml`), then have Prometheus scrape the exporter's `/metrics` endpoint. The `kafka-config.yml` whitelists the MBean → metric mapping. Xitogent reads JMX directly without the exporter, but is compatible with environments that already run the exporter for Grafana dashboards.

What is KRaft mode and how does monitoring change without ZooKeeper?

KRaft (Kafka Raft) replaces ZooKeeper with an internal Raft-based controller quorum (production-ready since Kafka 3.3, only option in 4.0). Monitoring changes: there are no ZK ensemble metrics; instead watch the controller quorum (typically 3 or 5 nodes run as controllers), the `kafka.controller:type=KafkaController` MBeans, and `kafka.server:type=raft-metrics`. `ActiveControllerCount` semantics are the same (should be exactly 1 active leader).

How do I detect Kafka offline partitions?

`kafka.controller:type=KafkaController,name=OfflinePartitionsCount` — should be 0. Any non-zero value = data unavailable for both read and write. Combined with `UncleanLeaderElectionsPerSec > 0` = the cluster lost data by promoting an out-of-sync replica. List affected partitions via `kafka-topics.sh --describe --unavailable-partitions`.

How do I monitor a Kafka cluster on Kubernetes (Strimzi)?

The Strimzi operator exposes JMX on each broker at port `9999` once `jmxOptions` is set in the `Kafka` custom resource; it is not enabled by default. Install Xitogent on a K8s node with network access to broker pods, or deploy as a DaemonSet that connects to each pod's JMX endpoint via Service DNS. Watch operator-driven rolling-update events (visible as ISR shrinks) and bound the alert window so routine updates don't page on-call. Per-pod disk usage against the tiered-storage threshold matters here too.

Kafka vs Redpanda monitoring — what's different?

Redpanda is wire-compatible with the Kafka API and exposes equivalent metrics, but via Prometheus directly (no JMX — Redpanda is C++, not JVM). Same metric semantics (UnderReplicatedPartitions, ConsumerLag, etc.) but different transport. Xitogent works with both — JMX path for Apache Kafka / Confluent / Strimzi, Prometheus path for Redpanda. The cluster-health alert set is identical.

What Kafka versions are supported?

Kafka 3.x (tiered storage via KIP-405 became production-ready in 3.9), Kafka 4.0 (KRaft mandatory, no ZooKeeper; KIP-848 next-gen consumer rebalance protocol with server-side coordinator generally available), and Kafka 4.1. Older 2.x works but lacks production-ready KRaft and tiered storage. Confluent Platform 7.x / 8.x and Redpanda are compatible via the shared Kafka API.

Start monitoring Apache Kafka today

Set up in under 60 seconds. No credit card required. Full metrics from day one.
