Updated May 2026

InfluxDB Monitoring

Monitor InfluxDB write throughput, query duration, series cardinality, TSM compaction queue, WAL size, and HTTP error rate in real time — via the native `/metrics` Prometheus endpoint, `_internal` (1.x), or `_monitoring` bucket (2.x).

Why monitor InfluxDB?

InfluxDB is the time-series database behind Grafana dashboards, IoT pipelines, and any TICK-stack deployment. When series cardinality explodes, compactions back up, or queries slow down, dashboards stop loading and alerts go silent. Monitoring catches the failure mode before the gap in graphs makes it obvious.

Auto-discovery via Xitogent — zero manual configuration

Native `/metrics` Prometheus endpoint scraping (InfluxDB OSS 2.x default)

`_internal` database support (InfluxDB 1.x) and `_monitoring` bucket support (InfluxDB 2.x)

Write throughput (points/sec, request rate, errors, drops) per database/bucket

Query duration trending plus `queriesActive` for live load

TSM compaction queue + cache snapshot + WAL size tracking

Series cardinality monitoring (`database.numSeries`, `shard.seriesCreate`)

HTTP status code distribution (1xx/2xx/4xx/5xx) per endpoint

Customizable alert thresholds for every metric

1-minute metric collection intervals out of the box

What is InfluxDB monitoring?

InfluxDB monitoring, explained

InfluxDB monitoring catches write-throughput stalls, runaway series cardinality (the classic InfluxDB 1.x/2.x failure mode), TSM compaction backlogs, query slowdowns, and WAL growth before they cause ingest loss or query timeouts on your Grafana dashboards. For IoT sensor pipelines, application metrics backends, and any TICK-stack deployment, per-database visibility is what separates a 60-second alert from a multi-hour incident spent chasing missing data points. Xitoring auto-discovers your InfluxDB instance, reads the native `/metrics` Prometheus endpoint, and routes alerts to Slack, PagerDuty, Telegram, or your existing on-call tool.

Metrics

What we monitor

Write Points / sec

Rate of data points written, derived from `write.pointReq`. Spikes flag IoT/Telegraf storms; sudden drops flag broken collection pipelines.
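A points-per-second figure has to be derived from two readings of a cumulative counter. The sketch below shows one way to do that, assuming `write.pointReq` only ever increases except when the process restarts. The `CounterSample` type and `points_per_second` function are hypothetical names used for illustration, not part of Xitogent or InfluxDB.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of a cumulative counter such as write.pointReq."""
    unix_seconds: float
    value: int


def points_per_second(prev: CounterSample, curr: CounterSample) -> float:
    """Per-second rate between two readings of an increasing counter."""
    elapsed = curr.unix_seconds - prev.unix_seconds
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr.value - prev.value
    if delta < 0:
        # The counter went backwards, which usually means the process
        # restarted and counting began again from zero.
        delta = curr.value
    return delta / elapsed


earlier = CounterSample(unix_seconds=0.0, value=1_200_000)
later = CounterSample(unix_seconds=60.0, value=1_500_000)
print(points_per_second(earlier, later))  # 5000.0
```

Treating a backwards step as a restart slightly undercounts the interval in which the restart happened, which is usually acceptable for alerting.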

Write Errors & Drops

`write.writeReqErr` (write requests failing) and `write.writeDropped` (points dropped server-side, often from cardinality or schema errors). Any non-zero rate = ingest is losing data.

Query Duration / Active Queries

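`queryExecutor.queryDurationNs` (cumulative query execution time in nanoseconds; divide its increase by the number of queries run over the same window to get a mean) and `queriesActive` (concurrent queries in flight). Both spike together during dashboard refresh storms.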
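As a rough sketch of how a mean query time can be obtained, assume `queryDurationNs` is a running total and that a companion query counter (for example `queriesExecuted` in the 1.x `queryExecutor` measurement) is sampled at the same moments. The function name and parameters below are illustrative assumptions.

```python
from typing import Optional


def mean_query_ms(prev_duration_ns: int, prev_queries: int,
                  curr_duration_ns: int, curr_queries: int) -> Optional[float]:
    """Mean query time in milliseconds over the window between two samples.

    Returns None when no queries ran in the window, so callers do not
    mistake an idle server for a fast one.
    """
    queries = curr_queries - prev_queries
    if queries <= 0:
        return None
    return (curr_duration_ns - prev_duration_ns) / queries / 1_000_000


# 100 queries took 3 seconds in total over the window.
print(mean_query_ms(0, 0, 3_000_000_000, 100))  # 30.0
```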

Series Cardinality

`database.numSeries` per DB (1.x/2.x). The single most important InfluxDB health metric: runaway cardinality causes OOM, slow queries, and storage exhaustion. Alert when it rises more than 50% above your baseline.
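A baseline-relative cardinality check could look like the sketch below. Using the median of recent `numSeries` readings as the baseline is an assumption made here for illustration; the page does not say how a baseline should be computed, and `cardinality_alert` is a hypothetical helper.

```python
from statistics import median
from typing import Sequence, Tuple


def cardinality_alert(history: Sequence[int], current: int,
                      jump_ratio: float = 0.5) -> Tuple[bool, float]:
    """Return (should_alert, threshold) for a series-count reading.

    The baseline is the median of recent readings (an assumption), and
    the threshold sits jump_ratio above it, so 0.5 means baseline + 50%.
    """
    if not history:
        raise ValueError("need at least one historical reading")
    threshold = median(history) * (1 + jump_ratio)
    return current > threshold, threshold


print(cardinality_alert([10_000, 10_200, 9_800], 15_400))  # (True, 15000.0)
```

A median is less sensitive than a mean to a single earlier spike, which keeps one bad day from raising the threshold.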

TSM Compaction Queue

`storage.tsm1.compactions.*` Level 1/2/3/Full queue depths. Sustained non-zero queues mean writes are outpacing compactions — query performance degrades as small files accumulate.
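"Sustained non-zero" can be made concrete by requiring several consecutive non-zero samples before raising anything. The sketch below assumes one queue-depth sample per collection interval; `sustained_backlog` and its default of five samples are illustrative choices, not product behavior.

```python
from typing import Sequence


def sustained_backlog(queue_depths: Sequence[int],
                      min_consecutive: int = 5) -> bool:
    """True when the most recent min_consecutive samples are all non-zero."""
    if len(queue_depths) < min_consecutive:
        return False
    return all(depth > 0 for depth in queue_depths[-min_consecutive:])


print(sustained_backlog([0, 2, 3, 1, 4, 2]))  # True
print(sustained_backlog([3, 0, 1, 1, 1, 1]))  # False
```

Brief non-zero readings are normal while a compaction is waiting its turn, so requiring a run of samples avoids alerting on routine work.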

TSM Cache

`storage.tsm1.cache.cachedBytes` (in-memory write buffer), `snapshotCount` (pending flushes), `WALCompactionTimeMs`. Cache growth without snapshot drain = compactor falling behind.

WAL Disk Bytes

`storage.tsm1.wal.currentSegmentDiskBytes` + `oldSegmentsDiskBytes`. WAL growth without TSM consolidation means recovery time will balloon on restart.

Storage Size on Disk

`storage.tsm1.filestore.diskBytes` + numFiles per shard. Track against your retention policy — high file counts at the same data size flag fragmentation.

HTTP 4xx / 5xx Rate

`httpd.clientError` + `httpd.serverError` (or Prometheus `http_api_request_errors_total`). 4xx spikes flag client schema/auth bugs; 5xx flags server-side failures.
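When the numbers come from the Prometheus endpoint, they arrive as text lines that must be grouped by a label before an error share can be computed. The sketch below is a simplified parser that assumes label values contain no escaped quotes. The metric name and the `path` and `status` labels in the sample payload are illustrative assumptions; check your own `/metrics` output for the real names.

```python
import re
from collections import Counter

# Matches "name{labels} value" or "name value"; a trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def sum_by_label(exposition: str, metric: str, label: str) -> Counter:
    """Sum the values of one metric, grouped by the value of one label."""
    totals = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks plus HELP/TYPE comment lines
        match = _SAMPLE.match(line)
        if match is None or match.group(1) != metric:
            continue
        labels = dict(_LABEL.findall(match.group(2) or ""))
        totals[labels.get(label, "")] += float(match.group(3))
    return totals


# Hypothetical payload for illustration only.
SAMPLE = """\
# TYPE http_api_requests_total counter
http_api_requests_total{path="/api/v2/write",status="2XX"} 980
http_api_requests_total{path="/api/v2/write",status="4XX"} 15
http_api_requests_total{path="/api/v2/query",status="5XX"} 5
"""

by_status = sum_by_label(SAMPLE, "http_api_requests_total", "status")
errors = by_status["4XX"] + by_status["5XX"]
print(errors / sum(by_status.values()))  # 0.02
```

Because these are counters, a real check would apply the same grouping to two scrapes and compare the differences rather than the raw totals.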

Connections / Auth Failures

`httpd.req` (total HTTP requests), `httpd.authFail` (failed auth attempts), `httpd.pingReq`. Auth failure spikes signal misconfigured Telegraf or credential rotation gone wrong.

Runtime — Goroutines & GC

Go runtime stats: `runtime.NumGoroutine` (goroutine leak detection), `runtime.HeapAlloc` (live heap), `runtime.NumGC`/`PauseTotalNs` (GC pressure). Catch leaks and pause-time regressions before OOM.

Subscription Writes

`subscriber.pointsWritten` and `subscriber.writeFailures` — when Kapacitor or downstream pipelines consume via subscriptions, this is how you catch their backpressure.

Triggers & Alerts

Configurable alert triggers

Set up custom triggers in your dashboard to get notified the moment InfluxDB metrics cross your defined thresholds.

Write Throughput

warning

Fires on write rate anomalies.

Query Duration

warning

Alerts on slow queries.

Series Cardinality

critical

Triggers when series cardinality exceeds your configured threshold.

Storage Size

critical

Fires when on-disk storage size exceeds your configured threshold.

Importance of InfluxDB Monitoring

InfluxDB handles high-velocity time-series data. High cardinality, write pressure, and compaction delays can degrade performance.

Track write throughput for ingestion health
Monitor series cardinality to prevent OOM
Detect slow queries early
Ensure compaction keeps up

Why Choose Xitoring

Zero-config InfluxDB monitoring.

One-command install
Global nodes
Unified dashboard
Multi-channel alerts
Historical retention

Use cases

Common InfluxDB monitoring scenarios

Where InfluxDB typically runs today — and what could go wrong if no one's watching.

The database behind your team's dashboards

When dashboards in Grafana or another tool feel slow, the cause is often the database underneath — not the dashboard itself. We surface where the slowness actually lives so the team fixes the right thing instead of chasing the symptom.

Data flowing in from sensors and devices

Connected devices, factory equipment, and IoT sensors send measurements every second of every day. A silent backup in the pipeline means lost data — and lost data is gone forever. We watch the flow end to end so a single dropped reading raises the alarm.

App and infrastructure metrics in one place

When the same database holds both app metrics and server metrics, a problem with the database hides every signal at once. We watch the database itself so the team's own monitoring never goes dark during an incident.

Before you start

Prerequisites for InfluxDB

Make sure you've got these in place — most installs are a 60-second job once they are.

InfluxDB 1.x, 2.x, or 3.0 (FDAP) running on the server
InfluxDB HTTP port reachable from Xitogent (default 8086, or 8181 on InfluxDB 3.0 Core)
Optional: a token if InfluxDB 2.x/3.0 authentication is enabled. A read-only token is sufficient; operator and all-access tokens also work for `/metrics`.

Setup Guide

Get started in minutes

Install Xitogent on your InfluxDB host

Install the lightweight Xitogent monitoring agent on the host running InfluxDB.

curl -s https://xitoring.com/install.sh | sudo bash -s -- --key=YOUR_API_KEY

Confirm InfluxDB is reachable

Verify InfluxDB is listening on its HTTP port (default 8086) and reachable from the host running Xitogent. Xitogent prompts for the host and port when you run `xitogent integrate`; no extra config edits or endpoint exposure are required.

sudo xitogent integrate

Enable the InfluxDB integration

Use the Xitoring dashboard or CLI to enable the InfluxDB integration. Xitogent auto-detects your InfluxDB version and starts collecting write, query, and storage metrics.

Configure alert thresholds (optional)

Set custom thresholds for Write Throughput, Query Duration, or Series Cardinality to catch ingest pressure and runaway tag growth before queries slow down.

Verify it's working

Run this command on the server to confirm Xitogent picked up the integration. Fresh metrics will start streaming to your dashboard within ~30 seconds.

sudo xitogent status
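If the reachability step in the guide above needs to be checked by hand, a small script can probe the HTTP port from the Xitogent host. This sketch assumes the `/ping` endpoint that InfluxDB 1.x and 2.x answer with an empty success response; whether your version and any proxy in front of it behave the same way is worth confirming. `ping_url` and `is_reachable` are hypothetical helper names.

```python
import urllib.request


def ping_url(host: str, port: int = 8086, scheme: str = "http") -> str:
    """Build the URL of the /ping endpoint for a given host and port."""
    return f"{scheme}://{host}:{port}/ping"


def is_reachable(host: str, port: int = 8086, timeout: float = 3.0) -> bool:
    """True when the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(ping_url(host, port), timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Refused connections, timeouts, DNS failures, and HTTP error
        # statuses all surface as OSError subclasses in urllib.
        return False


print(ping_url("localhost"))  # http://localhost:8086/ping
```

Calling `is_reachable("localhost")` on the InfluxDB host, or with the server's address from the Xitogent host, gives a quick yes or no before the integration is enabled.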

Compare

Considering alternatives?

See how Xitoring stacks up against the alternatives for InfluxDB monitoring — flat pricing, deeper integrations, and one agent that covers your whole stack.

Xitoring vs

Datadog

Pay-per-host pricing gets expensive fast at scale. See where Xitoring delivers the same coverage on a flat plan.

Xitoring vs

New Relic

Full-stack observability without the enterprise tiers, ingestion fees, or seat-based licensing.

Xitoring vs

Grafana Cloud

One tool with one price instead of stitching Prometheus, Loki, and Grafana into a stack you also have to monitor.

Frequently asked questions

What is InfluxDB monitoring?

InfluxDB monitoring is the continuous collection of InfluxDB performance data — write throughput (points/sec, errors, drops), query duration and active query count, TSM compaction state, WAL size, series cardinality, HTTP status code distribution, and Go runtime stats — combined with alerting when those metrics breach thresholds. The data comes from the native `/metrics` Prometheus endpoint, the `_internal` database (1.x), or the `_monitoring` bucket (2.x).

How do I monitor InfluxDB write throughput?

Track `write.pointReq` (or Prometheus `storage_writer_points_total`) for the rate of points written, plus `write.writeReqErr` (failed write requests) and `write.writeDropped` (points dropped server-side, usually from schema or cardinality errors). Healthy ingest = pointReq rate stable with near-zero errors. A sudden drop in pointReq with high writeReqErr means clients are submitting bad schema; high writeDropped with a normal request rate means the storage engine is rejecting points.
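The reasoning above maps onto a small decision function. In this sketch all inputs are per-second rates already derived from the counters, and the `drop_fraction` cutoff that defines a "sudden drop" is an assumption chosen for illustration. `diagnose_ingest` is a hypothetical helper, not Xitoring's alert logic.

```python
def diagnose_ingest(point_rate: float, baseline_rate: float,
                    error_rate: float, dropped_rate: float,
                    drop_fraction: float = 0.5) -> str:
    """Classify ingest health from write rate, error rate, and drop rate."""
    rate_fell = point_rate < baseline_rate * drop_fraction
    if rate_fell and error_rate > 0:
        return "clients are likely sending rejected writes"
    if not rate_fell and dropped_rate > 0:
        return "storage engine is likely dropping points"
    if error_rate > 0 or dropped_rate > 0:
        return "ingest is losing data, cause unclear"
    if rate_fell:
        return "write rate fell without errors, check upstream collectors"
    return "healthy"


print(diagnose_ingest(5000, 5000, 0, 0))    # healthy
print(diagnose_ingest(1000, 5000, 40, 0))   # clients are likely sending rejected writes
print(diagnose_ingest(4900, 5000, 0, 12))   # storage engine is likely dropping points
```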

How do I detect InfluxDB cardinality issues?

Series cardinality is THE InfluxDB 1.x/2.x failure mode: too many unique tag combinations cause OOM and slow queries. Run `SHOW SERIES CARDINALITY` (InfluxQL) or `import "influxdata/influxdb"` followed by `influxdb.cardinality(...)` (Flux), or read `database.numSeries` from `_internal`. Alert on any 50%+ jump from baseline. That is almost always a tag-value explosion from an unintended high-cardinality tag (request IDs, timestamps, or user IDs stored as tags).
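The arithmetic behind a tag-value explosion is worth seeing once. For a single measurement, the number of series is bounded above by the product of the distinct values of each tag, so one unbounded tag multiplies everything else. The tag names and counts below are made-up examples.

```python
from math import prod
from typing import Mapping


def max_series(distinct_values_per_tag: Mapping[str, int]) -> int:
    """Upper bound on series in one measurement: product of tag value counts."""
    return prod(distinct_values_per_tag.values())


tags = {"host": 50, "region": 4}
print(max_series(tags))  # 200

# Adding one request-ID tag with 100,000 distinct values:
tags["request_id"] = 100_000
print(max_series(tags))  # 20000000
```

The real count is the number of combinations that actually occur, which is often lower, but the bound shows why identifiers belong in fields rather than tags.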

What is the _internal database in InfluxDB?

`_internal` is the special database InfluxDB 1.x writes its own metrics to — same TSM storage as user data, queryable via `USE _internal` + `SHOW MEASUREMENTS`. Contains measurements like `write`, `queryExecutor`, `tsm1_engine`, `tsm1_cache`, `tsm1_wal`, `httpd`, `runtime`, `database`, `shard`. In InfluxDB 2.x, this moved to the `_monitoring` bucket (set up by the Monitoring template). In 3.0, the `/metrics` Prometheus endpoint is the canonical surface.
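To list what `_internal` contains programmatically, the JSON returned by the 1.x `/query` endpoint for `SHOW MEASUREMENTS` can be flattened into a list of names. The sketch below only parses a response body; the sample payload is abbreviated for illustration and `measurement_names` is a hypothetical helper.

```python
import json
from typing import List


def measurement_names(response_body: str) -> List[str]:
    """Extract measurement names from a 1.x SHOW MEASUREMENTS JSON response."""
    payload = json.loads(response_body)
    names = []
    for result in payload.get("results", []):
        # "series" is absent when the database has no measurements.
        for series in result.get("series", []):
            for row in series.get("values", []):
                names.append(row[0])
    return names


# Abbreviated sample response for illustration.
SAMPLE_BODY = (
    '{"results":[{"statement_id":0,"series":[{"name":"measurements",'
    '"columns":["name"],"values":[["database"],["httpd"],["write"]]}]}]}'
)
print(measurement_names(SAMPLE_BODY))  # ['database', 'httpd', 'write']
```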

How do I monitor InfluxDB compactions?

TSM compactions merge the small TSM files produced by cache snapshots into larger, optimized files at three levels (L1/L2/L3), plus full compactions. Watch `storage.tsm1.compactions.cacheCompactionDuration`, `tsmLevel{1,2,3}CompactionQueue` (queue depth; non-zero means backlog), and `tsmLevel{1,2,3}CompactionDuration`. A growing queue at a normal write rate means the compactor is falling behind and query degradation is imminent. Either scale up the host or reduce the write rate.

What's the difference between InfluxDB 1.x, 2.x, and 3.0 monitoring?

1.x uses InfluxQL + TICK stack, exposes `/debug/vars` and the `_internal` database, runs the TSM/TSI storage engine. 2.x uses Flux + tasks, exposes `/metrics` (Prometheus) and the `_monitoring` bucket, same TSM/TSI underneath. 3.0 is the new FDAP architecture — DataFusion query engine, Parquet storage on object stores, removed the cardinality limit entirely, supports SQL alongside InfluxQL (Flux is in maintenance mode). Xitogent auto-detects the version and adapts.

How do I detect InfluxDB query slowness?

Track `queryExecutor.queryDurationNs` (cumulative query time, from which a mean per query is derived) and `queriesActive` (concurrent in-flight queries). Spikes during dashboard refreshes are expected; sustained growth in the mean shows queries are getting slower, often because of high cardinality or a compaction backlog. On 1.x, enable the slow query log (`log-queries-after = '5s'` in the `[coordinator]` section of `influxdb.conf`) to capture specific offenders for investigation.

How do I monitor InfluxDB TSM storage?

TSM (Time-Structured Merge tree) is the on-disk storage engine for 1.x/2.x. Monitor `storage.tsm1.filestore.diskBytes` (total on-disk size) and `numFiles` (file count — high numbers at the same bytes = fragmentation). Pair with `storage.tsm1.cache.cachedBytes` (in-memory write buffer) and WAL size. Sustained WAL growth without TSM consolidation = compactor problem; ballooning numFiles = retention/compaction not keeping up with writes.
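The "high file count at the same bytes" signal can be checked by comparing two readings of `diskBytes` and `numFiles`. In the sketch below, the tolerance for "same size" and the growth ratio that counts as "ballooning" are assumptions picked for illustration, and `fragmentation_signal` is a hypothetical helper.

```python
def fragmentation_signal(prev_bytes: float, prev_files: int,
                         curr_bytes: float, curr_files: int,
                         size_tolerance: float = 0.1,
                         file_growth: float = 0.5) -> bool:
    """True when file count grew sharply while on-disk size stayed about flat."""
    if prev_bytes <= 0 or prev_files <= 0:
        raise ValueError("previous reading must be positive")
    size_change = abs(curr_bytes - prev_bytes) / prev_bytes
    file_change = (curr_files - prev_files) / prev_files
    return size_change <= size_tolerance and file_change >= file_growth


# Size up 2%, file count up 75%: likely fragmentation.
print(fragmentation_signal(100e9, 400, 102e9, 700))  # True
# Size up 80% alongside the files: ordinary data growth.
print(fragmentation_signal(100e9, 400, 180e9, 700))  # False
```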

Will this integration affect InfluxDB performance?

No measurable impact. Xitogent reads from the native `/metrics` Prometheus endpoint (or `_internal` / `_monitoring` query views) on a 1-minute interval — the same lightweight mechanism InfluxData's own tools use. No instrumentation injected into the write path or query engine.

Start monitoring InfluxDB today

Set up in under 60 seconds. No credit card required. Full metrics from day one.
