
Updated May 2026

Disk Health Monitoring

Monitor SMART attributes (Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable, Temperature, UDMA_CRC_Error_Count) plus NVMe percentage_used, available_spare, and critical_warning in real time.


Why monitor Disk Health?

Disk failures are a leading cause of hardware-related data loss, and SMART data often flags them days or weeks ahead. Reallocated sectors creep up, NVMe spare capacity drops, temperatures spike. Monitoring those signals is the highest-ROI alert you can set: catch the failing drive while you still have time to migrate data and swap it without downtime.

SMART attribute monitoring (HDD and SSD) via smartctl

NVMe-cli health log integration (`available_spare`, `percentage_used`, `critical_warning`)

Backblaze 5 failure-predicting attributes pre-configured

RAID controller passthrough support (most LSI/Broadcom MegaRAID, HP Smart Array, Adaptec)

Temperature tracking with per-drive thresholds

Self-test history monitoring (short/long/conveyance results)

SSD wear-out + endurance tracking (`Total_LBAs_Written`, `Wear_Leveling_Count`)

Per-drive alert thresholds (warn vs critical)

Native Linux and Windows agent support

1-minute metric collection intervals out of the box

What is disk health monitoring?

Disk health monitoring, explained

Disk health monitoring catches reallocated-sector growth, NVMe wear-out, temperature spikes, and impending failure indicators days or weeks before drives die — long enough to migrate data and swap the drive without downtime. For database servers, backup hosts, and any workload where drive failure means data loss, SMART monitoring is the single highest-ROI alert you can set. Xitoring runs smartctl + nvme-cli locally and routes alerts to Slack, PagerDuty, Telegram, or your existing on-call.

Metrics

What we monitor

SMART Overall Health (PASS / FAIL)

`smartctl -H` overall assessment. PASS = no pre-fail attributes below threshold; FAIL = drive replacement urgent. The single most actionable disk health signal — alert immediately on any FAIL.

Reallocated_Sector_Ct (SMART 5)

Sectors the drive has remapped to spare area after read/write failures. > 0 = warning (drive is starting to fail); > 10 or any rapid increase = critical. The first Backblaze 5 attribute.
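The thresholds above can be sketched as a small classifier. This is a minimal illustration, not Xitoring's implementation: the function name and the `rapid_increase` default of 5 sectors between two readings are assumptions, since the text does not define what counts as a rapid increase.

```python
from typing import Optional


def classify_reallocated(raw_count: int,
                         previous_count: Optional[int] = None,
                         rapid_increase: int = 5) -> str:
    """Map a Reallocated_Sector_Ct raw value to 'ok', 'warning' or 'critical'."""
    # More than 10 remapped sectors is treated as critical.
    if raw_count > 10:
        return "critical"
    # A jump between two readings is also critical (threshold is an assumption).
    if previous_count is not None and raw_count - previous_count >= rapid_increase:
        return "critical"
    # Any non-zero count is a warning.
    if raw_count > 0:
        return "warning"
    return "ok"
```

Passing the previous reading lets the same function cover both the absolute threshold and the growth check.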

Current_Pending_Sector (SMART 197)

Sectors that failed a read and are waiting to be remapped on next write. Any non-zero value is a hard warning — the drive has detected unreadable data. Backblaze 5.

Offline_Uncorrectable (SMART 198)

Sectors permanently unreadable. Any non-zero value = data loss has occurred. Critical alert. Backblaze 5.

Reported_Uncorrectable_Errors (SMART 187)

ECC errors the drive's internal correction couldn't fix. Strong failure predictor in the Backblaze studies — alert on any growth.

Command_Timeout (SMART 188)

Commands that exceeded the SATA bus timeout. Rising values usually mean cable, controller, or drive electronics are failing. Backblaze 5.

Temperature_Celsius (SMART 194)

Current drive temperature. HDDs degrade above 50°C; consumer SSDs throttle above 70°C. Alert at vendor-rated maximum minus 10°C for early warning.

UDMA_CRC_Error_Count (SMART 199)

Cable-related CRC errors on the SATA/SAS interface. Rising values flag a bad cable or loose connection — easy fix that's often misdiagnosed as drive failure.

SSD Wear (Wear_Leveling_Count + Total_LBAs_Written)

SSD endurance tracking. The normalised value of `Wear_Leveling_Count` indicates remaining life; `Total_LBAs_Written` combined with the drive's rated TBW gives the current wear percentage. Alert at 80% used.
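As a worked example of the wear calculation, the sketch below converts `Total_LBAs_Written` into a percentage of rated endurance. It assumes each LBA is 512 bytes and that TBW is quoted in decimal terabytes; some vendors report this attribute in other units, so check the drive's datasheet before relying on the result. The function name is hypothetical.

```python
def wear_percent(total_lbas_written: int, rated_tbw: float,
                 lba_size_bytes: int = 512) -> float:
    """Percentage of rated endurance consumed, from Total_LBAs_Written."""
    # Assumption: one LBA is lba_size_bytes bytes (512 by default).
    bytes_written = total_lbas_written * lba_size_bytes
    # Rated TBW is taken as decimal terabytes (1 TB = 10**12 bytes).
    return 100.0 * bytes_written / (rated_tbw * 1e12)


# Example: 468,750,000,000 LBAs x 512 bytes = 240 TB written on a 300 TBW drive.
used = wear_percent(468_750_000_000, rated_tbw=300)
should_alert = used >= 80  # the 80% alert level mentioned above
```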

NVMe percentage_used

From `nvme smart-log` — vendor's estimate of life consumed (0–100%, can exceed 100% on worn drives). Warn above 80%; critical above 95%.

NVMe available_spare

Percentage of spare capacity remaining for bad-block replacement. Warn below 10%; critical below 5%. The drive also reports its own `available_spare_threshold`; when spare capacity falls below it, the spare bit in `critical_warning` is set.
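The warn and critical levels quoted for `percentage_used` and `available_spare` can be expressed as two small functions. This is a sketch using the default levels from the text; the function names are illustrative, and per-drive thresholds would replace the constants.

```python
def classify_percentage_used(pct: int) -> str:
    """NVMe percentage_used: warn above 80, critical above 95."""
    if pct > 95:
        return "critical"
    if pct > 80:
        return "warning"
    return "ok"


def classify_available_spare(pct: int) -> str:
    """NVMe available_spare: warn below 10, critical below 5."""
    if pct < 5:
        return "critical"
    if pct < 10:
        return "warning"
    return "ok"
```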

NVMe critical_warning

Bitfield from `nvme smart-log` flagging: spare below threshold, temperature above threshold, device reliability degraded, read-only mode, volatile memory backup failed. Any non-zero value = immediate alert.
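A bitfield like this is usually decoded by testing each bit in turn. The sketch below assumes the bit positions follow the order listed above (bit 0 for spare below threshold through bit 4 for volatile memory backup failed), which matches the NVMe specification as far as we are aware; the function and table names are illustrative.

```python
from typing import List

# Assumed bit positions, in the order the flags are listed above.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature above threshold",
    2: "device reliability degraded",
    3: "media in read-only mode",
    4: "volatile memory backup failed",
}


def decode_critical_warning(value: int) -> List[str]:
    """Return the labels of every flag set in a critical_warning value."""
    return [label for bit, label in CRITICAL_WARNING_BITS.items()
            if value & (1 << bit)]
```

An empty list means no flags are set; any non-empty list corresponds to the immediate-alert case.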

Triggers & Alerts

Configurable alert triggers

Set up custom triggers in your dashboard to get notified the moment Disk Health metrics cross your defined thresholds.


SMART Health Status (critical): Fires when SMART reports a failing health status.
Reallocated Sectors (critical): Alerts when reallocated sector count exceeds threshold.
Disk Temperature (warning): Triggers when disk temperature exceeds safe operating range.
Pending Sectors (warning): Fires when pending sector count indicates potential failure.

Importance of Disk Health Monitoring

Disk failures can result in data loss and costly downtime. SMART monitoring provides early warning signs — from rising temperatures and increasing reallocated sectors to read error spikes — so you can act before a drive fails.

Prevent data loss with early failure detection
Optimize performance by identifying bottlenecks
Plan capacity with historical trend analysis
Maintain compliance with data integrity monitoring

Why Choose Xitoring

Xitoring provides zero-config disk health monitoring with SMART integration for all disk types. Get real-time alerts, historical trends, and predictive failure indicators in a unified dashboard.

Supports SSDs, HDDs, and RAID arrays
One-command setup on Linux & Windows
Customizable SMART attribute thresholds
Multi-channel alerting for critical disk events

Use cases

Common disk health monitoring scenarios

Where disk monitoring most often catches drive failures before they cause real damage.

Database servers

A failed drive in a database can mean downtime, lost orders, or in the worst case, corrupted data. We watch every drive for the early warning signs of failure so the team can swap a struggling disk on their own schedule — not in the middle of an outage at 3 AM.

Backup and archive servers

The unique problem with backup drives is that a failure stays invisible until the day you actually need the backup — by then it's too late. We test each drive on a schedule and surface wear early so you never reach for a backup that isn't there.

Servers that write a lot of data (SSDs)

SSDs have a limited number of writes before they wear out, and busy databases and data-heavy apps burn through them faster than most teams realize. We track wear in plain percentages so drives get replaced on time — not after a sudden, unrecoverable failure.

Before you start

Prerequisites for Disk Health

Make sure you've got these in place — most installs are a 60-second job once they are.

Linux server (Debian/Ubuntu, RHEL/CentOS/Alma/Rocky, or compatible) — OR Windows Server
Linux: smartmontools (apt install smartmontools / dnf install smartmontools) + nvme-cli (for NVMe drives) + lsblk
sudo / root access — SMART data requires elevated permissions

Setup Guide

Get started in minutes

Install prerequisites (Linux)

Install smartmontools to enable SMART data collection, plus nvme-cli if the server has NVMe drives. Ensure lsblk is available on your system.

# Ubuntu/Debian
sudo apt-get install smartmontools nvme-cli

# CentOS/RHEL
sudo yum install smartmontools nvme-cli

Enable Disk Health integration

Run the integrate command and select Disk Health. Xitogent will auto-detect your disks and start collecting SMART data. No prerequisites needed on Windows.

xitogent integrate

Verify it's working

Run this command on the server to confirm Xitogent picked up the integration. Fresh metrics will start streaming to your dashboard within ~30 seconds.

sudo xitogent status

Compare

Considering alternatives?

See how Xitoring stacks up against the alternatives for Disk Health monitoring — flat pricing, deeper integrations, and one agent that covers your whole stack.

Xitoring vs Datadog

Pay-per-host pricing gets expensive fast at scale. See where Xitoring delivers the same coverage on a flat plan.

Xitoring vs New Relic

Full-stack observability without the enterprise tiers, ingestion fees, or seat-based licensing.

Xitoring vs Grafana Cloud

One tool with one price instead of stitching Prometheus, Loki, and Grafana into a stack you also have to monitor.


Frequently asked questions

What is SMART monitoring?

SMART (Self-Monitoring, Analysis, and Reporting Technology) is the standard built into virtually every modern SATA, SAS, and NVMe drive for tracking internal health attributes — reallocated sectors, pending sectors, uncorrectable errors, temperature, power-on hours, wear levels. SMART monitoring is the practice of reading those attributes and alerting when they cross failure-predictive thresholds. Xitoring uses `smartctl` (smartmontools) on Linux and equivalent Windows APIs.

How do I check disk health on Linux?

Install smartmontools (`apt install smartmontools`), then run `sudo smartctl -a /dev/sda` for the full attribute output, `sudo smartctl -H /dev/sda` for the overall PASS/FAIL result, and `sudo smartctl -t short /dev/sda` to start a quick self-test (results are available after about 2 minutes via `sudo smartctl -l selftest /dev/sda`). For NVMe drives, run `sudo nvme smart-log /dev/nvme0n1`. Xitogent automates all of this and trends the attributes over time.
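For scripting the overall health check, smartmontools 7.0 and later can emit JSON via `--json`. The sketch below is one possible approach, not how Xitogent works internally: it assumes the JSON report carries a `smart_status.passed` boolean, and the function names are illustrative.

```python
import json
import subprocess
from typing import Optional


def parse_overall_health(smartctl_json: str) -> Optional[bool]:
    """Return True for PASS, False for FAIL, None if no status was reported."""
    report = json.loads(smartctl_json)
    status = report.get("smart_status")
    if status is None:
        return None
    return bool(status.get("passed"))


def read_overall_health(device: str) -> Optional[bool]:
    """Run `smartctl --json -H <device>` (needs root) and parse the result."""
    # smartctl uses a bit-mask exit status, so a non-zero code is not
    # treated as an error here; the JSON body is parsed either way.
    result = subprocess.run(
        ["smartctl", "--json", "-H", device],
        capture_output=True, text=True, check=False,
    )
    return parse_overall_health(result.stdout)
```

Calling `read_overall_health("/dev/sda")` as root would return the PASS/FAIL result as a boolean.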

What does Reallocated_Sector_Ct mean?

Reallocated_Sector_Ct (SMART attribute ID 5) is the number of sectors the drive has remapped to spare area after detecting read/write failures. The raw value is the count of actual remapped sectors. > 0 = the drive has started failing and should be on a replacement watchlist; > 10 or any rapid increase = critical, replace soon. It's one of the "Backblaze 5" attributes that most strongly predict imminent failure.

How do I monitor SSD wear-out?

Two attributes matter most: `Wear_Leveling_Count` (SMART 177) normalised value goes from 100 down to 0 as the drive wears (lower = more wear), and `Total_LBAs_Written` (SMART 241) lets you compute total TBs written against the drive's rated endurance (TBW). On NVMe, `percentage_used` is the vendor's direct lifetime estimate. Replace drives before they hit 80% used to avoid the sudden read-only mode some drives enter at end-of-life.

What SMART attributes predict drive failure?

Backblaze's widely cited Drive Stats analysis, drawn from its own large production drive fleet, identifies five attributes as the strongest failure predictors: SMART 5 (Reallocated_Sector_Ct), SMART 187 (Reported_Uncorrectable_Errors), SMART 188 (Command_Timeout), SMART 197 (Current_Pending_Sector), and SMART 198 (Offline_Uncorrectable). Any non-zero raw value on 187/197/198 warrants attention; rapid growth on any of the five points to imminent failure.
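A check for these five attributes can be sketched against smartctl's JSON output (`smartctl --json -A`). The code assumes the report lists ATA attributes under `ata_smart_attributes.table`, each with an `id` and a `raw.value`; the sample report is made-up data for illustration, and the function name is hypothetical.

```python
from typing import Dict

# SMART IDs of the five failure-predicting attributes named above.
BACKBLAZE_FIVE = {5, 187, 188, 197, 198}


def backblaze_five_nonzero(report: dict) -> Dict[int, int]:
    """Return {attribute_id: raw_value} for any of the five with a non-zero raw value."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return {
        attr["id"]: attr["raw"]["value"]
        for attr in table
        if attr["id"] in BACKBLAZE_FIVE and attr["raw"]["value"] > 0
    }


# Made-up sample in the assumed shape of smartctl's JSON report.
sample = {"ata_smart_attributes": {"table": [
    {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 8}},
    {"id": 9, "name": "Power_On_Hours", "raw": {"value": 12000}},
    {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 0}},
]}}
flagged = backblaze_five_nonzero(sample)
```

Comparing the returned values between two collection runs gives the growth signal described above.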

How do I monitor NVMe drive health?

Use `nvme-cli`: `sudo nvme smart-log /dev/nvme0n1` returns `critical_warning` (a bitfield; any non-zero value is an immediate alert), `temperature`, `available_spare` (warn < 10%, critical < 5%), `percentage_used` (warn > 80%), `data_units_written` (each unit is 1,000 × 512 bytes = 512,000 bytes; multiply to get total bytes written), `media_errors`, `num_err_log_entries`, and `unsafe_shutdowns`. Xitogent reads all of them and trends them over time.
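Converting `data_units_written` into bytes is simple arithmetic: NVMe reports the counter in units of 1,000 blocks of 512 bytes. A minimal sketch, with illustrative function names:

```python
def data_units_to_bytes(data_units: int) -> int:
    """Convert an NVMe data_units_written count to bytes."""
    # One data unit = 1,000 x 512 bytes = 512,000 bytes.
    return data_units * 1000 * 512


def data_units_to_terabytes(data_units: int) -> float:
    """Convert an NVMe data_units_written count to decimal terabytes."""
    return data_units_to_bytes(data_units) / 1e12
```

For example, 1,953,125 data units correspond to exactly 1 TB written, which can then be compared with the drive's rated TBW.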

How do I monitor disk health on Windows?

Windows exposes SMART via WMI (`Win32_DiskDrive`, `MSStorageDriver_FailurePredictStatus`) and via the Storage PowerShell module (`Get-PhysicalDisk`, `Get-StorageReliabilityCounter`). The Xitogent Windows agent reads both and translates them to the same metric set as on Linux. is a popular GUI for ad-hoc inspection.

How often should I run smartctl self-tests?

Modern cadence: weekly short tests (~2 minutes, non-disruptive) + monthly long tests (~hours, slight performance impact, run during low-IO windows). Configure via `smartd.conf` for automatic scheduling. Xitogent picks up self-test results from `smartctl -l selftest` and surfaces failures even if the drive's overall SMART health still reports PASS.

Does this work with RAID arrays?

Yes for most modern RAID controllers (LSI/Broadcom megaraid, HP Smart Array, Adaptec) — they pass through SMART data per-drive. Use `smartctl -d megaraid,N /dev/sda` style syntax (Xitogent auto-detects). For software RAID (Linux mdraid, ZFS), each underlying drive is independently monitorable. For RAID-on-chip enclosures that hide SMART, you're stuck with controller-level health only.
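To query each physical drive behind a MegaRAID controller manually, the `-d megaraid,N` option is repeated once per drive ID. The sketch below only builds the argument lists; the drive IDs are assumptions you would take from the controller's own tooling, and the function name is illustrative.

```python
from typing import Iterable, List


def megaraid_smartctl_args(device: str, drive_ids: Iterable[int]) -> List[List[str]]:
    """Build one `smartctl -a -d megaraid,N <device>` command per drive ID."""
    return [
        ["smartctl", "-a", "-d", f"megaraid,{drive_id}", device]
        for drive_id in drive_ids
    ]


# Hypothetical controller exposing physical drives 0 and 1 behind /dev/sda.
commands = megaraid_smartctl_args("/dev/sda", [0, 1])
```

Each list can be passed to `subprocess.run` as root; drive IDs are not guaranteed to be contiguous, which is why they are passed in explicitly.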

Start monitoring Disk Health today

Set up in under 60 seconds. No credit card required. Full metrics from day one.

