Containers & System Health
    Updated May 2026

    Disk Health Monitoring

    Monitor SMART attributes (Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable, Temperature_Celsius, UDMA_CRC_Error_Count) plus NVMe percentage_used, available_spare, and critical_warning in real time.

    Why monitor Disk Health?

    Disk failures are among the most common hardware causes of data loss, and SMART data often flags them days or weeks ahead. Reallocated sectors creep up, NVMe spare capacity drops, temperatures spike. Monitoring those signals is among the highest-ROI alerts you can set: catch the failing drive while you still have time to migrate data and swap it without downtime.

    SMART attribute monitoring (HDD and SSD) via smartctl
    NVMe-cli health log integration (`available_spare`, `percentage_used`, `critical_warning`)
    Backblaze 5 failure-predicting attributes pre-configured
    RAID controller passthrough support (most LSI/Broadcom MegaRAID, HP Smart Array, Adaptec)
    Temperature tracking with per-drive thresholds
    Self-test history monitoring (short/long/conveyance results)
    SSD wear-out + endurance tracking (`Total_LBAs_Written`, `Wear_Leveling_Count`)
    Per-drive alert thresholds (warn vs critical)
    Native Linux and Windows agent support
    1-minute metric collection intervals out of the box
    What is disk health monitoring?

    Disk health monitoring, explained

    Disk health monitoring catches reallocated-sector growth, NVMe wear-out, temperature spikes, and impending-failure indicators, often days or weeks before drives die, which is long enough to migrate data and swap the drive without downtime. For database servers, backup hosts, and any workload where drive failure means data loss, SMART monitoring is the single highest-ROI alert you can set. Xitoring runs smartctl and nvme-cli locally and routes alerts to Slack, PagerDuty, Telegram, or your existing on-call tooling.

    Metrics

    What we monitor

    SMART Overall Health (PASS / FAIL)

    `smartctl -H` overall assessment. PASS = no pre-fail attribute has dropped to its failure threshold; FAIL = replace the drive urgently. This is the single most actionable disk health signal: alert immediately on any FAIL.
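
As a minimal sketch of how this check could be automated, the snippet below reads the `smart_status.passed` field that `smartctl --json -H` emits (smartmontools 7.0 or later). The `overall_health` helper and the sample document are illustrative assumptions, not Xitogent's actual implementation.

```python
import json


def overall_health(report: dict) -> str:
    """Map a parsed `smartctl --json -H` report to PASS, FAIL, or UNKNOWN."""
    passed = report.get("smart_status", {}).get("passed")
    if passed is None:
        # Some drives and controllers do not report an overall status.
        return "UNKNOWN"
    return "PASS" if passed else "FAIL"


# Trimmed, made-up sample of smartctl JSON output (not from a real drive).
sample = json.loads(
    '{"device": {"name": "/dev/sda"}, "smart_status": {"passed": false}}'
)
print(overall_health(sample))  # FAIL
```

Treating a missing status as UNKNOWN rather than PASS avoids silently marking a drive healthy when the controller hides its SMART data.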

    Reallocated_Sector_Ct (SMART 5)

    Sectors the drive has remapped to spare area after read/write failures. > 0 = warning (drive is starting to fail); > 10 or any rapid increase = critical. The first Backblaze 5 attribute.
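
The thresholds above could be expressed roughly as follows. This is a sketch only: the function name is hypothetical, and `max_growth` (how much growth between two samples counts as "rapid") is an assumed parameter that the text does not define.

```python
from typing import Optional


def classify_reallocated(
    raw: int, previous_raw: Optional[int] = None, max_growth: int = 5
) -> str:
    """Classify a Reallocated_Sector_Ct raw value as ok, warning, or critical."""
    # "Rapid increase" is modelled as growth beyond max_growth since the last
    # sample; the limit of 5 is an illustrative assumption.
    grew_fast = previous_raw is not None and raw - previous_raw > max_growth
    if raw > 10 or grew_fast:
        return "critical"
    if raw > 0:
        return "warning"
    return "ok"


print(classify_reallocated(0))                  # ok
print(classify_reallocated(3))                  # warning
print(classify_reallocated(8, previous_raw=1))  # critical
```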

    Current_Pending_Sector (SMART 197)

    Sectors that failed a read and are waiting to be remapped on next write. Any non-zero value is a hard warning — the drive has detected unreadable data. Backblaze 5.

    Offline_Uncorrectable (SMART 198)

    Sectors the drive could not read or correct during offline scans. Any non-zero value means the data in those sectors has probably been lost. Critical alert. Backblaze 5.

    Reported_Uncorrect (SMART 187, Reported Uncorrectable Errors)

    ECC errors the drive's internal correction couldn't fix. Strong failure predictor in the Backblaze studies — alert on any growth.

    Command_Timeout (SMART 188)

    Commands that exceeded the SATA bus timeout. Rising values usually mean cable, controller, or drive electronics are failing. Backblaze 5.

    Temperature_Celsius (SMART 194)

    Current drive temperature. HDDs degrade above 50°C; consumer SSDs throttle above 70°C. Alert at vendor-rated maximum minus 10°C for early warning.

    UDMA_CRC_Error_Count (SMART 199)

    Cable-related CRC errors on the SATA/SAS interface. Rising values flag a bad cable or loose connection — easy fix that's often misdiagnosed as drive failure.

    SSD Wear (Wear_Leveling_Count + Total_LBAs_Written)

    SSD endurance tracking. `Wear_Leveling_Count` gives the normalised remaining life; `Total_LBAs_Written` combined with the drive's rated TBW gives the current wear percentage. Alert at 80% used.
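
A worked example of the wear arithmetic, as a sketch under stated assumptions: it assumes the drive counts `Total_LBAs_Written` in 512-byte LBAs (some vendors use other units, so check the datasheet) and that TBW is quoted in decimal terabytes. The function name and the sample figures are illustrative.

```python
def wear_percent(
    total_lbas_written: int, rated_tbw: float, lba_bytes: int = 512
) -> float:
    """Estimate the share of rated endurance consumed, in percent."""
    bytes_written = total_lbas_written * lba_bytes
    rated_bytes = rated_tbw * 1e12  # TBW in decimal terabytes
    return 100.0 * bytes_written / rated_bytes


# 937.5 billion LBAs x 512 bytes = 480 TB written against a 600 TBW rating.
used = wear_percent(937_500_000_000, rated_tbw=600)
print(f"{used:.1f}% used, alert: {used >= 80}")  # 80.0% used, alert: True
```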

    NVMe percentage_used

    From `nvme smart-log` — vendor's estimate of life consumed (0–100%, can exceed 100% on worn drives). Warn above 80%; critical above 95%.

    NVMe available_spare

    Percentage of spare capacity remaining for bad-block replacement. Warn below 10%; critical below 5%. The drive also reports its own `available_spare_threshold`; once `available_spare` falls below it, the drive sets the spare bit in `critical_warning`.
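
The `percentage_used` and `available_spare` thresholds described above can be combined into a single severity, sketched here with a hypothetical `classify_nvme` helper. The inputs are assumed to be the integer percentages already read from the NVMe health log.

```python
def classify_nvme(percentage_used: int, available_spare: int) -> str:
    """Return the worst severity implied by NVMe wear and spare capacity."""
    if percentage_used > 95 or available_spare < 5:
        return "critical"
    if percentage_used > 80 or available_spare < 10:
        return "warning"
    return "ok"


print(classify_nvme(percentage_used=42, available_spare=100))  # ok
print(classify_nvme(percentage_used=85, available_spare=100))  # warning
print(classify_nvme(percentage_used=60, available_spare=4))    # critical
```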

    NVMe critical_warning

    Bitfield from `nvme smart-log` flagging: spare below threshold, temperature above threshold, device reliability degraded, read-only mode, volatile memory backup failed. Any non-zero value = immediate alert.
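
Because `critical_warning` is a bitfield, an alert is more useful when it names the bits that are set. The sketch below decodes the value using the bit order defined in the NVMe specification (bit 0 to bit 4, matching the list above); the helper name and label wording are illustrative.

```python
from typing import List

# Bit positions per the NVMe SMART / Health Information log page.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "device reliability degraded",
    3: "media in read-only mode",
    4: "volatile memory backup failed",
}


def decode_critical_warning(value: int) -> List[str]:
    """List the conditions flagged in an NVMe critical_warning byte."""
    flags = []
    for bit in range(8):
        if value >> bit & 1:
            flags.append(CRITICAL_WARNING_BITS.get(bit, f"unrecognised bit {bit}"))
    return flags


print(decode_critical_warning(0b00101))
# ['available spare below threshold', 'device reliability degraded']
```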

    Triggers & Alerts

    Configurable alert triggers

    Set up custom triggers in your dashboard to get notified the moment Disk Health metrics cross your defined thresholds.

    SMART Health Status (critical)

    Fires when SMART reports a failing health status.

    Reallocated Sectors (critical)

    Alerts when reallocated sector count exceeds threshold.

    Disk Temperature (warning)

    Triggers when disk temperature exceeds safe operating range.

    Pending Sectors (warning)

    Fires when pending sector count indicates potential failure.

    Importance of Disk Health Monitoring

    Disk failures can result in data loss and costly downtime. SMART monitoring provides early warning signs — from rising temperatures and increasing reallocated sectors to read error spikes — so you can act before a drive fails.

    • Prevent data loss with early failure detection
    • Optimize performance by identifying bottlenecks
    • Plan capacity with historical trend analysis
    • Maintain compliance with data integrity monitoring

    Why Choose Xitoring

    Xitoring provides zero-config disk health monitoring with SMART integration for all disk types. Get real-time alerts, historical trends, and predictive failure indicators in a unified dashboard.

    • Supports SSDs, HDDs, and RAID arrays
    • One-command setup on Linux & Windows
    • Customizable SMART attribute thresholds
    • Multi-channel alerting for critical disk events
    Use cases

    Common disk health monitoring scenarios

    Where disk monitoring most often catches drive failures before they cause real damage.

    Database servers

    A failed drive in a database can mean downtime, lost orders, or in the worst case, corrupted data. We watch every drive for the early warning signs of failure so the team can swap a struggling disk on their own schedule — not in the middle of an outage at 3 AM.

    Backup and archive servers

    The unique problem with backup drives is that a failure stays invisible until the day you actually need the backup — by then it's too late. We test each drive on a schedule and surface wear early so you never reach for a backup that isn't there.

    Servers that write a lot of data (SSDs)

    SSDs have a limited number of writes before they wear out, and busy databases and data-heavy apps burn through them faster than most teams realize. We track wear in plain percentages so drives get replaced on time — not after a sudden, unrecoverable failure.

    Before you start

    Prerequisites for Disk Health

    Make sure you've got these in place — most installs are a 60-second job once they are.

    • Linux server (Debian/Ubuntu, RHEL/CentOS/Alma/Rocky, or compatible) — OR Windows Server
    • Linux: smartmontools (apt install smartmontools / dnf install smartmontools) + nvme-cli (for NVMe drives) + lsblk
    • sudo / root access — SMART data requires elevated permissions
    Setup Guide

    Get started in minutes

    1

    Install prerequisites (Linux)

    Install smartmontools to enable SMART data collection, plus nvme-cli if the server has NVMe drives. Ensure lsblk is available on your system.

    # Ubuntu/Debian
    sudo apt-get install smartmontools
    # CentOS/RHEL
    sudo yum install smartmontools
    2

    Enable Disk Health integration

    Run the integrate command and select Disk Health. Xitogent will auto-detect your disks and start collecting SMART data. No prerequisites needed on Windows.

    xitogent integrate
    3

    Verify it's working

    Run this command on the server to confirm Xitogent picked up the integration. Fresh metrics will start streaming to your dashboard within ~30 seconds.

    sudo xitogent status

    Frequently asked questions

    What is SMART monitoring?
    SMART (Self-Monitoring, Analysis, and Reporting Technology) is the standard built into virtually every modern SATA, SAS, and NVMe drive for tracking internal health attributes — reallocated sectors, pending sectors, uncorrectable errors, temperature, power-on hours, wear levels. SMART monitoring is the practice of reading those attributes and alerting when they cross failure-predictive thresholds. Xitoring uses `smartctl` (smartmontools) on Linux and equivalent Windows APIs.
    How do I check disk health on Linux?
    Install smartmontools (`apt install smartmontools`), then `sudo smartctl -a /dev/sda` for full attribute output, `sudo smartctl -H /dev/sda` for overall PASS/FAIL, `sudo smartctl -t short /dev/sda` to start a quick self-test (results in ~2 min via `smartctl -l selftest /dev/sda`). For NVMe: `sudo nvme smart-log /dev/nvme0n1`. Xitogent automates all of this and trends the attributes over time.
    What does Reallocated_Sector_Ct mean?
    Reallocated_Sector_Ct (SMART attribute ID 5) is the number of sectors the drive has remapped to spare area after detecting read/write failures. The raw value is the count of actual remapped sectors. > 0 = the drive has started failing and should be on a replacement watchlist; > 10 or any rapid increase = critical, replace soon. It's one of the "Backblaze 5" attributes that most strongly predict imminent failure.
    How do I monitor SSD wear-out?
    Two attributes matter most: `Wear_Leveling_Count` (SMART 177) normalised value goes from 100 down to 0 as the drive wears (lower = more wear), and `Total_LBAs_Written` (SMART 241) lets you compute total TBs written against the drive's rated endurance (TBW). On NVMe, `percentage_used` is the vendor's direct lifetime estimate. Replace drives before they hit 80% used to avoid the sudden read-only mode some drives enter at end-of-life.
    What SMART attributes predict drive failure?
    The Backblaze quarterly Drive Stats reports (the gold standard, drawing from millions of drive-years of data) identify five attributes as the strongest failure predictors: SMART 5 (Reallocated_Sector_Ct), SMART 187 (Reported_Uncorrect), SMART 188 (Command_Timeout), SMART 197 (Current_Pending_Sector), and SMART 198 (Offline_Uncorrectable). Any non-zero raw value on 187/197/198 warrants attention; rapid growth on any of the five suggests imminent failure.
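
One way to check those five attributes is to scan the `ata_smart_attributes.table` array in `smartctl --json -A` output. The sketch below assumes that JSON layout (smartmontools 7.0 or later); the helper name and the sample report are made up for illustration. Note that some drives pack several counters into one raw value, for example on attribute 188, so a non-zero raw number needs interpreting per vendor.

```python
from typing import Dict

BACKBLAZE_5 = {5, 187, 188, 197, 198}


def backblaze_flags(report: dict) -> Dict[int, int]:
    """Return {attribute id: raw value} for Backblaze 5 attributes above zero."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return {
        attr["id"]: attr["raw"]["value"]
        for attr in table
        if attr["id"] in BACKBLAZE_5 and attr["raw"]["value"] > 0
    }


# Trimmed, made-up sample report (not from a real drive).
sample = {
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 0}},
            {"id": 194, "name": "Temperature_Celsius", "raw": {"value": 35}},
            {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 8}},
        ]
    }
}
print(backblaze_flags(sample))  # {197: 8}
```
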
    How do I monitor NVMe drive health?
    Use `nvme-cli`: `sudo nvme smart-log /dev/nvme0n1` returns `critical_warning` (bitfield; any non-zero value is an immediate alert), `temperature`, `available_spare` (warn < 10%, critical < 5%), `percentage_used` (warn > 80%), `data_units_written` (each unit is 1,000 × 512 bytes, i.e. 512,000 bytes; multiply to get total bytes written), `media_errors`, `num_err_log_entries`, `unsafe_shutdowns`. Xitogent reads all of them and trends over time.
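
As a small worked example, `data_units_written` can be converted to terabytes written. The NVMe specification defines one data unit as 1,000 blocks of 512 bytes; the function name and the sample figure below are illustrative.

```python
BYTES_PER_DATA_UNIT = 1000 * 512  # one NVMe data unit = 512,000 bytes


def data_units_to_tb(data_units_written: int) -> float:
    """Convert NVMe data_units_written to decimal terabytes."""
    return data_units_written * BYTES_PER_DATA_UNIT / 1e12


print(round(data_units_to_tb(2_000_000), 3))  # 1.024
```
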
    How do I monitor disk health on Windows?
    Windows exposes SMART via WMI (`Win32_DiskDrive`, `MSStorageDriver_FailurePredictStatus`) and via the Storage PowerShell module (`Get-PhysicalDisk`, `Get-StorageReliabilityCounter`). The Xitogent Windows agent reads both and translates them to the same metric set as Linux.
    How often should I run smartctl self-tests?
    A common cadence: weekly short tests (~2 minutes, non-disruptive) plus monthly long tests (several hours, slight performance impact; run them during low-IO windows). Configure via `smartd.conf` for automatic scheduling. Xitogent picks up self-test results from `smartctl -l selftest` and surfaces failures even if the drive's overall SMART health still reports PASS.
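
For illustration, the helper below renders a `smartd.conf` line for that cadence using smartd's `-s` schedule syntax (`T/MM/DD/d/HH`, where `d` is the weekday with 1 = Monday and 7 = Sunday). The function and its default times (short test Sundays at 02:00, long test on the 1st at 03:00) are assumptions chosen for the example, not Xitogent defaults.

```python
def smartd_schedule(
    device: str,
    short_weekday: int = 7,
    short_hour: int = 2,
    long_day: int = 1,
    long_hour: int = 3,
) -> str:
    """Build a smartd.conf line with a weekly short and a monthly long self-test."""
    short = f"S/../../{short_weekday}/{short_hour:02d}"  # any month/day, fixed weekday
    long_ = f"L/../{long_day:02d}/./{long_hour:02d}"     # fixed day of month, any weekday
    return f"{device} -a -s ({short}|{long_})"


print(smartd_schedule("/dev/sda"))
# /dev/sda -a -s (S/../../7/02|L/../01/./03)
```
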
    Does this work with RAID arrays?
    Yes for most modern RAID controllers (LSI/Broadcom megaraid, HP Smart Array, Adaptec) — they pass through SMART data per-drive. Use `smartctl -d megaraid,N /dev/sda` style syntax (Xitogent auto-detects). For software RAID (Linux mdraid, ZFS), each underlying drive is independently monitorable. For RAID-on-chip enclosures that hide SMART, you're stuck with controller-level health only.
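
When checking drives behind a MegaRAID controller by hand, the per-drive commands can be generated as sketched below. The helper is hypothetical, and the disk IDs are assumed to be known already (for example from the controller's own tooling); how Xitogent auto-detects them is not shown here.

```python
from typing import List


def megaraid_commands(block_device: str, disk_ids: List[int]) -> List[List[str]]:
    """Build one `smartctl -a -d megaraid,N` argument list per physical disk."""
    return [
        ["smartctl", "-a", "-d", f"megaraid,{n}", block_device]
        for n in disk_ids
    ]


for cmd in megaraid_commands("/dev/sda", [0, 1]):
    print(" ".join(cmd))
# smartctl -a -d megaraid,0 /dev/sda
# smartctl -a -d megaraid,1 /dev/sda
```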

    Start monitoring Disk Health today

    Set up in under 60 seconds. No credit card required. Full metrics from day one.

    Start Free Trial
