Auto Fault Tolerance
Alert fatigue kills monitoring effectiveness. When infrastructure is unstable (network hiccups, deployments, brief outages), constant alerts for the same issue waste time and train teams to ignore notifications. Auto Fault Tolerance solves this by temporarily increasing the alert buffer during turbulent periods, then automatically reverting to sensitive monitoring when stability returns.
Alert Noise Reduction
Manual FT management: Constant tweaking, alert storms during incidents
Auto Fault Tolerance: Automatic adjustment, clean signal during chaos
Result: Only meaningful alerts reach your team, even during infrastructure stress.
What Auto Fault Tolerance Does
Dynamic Alert Buffer Adjustment
Fault Tolerance (FT) is the time buffer before Xitoring creates an incident. Auto Fault Tolerance automatically manages this buffer:
Stable Period (Normal Behavior):
- Configured FT: 5 minutes
- Service goes down → waits 5 min → alerts if still down
- Sensitive monitoring: Quick alerts for real problems
Unstable Period (Flapping/Turbulence):
- System detects: Multiple up/down cycles in short period
- Auto Fault Tolerance activates: Temporarily increases FT to 10-15 minutes
- Service flaps up/down/up/down → only alerts if sustained downtime
- Reduced noise: No alerts for brief hiccups
Return to Stability:
- System detects: No issues for 30+ minutes
- Auto Fault Tolerance deactivates: Returns to configured 5-minute FT
- Back to sensitive monitoring: Quick response to new incidents
Automatic, zero configuration - Xitoring decides when to adjust FT based on real-time behavior patterns.
How Auto Fault Tolerance Works
Detection of Instability
Auto Fault Tolerance activates when it detects flapping (rapid state changes):
Flapping Criteria:
- 3+ up/down transitions within a 15-minute window
- Brief outages < configured FT (e.g., down for 2 min, up for 1 min, down again)
- Inconsistent checks - Some checks succeed, some fail, pattern repeats
Example - Flapping Web Server:
- 1:00 PM - Server down (network blip)
- 1:01 PM - Server back up
- 1:03 PM - Server down again (another blip)
- 1:04 PM - Server back up
- 1:05 PM - Server down (third time)
Without Auto FT: with a short configured FT, you'd receive 3 separate incident alerts in 5 minutes (alert storm)
With Auto FT: System detects flapping at 1:05 PM → increases FT to 15 minutes → only alerts if down for sustained 15+ min period
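The flapping criteria and the web-server example above can be sketched as a simple sliding-window check. This is an illustrative approximation, not Xitoring's actual implementation; the 3-transition/15-minute thresholds come from the criteria list above.

```python
from datetime import datetime, timedelta

# Illustrative sketch of flapping detection (not Xitoring's actual code).
# A check "flaps" when it makes 3+ up/down transitions within a 15-minute window.
WINDOW = timedelta(minutes=15)
MIN_TRANSITIONS = 3

def is_flapping(events, now):
    """events: list of (timestamp, state) tuples, oldest first; state is 'up'/'down'."""
    recent = [e for e in events if now - e[0] <= WINDOW]
    transitions = sum(
        1 for prev, cur in zip(recent, recent[1:]) if prev[1] != cur[1]
    )
    return transitions >= MIN_TRANSITIONS

# The flapping web-server timeline from the example above:
t = datetime(2024, 1, 1, 13, 0)
events = [
    (t, "down"),                         # 1:00 PM - down (network blip)
    (t + timedelta(minutes=1), "up"),    # 1:01 PM - back up
    (t + timedelta(minutes=3), "down"),  # 1:03 PM - down again
    (t + timedelta(minutes=4), "up"),    # 1:04 PM - back up
    (t + timedelta(minutes=5), "down"),  # 1:05 PM - third time
]
print(is_flapping(events, t + timedelta(minutes=5)))  # True: 4 transitions in 5 min
```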
FT Adjustment Strategy
When flapping detected, Auto Fault Tolerance increases FT using this logic:
| Current FT | Increased To | Reasoning |
|---|---|---|
| 1 minute | 5 minutes | 5x increase for very sensitive triggers |
| 3 minutes | 10 minutes | ~3x increase |
| 5 minutes | 15 minutes | 3x increase (most common) |
| 10 minutes | 20 minutes | 2x increase |
| 15+ minutes | No change | Already highly tolerant |
Conservative increases - Auto FT doesn't make triggers insensitive; it prevents spam during known instability.
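The adjustment table above can be sketched as a simple mapping. This is an illustrative approximation of the documented policy, not Xitoring's actual code; behavior for FT values between the listed rows (e.g., 4 minutes) is an assumption.

```python
# Sketch of the FT-increase policy from the table above (illustrative,
# not Xitoring's actual code). All FT values are in minutes.
def increased_ft(current_ft: int) -> int:
    if current_ft >= 15:
        return current_ft  # already highly tolerant: no change
    if current_ft >= 10:
        return 20          # 2x increase
    if current_ft >= 5:
        return 15          # 3x increase (most common)
    if current_ft >= 3:
        return 10          # ~3x increase
    return 5               # very sensitive triggers: 5x increase

print(increased_ft(5))   # 15
print(increased_ft(20))  # 20 (unchanged)
```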
Return to Normal Detection
Auto Fault Tolerance reverts FT when stability is detected:
Stability Criteria:
- No incidents for 30+ consecutive minutes
- All checks succeeding consistently
- No flapping patterns observed
Gradual reversion:
- After 30 min stable → reduce FT by 50% (e.g., 15 min → 7-8 min)
- After another 30 min stable → reduce to original configured FT (e.g., 7 min → 5 min)
- Prevents oscillation - Doesn't immediately snap back, gradually restores sensitivity
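The gradual reversion above can be sketched as a per-interval step function, assuming one step is taken after each 30-minute stable interval. This is illustrative only, not Xitoring's actual code.

```python
# Sketch of gradual FT reversion (illustrative, not Xitoring's actual code).
# After each 30-minute stable interval, the elevated FT is roughly halved;
# once halving would undershoot, FT snaps back to the configured value.
def revert_step(current_ft: int, configured_ft: int) -> int:
    if current_ft <= configured_ft:
        return configured_ft  # already back to normal
    halved = current_ft // 2
    return halved if halved > configured_ft else configured_ft

ft = 15        # elevated by Auto FT
configured = 5
ft = revert_step(ft, configured)  # after 30 min stable: 15 -> 7
ft = revert_step(ft, configured)  # after another 30 min: 7 -> 5
```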
When Auto Fault Tolerance Helps
Perfect Scenarios for Auto FT
✅ Network instability - ISP issues causing intermittent connectivity
✅ Deployment windows - Rolling restarts cause brief up/down cycles
✅ Infrastructure maintenance - Cloud provider maintenance windows
✅ DNS propagation - Changes causing temporary resolution failures
✅ Overloaded services - Service flapping under heavy load before stabilizing
✅ Third-party dependencies - External API instability affecting your checks
Common pattern: Something is wrong, but alerting every 2 minutes doesn't help—you're already investigating and fixing it.
Not Helpful When
❌ Sustained outages - If the service is actually down for 20 minutes, Auto FT just delays the inevitable alert
❌ Single brief blips - One 30-second outage doesn't trigger Auto FT (this is what normal FT handles)
❌ Gradual degradation - Slowly increasing response times (not flapping, just getting worse)
Auto FT specifically targets alert storm scenarios - rapid repeated failures, not all monitoring situations.
Auto Fault Tolerance in Action
Scenario 1: Network Maintenance
Timeline:
| Time | Event | Without Auto FT | With Auto FT |
|---|---|---|---|
| 2:00 AM | Network maintenance starts | - | - |
| 2:05 AM | First connectivity loss | 🚨 Alert sent | Auto FT detects start |
| 2:07 AM | Connectivity restored | ✅ Resolved | No alert yet |
| 2:10 AM | Second connectivity loss | 🚨 Second alert | FT increased to 15 min |
| 2:13 AM | Connectivity restored | ✅ Resolved | Still monitoring |
| 2:16 AM | Third connectivity loss | 🚨 Third alert | Waiting (FT = 15 min) |
| 2:35 AM | Maintenance complete | ✅ Resolved | No alert sent |
| 3:05 AM | Stable for 30 minutes | - | FT gradually reduced |
| 3:35 AM | Fully stable | - | FT back to normal 5 min |
Result:
- Without Auto FT: 3 alerts during known maintenance (alert storm, wasted attention)
- With Auto FT: 0 alerts (maintenance completed before extended FT expired)
Scenario 2: Service Under Load
Situation: E-commerce site during Black Friday sale
Symptoms:
- Web server occasionally fails health checks (overloaded, slow response)
- Fails check → load decreases slightly → passes check → load increases → fails again
- Cycle repeats 5-6 times over 20 minutes
Without Auto FT:
- 5-6 separate incident alerts
- Ops team already aware (traffic spike expected)
- Each alert is noise, not signal
With Auto FT:
- After second failure (cycle detected), FT increases to 15 minutes
- No alerts during 20-minute turbulence period
- Service stabilizes after load balancer scaled
- FT gradually returns to normal
- Only alerts if sustained outage occurs
Scenario 3: DNS Propagation
Situation: Migrating DNS from one provider to another
Timeline:
- DNS change made at 10:00 AM
- Some Xitoring monitoring nodes still resolve old IP (cached)
- Other nodes resolve new IP correctly
- Check results: Fail, Success, Fail, Success (pattern repeats)
Without Auto FT:
- Alerts for each node that temporarily can't reach service
- False positive—service is fine, DNS propagating normally
With Auto FT:
- Flapping detected after 3-4 cycles
- FT increased, no alerts sent
- DNS propagation completes within 30 minutes
- All nodes resolve correctly
- Auto FT reverts, monitoring resumes
Result: No false alerts during expected DNS propagation delay.
Monitoring Auto FT Activity
Viewing Auto FT Status
Check whether Auto FT is currently active:
- Dashboard → Uptime → [Your Check] → Overview
- Look for indicator: "Auto Fault Tolerance Active"
- Shows:
- Current FT value (e.g., "15 minutes - Auto-Adjusted")
- Original configured FT (e.g., "Normal: 5 minutes")
- Why adjusted ("Detected instability pattern")
- When it started ("Activated 23 minutes ago")
For all checks:
- Dashboard → Uptime → Filter: "Auto FT Active"
- See list of all checks currently under Auto FT protection
Auto FT Event Log
Track Auto FT adjustments over time:
- Dashboard → Servers/Uptime → [Resource] → Activity Log
- Filter for Auto FT events:
- "Auto Fault Tolerance Activated" - FT increased
- "Auto Fault Tolerance Deactivated" - FT returned to normal
- Each event shows:
- Timestamp
- Reason (flapping detected, stability restored)
- FT change (5 min → 15 min, or 15 min → 5 min)
Use for:
- Understanding when infrastructure was unstable
- Correlating Auto FT activations with deployments or incidents
- Auditing alert suppression (why didn't this outage alert?)
Configuring Auto Fault Tolerance
Global Enable/Disable
Auto Fault Tolerance is enabled by default for all uptime checks.
To disable globally:
- Account Settings → Monitoring Preferences → Auto Fault Tolerance
- Toggle "Enable Auto Fault Tolerance" OFF
- Save settings
Impact:
- New checks: Auto FT will not activate
- Existing checks: Auto FT immediately deactivates (FT returns to configured values)
- You manage FT manually for all resources
When to disable globally:
- You prefer full manual control over alert timing
- Compliance requires fixed FT values (no dynamic adjustment)
- Your infrastructure never experiences flapping (rare)
Per-Check Configuration
Disable Auto FT for a specific uptime check:
- Dashboard → Uptime → [Your Check] → Settings
- Advanced Options → Auto Fault Tolerance → Disable
- Save
When to disable per-check:
- Critical service where you want every outage alerted immediately (even during flapping)
- SLA monitoring (fixed FT required for compliance)
- Testing/dev checks (want to see all state changes for debugging)
Note: Auto FT only applies to uptime checks, not server monitoring triggers. Server triggers use static FT.
Auto FT vs Manual FT Management
Manual FT Management (Traditional Approach)
How it works:
- You set FT = 5 minutes when creating check
- Check flaps → alert storm → you manually increase FT to 15 minutes
- Flapping stops → you forget to reduce FT back to 5 minutes
- Days later, real outage → delayed alert because FT still 15 minutes
Problems:
- Requires manual intervention during incidents (when you're already busy)
- Easy to forget to revert FT after crisis
- Permanently loosened FT misses future fast incidents
Auto FT (Automated Approach)
How it works:
- You set FT = 5 minutes when creating check
- Check flaps → Auto FT increases to 15 minutes automatically
- Flapping stops → Auto FT reverts to 5 minutes automatically
- Days later, real outage → alerts after 5 minutes (original sensitivity)
Advantages:
- Zero manual work (automatic adjustment)
- Always reverts to sensitive monitoring
- Consistent behavior (no human error)
Best practice: Keep Auto FT enabled and set the base FT to your desired "normal" sensitivity (typically 3-5 minutes for production, 10-15 for dev/test).
Limitations & Edge Cases
What Auto FT Cannot Detect
Auto Fault Tolerance works on pattern recognition. It cannot intelligently handle:
❌ Scheduled maintenance - Doesn't know your deployment is planned at 2 AM
Solution: Use Maintenance Schedules to silence alerts during known windows
❌ Complex failure modes - Can't distinguish "database backup causing load" from "database under attack"
Solution: Create specific triggers with custom FT for known periodic events
❌ Single long outage - If a service goes down and stays down, Auto FT doesn't help (it alerts only after the extended FT period)
Solution: This is intended behavior—sustained outages should alert
Potential Delayed Alerts
Scenario where Auto FT might delay important alert:
- Service has been flapping for 20 minutes (Auto FT active, FT = 15 min)
- Flapping stops, service goes completely down
- Because FT is now 15 min, the alert is delayed until the service has been down for 15+ minutes
Mitigation:
- Auto FT gradually reduces FT as stability observed
- If truly critical service, disable Auto FT for that specific check
- Use Heartbeat monitoring as backup (harder to flap)
False Stability Detection
Rare case: Service flaps for 15 min, then stable for 30 min, then flaps again
Auto FT behavior:
- First flapping → FT increased to 15 min
- 30 min stability → FT reduced back to 5 min
- Second flapping starts → FT increases again
Not a bug: Auto FT responds to observed patterns. If infrastructure oscillates between stable/unstable, Auto FT will track it.
Best Practices
Combine with Other Features
Auto FT works best alongside:
- Maintenance Schedules - Silence alerts during planned work
- Auto-Triggers - Intelligent baseline thresholds reduce false positives
- Notification Roles - Route reduced alerts to appropriate teams
- Manual FT tuning - Set base FT appropriate for each check type (3 min for critical, 10 min for dev)
Set Appropriate Base FT
Auto FT adjusts FROM your configured FT value. Choose wisely:
| Check Type | Recommended Base FT | Why | Auto FT Adjusts To |
|---|---|---|---|
| Critical production API | 3 minutes | Fast alerting | ~10 minutes during instability |
| Customer-facing website | 5 minutes | Balance speed/accuracy | ~15 minutes during instability |
| Internal service | 10 minutes | Reduce noise | ~20 minutes during instability |
| Dev/test environment | 15 minutes | Minimal alerts | No adjustment (already high) |
Rule of thumb: Base FT should be the shortest time you can tolerate for "real" outage alerts. Auto FT handles the turbulent periods.
Review Auto FT Activity Logs
Monthly audit:
- Check which uptime checks frequently trigger Auto FT
- Investigate root cause (why is this service flapping?)
- Consider fixing underlying instability instead of relying on Auto FT
- Auto FT is a safety net, not a solution to broken infrastructure
Troubleshooting
Auto FT Not Activating During Flapping
Symptoms: Service clearly flapping (up/down/up/down), but Auto FT not increasing FT
Solutions:
- Verify Auto FT enabled:
- Check → Settings → Auto Fault Tolerance → Ensure enabled
- Account Settings → Verify global Auto FT enabled
- Check flapping frequency:
- Auto FT requires 3+ state changes within 15 minutes
- If cycling slower (e.g., down for 10 min, up for 10 min), may not trigger
- Review configured FT:
- If base FT already 15+ minutes, Auto FT may not adjust (already tolerant)
- Check resource type:
- Auto FT only applies to uptime checks, not server monitoring
Auto FT Not Reverting to Normal
Symptoms: Auto FT activated days ago, service now stable, but FT still elevated
Solutions:
- Wait longer:
- Auto FT requires 30+ min of stability before reverting
- Reversion is gradual (may take 60-90 min to fully return)
- Verify stability:
- Check → Incident History → Ensure no recent incidents
- If failures still occur occasionally, Auto FT won't revert
- Manual override:
- Check → Settings → Advanced → "Reset Auto FT to Default"
- Forces immediate return to configured FT
- Check for underlying issues:
- If Auto FT not reverting, service may still be subtly unstable
- Investigate metrics for patterns
Unexpected Alert Delays
Symptoms: Service went down, alert took longer than expected
Solutions:
- Check if Auto FT was active:
- Review Activity Log for "Auto FT Activated" event before outage
- If active, extended FT explains delay
- Review recent flapping:
- Check incident history for previous up/down cycles (triggered Auto FT)
- Verify current FT value:
- Check settings may show "FT: 15 min (Auto-Adjusted from 5 min)"
- Explains longer wait period
- Disable Auto FT if critical:
- For checks requiring immediate alerting regardless of flapping, disable Auto FT
- Accept alert storms during instability as trade-off for speed
Common Scenarios
Scenario: Cloud Provider Maintenance
Situation: AWS announces maintenance window 2-4 AM
Pre-maintenance:
- Create Maintenance Schedule for 2-4 AM
- Alerts fully silenced during this window
During maintenance:
- Unexpected issues extend maintenance to 4:30 AM
- Services start flapping at 4:05 AM (outside maintenance window)
- Auto FT activates at 4:05 AM (detects flapping)
- No alerts during short 25-minute overrun
Post-maintenance:
- Services stabilize by 4:30 AM
- Auto FT detects stability at 5:00 AM (30 min stable)
- FT gradually returns to normal by 5:30 AM
Result: Maintenance schedule + Auto FT = smooth alert management even when maintenance runs long.
Scenario: Rolling Deployment
Situation: Deploying new version across 10 web servers
Deployment process:
- Take server out of load balancer → deploy → restart → health check → add back to LB
- Each server: 2-3 minute downtime
- Staggered deployments: every 5 minutes
Without Auto FT:
- Up to 10 separate "Server Unreachable" alerts (assuming FT is shorter than each server's downtime)
- Each resolves in 2-3 minutes
- Alert noise, even though deployment is expected
With Auto FT:
- First server down → recovers within the 5 min FT, no alert yet
- Second server down at +5 min, first still recovering → flapping detected
- Auto FT increases FT to 15 minutes
- Remaining servers deploy without alerting (each down < 15 min)
- All healthy by deployment end
- Auto FT reverts after 30 min stability
Better approach: Combine Auto FT with Maintenance Schedule for cleaner alert management during planned work.
Scenario: ISP Routing Issue
Situation: Your ISP has intermittent routing problems
Symptoms:
- External uptime checks (HTTP/Ping) fail sporadically
- Failures last 30-60 seconds, then resolve
- Pattern repeats every 5-10 minutes for hours
Without Auto FT:
- Dozens of alerts over several hours
- No action you can take (ISP's problem)
- Eventually mute notifications out of frustration
With Auto FT:
- After 3rd failure cycle, Auto FT increases FT to 15 minutes
- Further short blips don't alert (each < 15 min)
- If ISP outage becomes sustained (> 15 min), you do get alert
- ISP resolves issue after 3 hours
- Auto FT reverts after 30 min stability
Result: No alert storm, but still protected if short blips become real sustained outage.
See Also
- Fault Tolerance (Glossary) - Understanding FT concept
- Auto-Triggers - Automatic threshold recommendations
- Maintenance Schedules - Silence alerts during planned work
- Notification Roles - Configure alert routing
- FAQ: Understanding Fault Tolerance - FT fundamentals
- Uptime Monitoring - All uptime check types
- Glossary: Auto Fault Tolerance - Term definition
Next Step: Combine Auto Fault Tolerance with Maintenance Schedules for complete alert management during planned and unplanned instability!