What Is Self-Healing Infrastructure
Self-healing infrastructure automatically detects, diagnoses, and recovers from failures without human intervention. osModa implements self-healing through three integrated components: osmoda-watch for process restart, NixOS SafeSwitch for atomic rollback, and a SHA-256 hash-chained audit ledger for forensic recording.
The Three Pillars of Self-Healing
Self-healing is not a single mechanism -- it is a layered system where each layer handles a different class of failure. osModa's self-healing stack has three layers that work together in a closed loop:
Watchdog (osmoda-watch)
Monitors all agent processes and osModa daemons. When a process crashes or fails a health check, osmoda-watch restarts it with a 6-second median recovery time. This handles the most common failure mode: a process that exits unexpectedly due to an OOM kill, unhandled exception, or dependency timeout.
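The restart behavior described above can be sketched as a minimal supervision loop. This is an illustrative pattern, not osModa's actual implementation (osmoda-watch is a Rust daemon); the function name, restart budget, and backoff are assumptions:

```python
import subprocess
import time

def supervise(cmd, max_restarts=3, backoff=1.0):
    """Minimal watchdog loop: run a command, restart it when it exits abnormally.

    Returns True if the command eventually exits cleanly, False once the
    restart budget is exhausted -- the signal to escalate to a different
    recovery action (in osModa's model, an atomic rollback).
    """
    for attempt in range(max_restarts + 1):
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0:
            return True              # orderly exit: nothing to heal
        time.sleep(backoff)          # brief pause before restarting
    return False                     # still crashing: escalate
```

The key design point is the bounded restart budget: a watchdog that restarts forever masks a crash loop instead of surfacing it.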
Atomic Rollback (NixOS SafeSwitch)
If a process keeps crashing after restart -- because a configuration change, package update, or dependency conflict broke it -- restarting the process will never fix it. SafeSwitch detects this pattern and rolls back the entire NixOS system to the previous generation in under 5 seconds. The system is now back to the last known-good configuration.
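Detecting the "keeps crashing after restart" pattern typically means counting crashes inside a sliding time window. A minimal sketch of that idea, with illustrative thresholds (osModa's actual tuning is not documented here):

```python
import time
from collections import deque

class CrashLoopDetector:
    """Flags a crash loop: too many crashes inside a sliding time window."""

    def __init__(self, max_crashes=3, window_seconds=60.0):
        self.max_crashes = max_crashes
        self.window = window_seconds
        self.crash_times = deque()

    def record_crash(self, now=None):
        """Record one crash; return True when escalation (rollback) is warranted."""
        now = time.monotonic() if now is None else now
        self.crash_times.append(now)
        # Drop crashes that have fallen out of the sliding window.
        while self.crash_times and now - self.crash_times[0] > self.window:
            self.crash_times.popleft()
        return len(self.crash_times) >= self.max_crashes
```

When `record_crash` returns True, restarting is pointless and the system-level rollback takes over.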
Audit Ledger (SHA-256 Hash Chain)
Every event in the self-healing chain is recorded: the initial crash (timestamp, exit code, process name), each restart attempt, the escalation to rollback, and the post-recovery health check results. The hash chain structure makes these records tamper-evident. This data powers root cause analysis and compliance reporting.
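The hash-chain structure works by folding each record's content together with the previous record's hash, so altering any earlier entry changes every hash after it. A minimal sketch of the append step (record layout and field names are illustrative, not osModa's ledger format):

```python
import hashlib
import json

def append_event(chain, event):
    """Append an event to a SHA-256 hash chain (list of dict records).

    Each record stores the previous record's hash alongside its own,
    which is what makes the ledger tamper-evident.
    """
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps(event, sort_keys=True)   # canonical serialization
    record_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"prev_hash": prev_hash, "event": event, "hash": record_hash})
    return chain
```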
The Self-Healing Loop
Self-healing operates as a closed loop: detect the failure, recover from it, and record what happened. This loop runs automatically and continuously.
Self-healing sequence
1. Agent process crashes (OOM, timeout, unhandled error)
2. osmoda-watch detects exit event (< 1 second)
3. Audit ledger records: crash event, exit code, timestamp
4. osmoda-watch restarts process (6-second median)
5. Health check runs against restarted process
6. IF healthy → audit records recovery, resume normal operation
7. IF unhealthy (crash loop detected) →
   a. SafeSwitch triggers NixOS atomic rollback (< 5 seconds)
   b. Previous generation activated
   c. All services restart with known-good configuration
   d. Audit records rollback event and generation change
8. Post-rollback health check confirms system is healthy
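The sequence above can be condensed into a single recovery routine. This is a structural sketch with injected stubs standing in for the real watchdog, health check, SafeSwitch rollback, and audit ledger (all names here are hypothetical):

```python
def heal(restart, is_healthy, rollback, record, max_restarts=3):
    """One pass through the detect -> recover -> record loop.

    `restart`, `is_healthy`, `rollback`, and `record` are callables
    standing in for the real components. Returns "restarted" when a
    process restart fixed the failure, "rolled_back" when the system
    escalated to an atomic rollback.
    """
    record({"type": "crash_detected"})
    for attempt in range(1, max_restarts + 1):
        restart()
        record({"type": "restart", "attempt": attempt})
        if is_healthy():
            record({"type": "recovered"})
            return "restarted"
    # Crash loop: restarting will never fix a broken configuration,
    # so escalate to rolling back the whole system generation.
    rollback()
    record({"type": "rollback"})
    return "rolled_back"
```

Note that every branch records to the ledger before and after acting, so the forensic trail exists even if a later step fails.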
Why Self-Healing Matters for AI Agents
AI agents are long-running, stateful processes that interact with external systems through tool use. They crash more often than traditional web services because they make more external calls (LLM APIs, databases, third-party services), handle more complex state, and run for longer periods. A customer support agent that crashes at 2 AM and stays down until morning means hours of missed conversations.
Self-healing turns crashes from outages into blips. The 6-second watchdog restart means most crashes resolve before users notice. The atomic rollback safety net means even catastrophic configuration errors are recovered within seconds. The audit ledger means every incident is documented for post-mortem analysis, and every recovery is verified.
osModa's Self-Healing Implementation
The self-healing stack is built into every osModa server. It is not an add-on or a premium feature -- every plan from Solo ($14.99/mo, 2 CPU / 4 GB / 40 GB) to Scale ($125.99/mo, 16 CPU / 32 GB / 320 GB) includes the complete self-healing stack.
All 9 Rust daemons are supervised by osmoda-watch, including agentd, osmoda-mcpd, osmoda-routines, osmoda-voice, osmoda-mesh, osmoda-keyd, osmoda-teachd, and osmoda-egress. Agent processes you deploy are also supervised automatically. The NixOS foundation provides the atomic rollback capability. The audit ledger runs continuously and records every significant event.
Servers are available in Frankfurt, Helsinki, Virginia, and Oregon. Multi-channel access (Telegram, WhatsApp, Discord, Slack, web) means you can check on your agents from any interface. The dashboard supports Claude Opus, Sonnet, Haiku, GPT-4o, and o3-mini for agent operations.
For the full technical breakdown, see the Self-Healing Agent Servers page.
Frequently Asked Questions
What is self-healing infrastructure?
Self-healing infrastructure is a system that automatically detects failures, determines the appropriate recovery action, and executes that action without human intervention. Instead of paging an engineer at 3 AM to restart a crashed process, the infrastructure handles recovery on its own. The term encompasses process restart (watchdog), configuration rollback (atomic rollback), and forensic recording (audit logging).
How does osModa implement self-healing?
osModa implements self-healing through three integrated components: osmoda-watch (watchdog daemon) for process-level crash recovery with 6-second median restart, NixOS SafeSwitch for system-level atomic rollback when process restarts are insufficient, and the SHA-256 hash-chained audit ledger for recording every failure and recovery event. These three components form a closed loop: detect, recover, record.
What is the difference between self-healing and auto-restart?
Auto-restart is one component of self-healing: restarting a crashed process. Self-healing is broader. If a process keeps crashing after restart (because the underlying configuration is broken), auto-restart alone creates a restart loop. Self-healing recognizes this pattern and escalates to a different recovery action -- in osModa's case, an atomic rollback to the previous NixOS generation that was known to work.
How fast does osModa recover from failures?
Process crashes are detected immediately via kernel event monitoring. osmoda-watch restarts the process with a 6-second median recovery time. If the process fails repeatedly, SafeSwitch triggers an atomic rollback to the previous NixOS generation, which takes under 5 seconds. Total recovery time from crash to healthy state is typically under 15 seconds for process failures and under 20 seconds for configuration failures.
Does self-healing work for all types of failures?
Self-healing handles the most common failure modes: process crashes, hung processes (detected via health checks), and broken configurations (detected via post-deployment health checks). It does not handle hardware failures (disk failure, network outage) or application-level logic bugs (the agent is running but producing incorrect results). For hardware failures, server replacement is needed. For logic bugs, the audit ledger provides the forensic data to diagnose the issue.
How does the audit ledger support self-healing?
The SHA-256 hash-chained audit ledger records every event in the self-healing chain: what crashed, when, with what exit code, how many restart attempts occurred, whether a rollback was triggered, and the health check results after recovery. This data enables post-incident analysis to understand root causes and prevent recurrence. The tamper-evident property ensures the forensic record cannot be altered.
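The tamper-evident property can be checked by recomputing every hash from the start of the chain. A minimal verification sketch, assuming records that each hold the previous record's hash, the event payload, and their own hash (an illustrative layout, not osModa's on-disk format):

```python
import hashlib
import json

def verify_chain(chain):
    """Recompute every hash in a SHA-256 hash-chained ledger.

    Returns the index of the first tampered record, or -1 if the
    chain is intact. Any edit to an earlier record breaks the hash
    of that record and every record after it.
    """
    prev_hash = "0" * 64
    for i, record in enumerate(chain):
        payload = json.dumps(record["event"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if record["prev_hash"] != prev_hash or record["hash"] != expected:
            return i
        prev_hash = record["hash"]
    return -1
```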
Infrastructure That Heals Itself
Watchdog restart, atomic rollback, and tamper-evident audit logging on every server. Plans from $14.99/month.
Spawn Server