How watchdog auto-restart works on osModa
  1. Agent crashes -- the process exits unexpectedly at any time, whether from your agent's own code or any dependency failure.

  2. 6-second recovery -- the watchdog daemon detects the failure, restarts the agent, and verifies health. Median 6-second cycle.

  3. You get notified -- the audit ledger records the incident. Check it via Telegram or SSH the next morning.

Deploy with Watchdog · From $14.99/mo · full root SSH

Watchdog Auto-Restart for AI Agents

The osModa watchdog daemon monitors every agent process on your server. When an agent crashes, hangs, or leaks memory, the watchdog detects the failure and restarts the process automatically -- with a median recovery time of 6 seconds. If the restart fails, it escalates to a NixOS rollback. Every recovery action is recorded in the tamper-evident audit ledger. Your agents restart at 3am. You sleep.

Agent crashes in production are inevitable. LLM API timeouts, memory exhaustion from long-running conversations, unhandled exceptions in tool execution, network failures during external API calls -- the failure modes for AI agents are numerous and often unpredictable. In 2026, as enterprises deploy more agents into production workloads, the cost of agent downtime has increased dramatically. A customer-facing support agent that crashes at 2am and stays down until an engineer notices at 9am represents 7 hours of lost service. A data processing agent that crashes mid-batch and does not restart can delay critical business operations by hours or days.

The traditional approach -- a systemd unit with Restart=always -- handles only the simplest failure mode: process exit. It does not detect hangs, memory leaks, or unhealthy processes that are technically running but not functioning. It does not integrate with the deployment system to roll back bad configurations. It does not create compliance-grade audit records of recovery events. The osModa watchdog is a purpose-built Rust daemon that handles every failure mode AI agents exhibit in production, with the observability and audit trail that production operations require.

TL;DR

  • The watchdog daemon detects crashes, hangs, memory leaks, and health failures through four independent mechanisms
  • Median recovery time is 6 seconds from detection to the agent being fully operational again
  • If restart fails, the watchdog escalates to a NixOS rollback to the previous known-good configuration
  • Unlike systemd Restart=always, the watchdog is health-aware with memory leak detection and configurable escalation
  • Every recovery event is logged in the tamper-evident audit ledger for debugging and compliance evidence

How the Watchdog Detects Failures

The watchdog uses four independent detection mechanisms. Each targets a different category of failure that AI agents experience in production.

Process Exit Monitoring

The most basic detection: the agent process terminates. The watchdog detects the exit immediately through kernel process notifications. It records the exit code and signal (SIGSEGV for segfaults, SIGKILL for OOM, SIGABRT for assertion failures) in the audit ledger. This handles crashes from unhandled exceptions, assertion failures, memory corruption, and OOM kills. Detection time: under 100 milliseconds.
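As an illustration of how a supervisor might turn a raw exit status into the kind of label described above, here is a minimal sketch. It assumes a POSIX wait status as returned by `os.waitpid`; the helper name and label format are hypothetical, not osModa's actual code.

```python
import os
import signal

# Hypothetical sketch: classify a raw POSIX wait status (as returned by
# os.waitpid) into an audit-ledger-style label.
def classify_exit(status: int) -> str:
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        try:
            name = signal.Signals(sig).name   # e.g. SIGSEGV, SIGKILL, SIGABRT
        except ValueError:
            name = f"SIG{sig}"
        return f"killed_by_{name}"
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
        return "clean_exit" if code == 0 else f"error_exit_{code}"
    return "unknown"                          # stopped/continued, not an exit
```

A segfaulted child would be logged as `killed_by_SIGSEGV`, an OOM kill as `killed_by_SIGKILL`, matching the signal taxonomy above.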

Heartbeat Monitoring

The agent process sends periodic heartbeat signals to the watchdog. If the heartbeat stops (process is hung, deadlocked, or stuck in an infinite loop), the watchdog detects the absence and initiates recovery. The heartbeat interval and timeout are configurable. Default: heartbeat every 5 seconds, timeout after 15 seconds of silence. This catches hangs that process exit monitoring cannot detect because the process is technically still running.
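The timeout logic itself is simple. A minimal sketch, assuming the defaults above (5-second interval, 15-second timeout); the class and method names are illustrative, not osModa's API:

```python
import time

# Illustrative heartbeat-timeout detector with the default 15 s timeout.
class HeartbeatMonitor:
    def __init__(self, timeout: float = 15.0):
        self.timeout = timeout
        self.last_beat = time.monotonic()

    def beat(self):
        """Called each time the agent sends a heartbeat."""
        self.last_beat = time.monotonic()

    def is_hung(self, now=None):
        """True once the agent has been silent longer than the timeout."""
        now = time.monotonic() if now is None else now
        return (now - self.last_beat) > self.timeout
```

A hung agent stops calling `beat()`, so `is_hung()` flips to true after 15 seconds of silence even though the process still exists.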

Health Endpoint Checking

For agents that expose HTTP health endpoints, the watchdog periodically sends health check requests and validates the responses. This detects situations where the process is running, the heartbeat is active, but the agent is not actually functional -- for example, the LLM connection is broken, the database connection is lost, or an internal component has failed. The health check can validate response status codes, response body content, and response time thresholds.
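The validation step can be sketched as a policy check over the three dimensions named above (status code, body content, response time). The field names and defaults here are assumptions for illustration, not osModa's configuration schema:

```python
from dataclasses import dataclass

# Hedged sketch: a health-check policy and the predicate that applies it.
@dataclass
class HealthPolicy:
    expect_status: int = 200       # required HTTP status code
    expect_body: str = "ok"        # required substring in the response body
    max_latency_s: float = 2.0     # response-time budget in seconds

def healthy(policy: HealthPolicy, status: int, body: str, latency_s: float) -> bool:
    return (
        status == policy.expect_status
        and policy.expect_body in body
        and latency_s <= policy.max_latency_s
    )
```

A response that is fast and returns 200 but whose body reports an internal failure would fail the body check, which is exactly the "running but not functional" case this mechanism exists for.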

Resource Monitoring

The watchdog monitors memory usage, CPU utilization, file descriptor count, and disk I/O for each agent process. When usage exceeds configured thresholds, the watchdog can preemptively restart the agent before a hard failure occurs. This is particularly valuable for detecting memory leaks in long-running agents -- the watchdog can restart the agent during a natural pause in its work cycle rather than waiting for the OOM killer to terminate it abruptly.
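The preemptive-restart decision reduces to comparing current usage against configured ceilings. A sketch with assumed threshold values and parameter names, purely for illustration:

```python
# Illustrative threshold check for preemptive restart. The limits here are
# assumptions, not osModa defaults.
def should_preempt(mem_mb: float, cpu_pct: float, fds: int,
                   mem_limit_mb: float = 1024,
                   cpu_limit_pct: float = 95,
                   fd_limit: int = 4096) -> bool:
    """True if any monitored resource exceeds its configured ceiling."""
    return mem_mb > mem_limit_mb or cpu_pct > cpu_limit_pct or fds > fd_limit
```

When this returns true, the supervisor can schedule the restart for the agent's next idle window rather than letting the OOM killer strike mid-task.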

The Recovery Sequence

When the watchdog detects a failure, it follows a structured escalation sequence. Each step is logged in the tamper-evident audit ledger.

  1. Failure Detection and Classification

    The watchdog identifies the failure type (crash, hang, health check failure, resource exhaustion) and records the detection event in the audit ledger with full context: timestamp, agent ID, failure type, exit code/signal (if applicable), last known health state, and resource usage at time of failure. This classification determines the recovery strategy.

  2. Process Restart

    For crash and hang failures, the watchdog restarts the agent process using the current NixOS generation's configuration. Environment variables, secrets, and runtime parameters are re-injected from the secrets manager. The restart uses the same process configuration as the original launch, ensuring consistency. The restart event is logged with the new process ID and startup parameters.

  3. Health Verification

    After restart, the watchdog runs accelerated health checks to verify the agent is functional. If health checks pass, the recovery is complete and logged as successful. Total time from detection to verified recovery: median 6 seconds. If health checks fail, the watchdog proceeds to escalation.

  4. Escalation: NixOS Rollback

    If the restart fails (process does not start, health checks do not pass), the watchdog escalates to a NixOS generation rollback. This reverts the entire system configuration to the previous known-good state. This handles cases where a bad deployment introduced the failure -- rolling back the configuration often resolves the issue. The rollback and its outcome are logged in the audit ledger.

  5. Alerting

    At each stage, the watchdog can trigger configurable webhook alerts. You can set up alerts for any recovery event, only for escalations, or only for failures that cannot be automatically resolved. Integrate with Slack, PagerDuty, Opsgenie, or any webhook-compatible alerting system. Even without external alerts, every event is recorded in the audit ledger for review.
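The escalation above can be sketched as a small decision procedure. The callables (`restart`, `verify_health`, `rollback`, `alert`) stand in for the real daemon's actions, and the return labels are illustrative, not osModa's API:

```python
# Minimal sketch of the restart -> verify -> rollback escalation sequence.
def recover(restart, verify_health, rollback, alert, max_restarts: int = 2) -> str:
    for _attempt in range(max_restarts):
        restart()                      # relaunch with the current generation
        if verify_health():            # accelerated post-restart health checks
            alert("recovered")
            return "recovered"
    # Restarts failed: escalate to a rollback of the system configuration.
    if rollback() and verify_health():
        alert("rolled_back")
        return "rolled_back"
    alert("manual_intervention_required")
    return "failed"
```

Each branch corresponds to one audit-ledger outcome: a clean recovery, a recovery via rollback, or a critical event that needs a human.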

// Watchdog recovery timeline example
03:14:07.221  DETECT   agent:crewai-01 heartbeat timeout (15s)
03:14:07.223  ACTION   sending SIGTERM to pid 4821
03:14:07.224  AUDIT    logged: agent_hang_detected
03:14:09.501  DETECT   pid 4821 exited (SIGTERM)
03:14:09.503  ACTION   restarting agent:crewai-01
03:14:11.847  DETECT   pid 4935 started, running health checks
03:14:13.102  VERIFY   health check passed: HTTP 200 on :8080/health
03:14:13.103  AUDIT    logged: agent_recovery_success
03:14:13.104  RESULT   total recovery time: 5.883s

Common AI Agent Failure Modes

AI agents fail in ways that traditional software does not. The watchdog is designed to handle every category of failure that agents exhibit in production.

LLM API Timeouts

LLM providers experience latency spikes and outages. An agent waiting on an API response can hang indefinitely. The watchdog's heartbeat monitoring detects when the agent stops sending heartbeats during a blocked API call and triggers a restart. The agent resumes with a fresh connection to the LLM API.

Memory Leaks

Long-running agents accumulate memory through conversation histories, cached embeddings, and unreleased resources. Without intervention, the OOM killer terminates the process abruptly. The watchdog detects rising memory usage and can restart the agent preemptively during a natural pause, avoiding data loss from abrupt OOM termination.

Tool Execution Failures

Agents that call external tools (file operations, HTTP requests, database queries) can crash when tools return unexpected results. The watchdog detects the crash through process exit monitoring and restarts the agent. The audit ledger records the last tool call before the crash, helping you identify and fix the root cause.

Configuration Errors

A bad deployment can introduce configuration changes that cause agent startup failures. The watchdog detects the failure, attempts a restart, and when the restart also fails, escalates to a NixOS rollback. This reverts to the previous working configuration, often resolving the issue without any human intervention.

Frequently Asked Questions

How does the watchdog detect agent crashes?

The watchdog daemon monitors agent processes through multiple detection mechanisms: process exit monitoring (detects when the process terminates), heartbeat monitoring (detects when the process stops responding), health endpoint checking (detects when the process is running but unhealthy), and resource monitoring (detects memory leaks, CPU runaway, and file descriptor exhaustion). Each mechanism operates independently, so a failure in one detection method does not prevent others from catching the problem.

What is the 6-second recovery time?

6 seconds is the median recovery time from crash detection to the agent process being fully operational again. This includes: crash detection (typically under 1 second with heartbeat monitoring), process cleanup and state preparation (1-2 seconds), process restart (1-2 seconds), and health check verification (1-2 seconds). The 6-second figure is a median across production workloads -- simple agents recover faster, complex agents with longer initialization may take slightly longer.

What happens if the restart fails?

If the agent process fails to start after the restart attempt, the watchdog enters an escalation sequence. First, it retries the restart with a brief backoff. If the second attempt fails, it triggers a NixOS rollback to the previous known-good generation, which often resolves configuration-related startup failures. If the rollback also fails, the watchdog logs a critical event and can trigger webhook alerts. Every step of this escalation is recorded in the tamper-evident audit ledger.

How is this different from systemd restart?

systemd provides basic process restart (Restart=always), but it lacks the intelligence and integration of the osModa watchdog. The watchdog provides health-aware restart (not just process existence but actual health), memory leak detection and preemptive restart before OOM kill, integration with NixOS rollback for configuration-related failures, SHA-256 audit logging of every recovery event, and configurable escalation sequences. systemd is a general-purpose init system. The osModa watchdog is purpose-built for AI agent lifecycle management.

Can I configure the watchdog behavior?

Yes. The watchdog configuration is part of the NixOS system configuration. You can set: heartbeat interval and timeout, health check endpoints and expected responses, memory and CPU thresholds for preemptive restart, restart backoff strategy (linear, exponential, or fixed), maximum restart attempts before escalation, and webhook URLs for alerting. Changes to the watchdog configuration are themselves recorded in the audit ledger.
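As an illustration of the three backoff strategies mentioned above, here is how each might turn into a schedule of restart delays. The base delay and attempt count are assumed values, not osModa defaults:

```python
# Illustrative delay schedules for the three backoff strategies named above.
def backoff_delays(strategy: str, attempts: int = 4, base_s: float = 2.0):
    if strategy == "fixed":
        return [base_s] * attempts                       # 2, 2, 2, 2
    if strategy == "linear":
        return [base_s * (i + 1) for i in range(attempts)]   # 2, 4, 6, 8
    if strategy == "exponential":
        return [base_s * 2 ** i for i in range(attempts)]    # 2, 4, 8, 16
    raise ValueError(f"unknown strategy: {strategy}")
```

Exponential backoff is the usual choice for crash loops, since it quickly stops hammering a service that cannot start.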

Does the watchdog monitor all processes or just agents?

The watchdog monitors all 9 osModa Rust daemons and all registered agent processes. It does not monitor arbitrary system processes. Agent processes are registered with the supervisor daemon during deployment, which provides the watchdog with the health check parameters, expected behavior profile, and escalation preferences for each agent.

What about graceful shutdown vs hard restart?

The watchdog always attempts a graceful shutdown first, sending SIGTERM and waiting for a configurable grace period (default 30 seconds). If the process does not terminate within the grace period, the watchdog sends SIGKILL. For hang detection, the watchdog sends SIGTERM first, allowing the agent to save state if possible. The audit ledger records whether each restart was graceful or forced, providing insight into agent shutdown behavior.
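The SIGTERM-then-SIGKILL pattern can be sketched as follows. The `alive` callable stands in for whatever liveness check the supervisor uses (a real daemon would rely on process-exit notifications rather than polling); the function name and return labels are hypothetical:

```python
import os
import signal
import time

# Sketch of graceful-then-forced shutdown with a configurable grace period.
def stop_agent(pid: int, alive, grace_s: float = 30.0) -> str:
    """`alive()` reports whether the process is still running
    (e.g. `popen.poll() is None`)."""
    os.kill(pid, signal.SIGTERM)            # ask for graceful shutdown first
    deadline = time.monotonic() + grace_s
    while time.monotonic() < deadline:
        if not alive():
            return "graceful"               # exited within the grace period
        time.sleep(0.1)
    os.kill(pid, signal.SIGKILL)            # grace period expired: force it
    return "forced"
```

The returned label is what a ledger entry could record, giving the insight into shutdown behavior described above.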

Is the watchdog included on all plans?

Yes. The watchdog daemon, along with all 8 other Rust daemons, is included on every osModa plan from $14.99/month to $125.99/month. There are no premium tiers or add-ons for reliability features. Every plan includes the full self-healing stack: watchdog auto-restart, NixOS atomic rollback, and tamper-evident audit logging.

Your Agents Deserve a Watchdog

Every osModa plan includes the watchdog daemon with 6-second median recovery, NixOS rollback escalation, and tamper-evident audit logging. From $14.99/month.

Last updated: March 2026