What Is a Watchdog Daemon

A watchdog daemon is a supervisor process that continuously monitors other processes and automatically restarts them when they crash, hang, or fail health checks. osModa's watchdog is osmoda-watch, a Rust daemon that achieves a 6-second median restart time for crashed agent processes.

Why Watchdogs Matter for AI Agents

AI agents are long-running processes that crash. They crash because LLM API calls time out, because agent code has bugs, because dependencies fail, because the system runs out of memory, or because of a dozen other failure modes that are impossible to eliminate entirely. The question is not whether an agent will crash but how quickly it recovers.

Without a watchdog, a crashed agent stays dead until someone notices and manually restarts it. With a watchdog, the crash is detected within seconds and the agent is restarted automatically. For production workloads -- customer support bots, data pipeline agents, monitoring systems -- the difference between "dead until someone notices" and "6-second recovery" is the difference between an outage and a blip.

osmoda-watch: osModa's Watchdog Implementation

osmoda-watch is one of osModa's 9 daemons, purpose-built for process supervision on AI agent infrastructure. It monitors all agent processes and critical services on the server, including the other 8 daemons: agentd, osmoda-mcpd, osmoda-routines, osmoda-voice, osmoda-mesh, osmoda-keyd, osmoda-teachd, and osmoda-egress.

Crash Detection

osmoda-watch monitors process exit events via the kernel. When a supervised process terminates unexpectedly, the watchdog is notified immediately -- no polling delay. For hang detection, configurable health check endpoints are polled at regular intervals.
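As a minimal sketch of event-driven exit detection (not osmoda-watch's actual mechanism, which is only described above as "via the kernel"), the Python below blocks on a child process until the operating system reports its exit. The function names and the example command are hypothetical.

```python
import subprocess
import sys


def wait_for_exit(command: list[str]) -> int:
    """Start a child process and block until the OS reports that it exited.

    Popen.wait() without a timeout sleeps until the child terminates,
    so there is no polling loop here. Returns the child's exit code.
    """
    child = subprocess.Popen(command)
    return child.wait()


def exited_unexpectedly(exit_code: int) -> bool:
    """Treat any non-zero exit code as a crash.

    On POSIX, a negative value means the child was killed by a signal.
    """
    return exit_code != 0


if __name__ == "__main__":
    # Hypothetical supervised command: a child that exits with status 3.
    code = wait_for_exit([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(code, exited_unexpectedly(code))
```

A blocking wait like this only covers crashes. A hung process never exits, which is why hang detection needs a separate health check.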

6-Second Restart

The 6-second median restart includes crash detection, a brief backoff delay to prevent restart loops, process spawning, and health verification. Because the figure is a median, about half of restarts complete faster. The backoff increases for repeated failures to avoid thrashing on persistent errors.
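The exact backoff schedule is not documented here. One common way to make the delay grow with repeated failures is capped exponential backoff, sketched below. The base, factor, and cap values are illustrative assumptions, not osmoda-watch defaults.

```python
def backoff_delay(
    consecutive_failures: int,
    base: float = 1.0,   # assumed delay after the first failure, in seconds
    factor: float = 2.0,  # assumed growth factor per additional failure
    cap: float = 60.0,    # assumed upper bound on the delay, in seconds
) -> float:
    """Return the delay before the next restart attempt.

    The delay is `base` after the first failure and is multiplied by
    `factor` for each further consecutive failure, never exceeding `cap`.
    """
    if consecutive_failures < 1:
        return 0.0
    delay = base
    for _ in range(consecutive_failures - 1):
        delay *= factor
        if delay >= cap:
            return cap
    return min(delay, cap)
```

With these assumed values the delays run 1, 2, 4, 8 seconds and so on, up to the 60-second cap, so a process with a persistent error is retried less and less often.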

Audit Integration

Every crash detection and restart event is recorded in the SHA-256 hash-chained audit ledger. The entry includes the process name, exit code, crash timestamp, restart timestamp, and the resulting health check status.
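To illustrate what "hash-chained" means, here is a minimal sketch of a SHA-256 chain in Python. The entry layout, field names, and genesis value are assumptions made for the example. The real ledger format is not specified in this article.

```python
import hashlib
import json

# Assumed placeholder for the hash "before" the first entry.
GENESIS_HASH = "0" * 64


def entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with this event's canonical JSON."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(ledger: list, event: dict) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    ledger.append(
        {"event": event, "prev_hash": prev_hash, "hash": entry_hash(prev_hash, event)}
    )


def verify(ledger: list) -> bool:
    """Recompute every hash in order; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in ledger:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each entry's hash covers the previous hash, changing an old record (for example, rewriting an exit code) invalidates that entry and everything after it. That makes tampering detectable rather than impossible.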

Rollback Escalation

If a process fails repeatedly despite restarts, osmoda-watch can trigger an atomic rollback via NixOS SafeSwitch, reverting the entire system to the last known-good configuration.
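The threshold for "fails repeatedly" is not stated here. One plausible way to decide when to escalate is to count crashes inside a sliding time window, as in the sketch below. The class name, the default of 5 failures in 300 seconds, and the boolean return value are all assumptions. The sketch only signals that a rollback should be requested and does not call SafeSwitch.

```python
from collections import deque


class EscalationTracker:
    """Track recent crashes of one process and signal when to escalate."""

    def __init__(self, max_failures: int = 5, window_seconds: float = 300.0):
        # Both defaults are illustrative assumptions.
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()  # crash timestamps, oldest first

    def record_failure(self, now: float) -> bool:
        """Record a crash at time `now` (seconds).

        Returns True when the number of crashes within the window has
        reached `max_failures`, meaning restarts alone are not helping.
        """
        self._failures.append(now)
        # Drop crashes that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures
```

A windowed count distinguishes a process that crashes once a day (keep restarting it) from one that crashes every few seconds (restarting is not fixing anything).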

The Supervisor Hierarchy

Reliable supervision requires a chain of trust. On osModa, the hierarchy is:

Linux kernel -- starts systemd as PID 1, the root of the supervision chain
systemd -- supervises osmoda-watch with Restart=always
osmoda-watch -- supervises all agent processes and osModa daemons
Agent processes -- the workloads being supervised

Each layer supervises the layer below it. osmoda-watch is a minimal Rust binary with no external dependencies, which makes it extremely unlikely to crash. If it does, systemd restarts it. The kernel does not restart systemd in the same way: if PID 1 exits, Linux panics, and recovery then depends on the machine rebooting. In this layered approach each supervisor is simpler than what it supervises, so a failure at one level is handled by the level above.

Watchdog as Part of Self-Healing Infrastructure

The watchdog daemon is one of three pillars of osModa's self-healing infrastructure. The other two are atomic rollback (reverting broken configurations) and hash-chained audit logging (recording every failure and recovery for forensic analysis).

Together, these three components form a closed loop. osmoda-watch detects and recovers from process-level failures. When restarts are not sufficient (for example, the process keeps crashing because of a configuration issue), osmoda-watch escalates to SafeSwitch, which performs an atomic rollback to a known-good NixOS generation. Every event in this chain is recorded in the audit ledger for post-incident analysis.

For the full watchdog implementation details, see the Watchdog Auto-Restart documentation.

Frequently Asked Questions

What is a watchdog daemon?

A watchdog daemon is a long-running supervisor process that monitors other processes and automatically restarts them when they crash, hang, or fail health checks. The watchdog itself is designed to be extremely reliable -- it is typically a simple, well-tested process with minimal dependencies. If a supervised process dies, the watchdog detects the failure and spawns a new instance.

How does osmoda-watch work?

osmoda-watch is a Rust daemon that supervises all agent processes and osModa services on the server. It monitors process health through configurable checks: process existence, port responsiveness, and custom health endpoints. When a supervised process crashes or fails a health check, osmoda-watch restarts it with a median recovery time of 6 seconds. Every crash and restart is recorded in the SHA-256 hash-chained audit ledger.
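As an illustration of the "port responsiveness" check type, the Python sketch below tries to open a TCP connection within a timeout. The function name and the 1-second default timeout are assumptions. osmoda-watch's own check implementation is not shown in this article.

```python
import socket


def port_is_responsive(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A check like this can catch a process that is still running but no longer accepting connections, which a process-existence check alone would miss.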

Why is the restart time 6 seconds?

The 6-second median restart time includes crash detection (process exit monitoring), restart delay (brief backoff to avoid restart loops), process startup (spawning the new process), and health verification (confirming the new process is healthy). The figure is a median, so about half of restarts complete faster and half take longer. The backoff delay prevents thrashing when a process has a persistent failure.

How does a watchdog differ from systemd restart?

systemd provides basic restart functionality through the Restart= and RestartSec= directives, for example Restart=always. osmoda-watch builds on top of systemd with application-aware health checks, configurable restart policies per process, correlation with NixOS atomic rollback for persistent failures, and tamper-evident audit logging of every restart event. It is built around the specific needs of AI agent processes rather than generic service management.

What happens if osmoda-watch itself crashes?

osmoda-watch runs as a systemd service with Restart=always, so systemd restarts it if it fails. Additionally, osmoda-watch is a minimal Rust binary with no external dependencies beyond the OS, making crashes extremely rare. The design follows the supervisor hierarchy pattern: the kernel starts systemd as PID 1, systemd supervises osmoda-watch, and osmoda-watch supervises agent processes.

Can I configure different restart policies per process?

Yes. osmoda-watch supports per-process configuration including restart delay, maximum restart attempts, health check endpoints, health check intervals, and custom pre-restart and post-restart hooks. An agent that needs a warm cache might have a longer restart delay. A critical service might have aggressive health checking with no delay.
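To make the list of options concrete, here is a hypothetical model of a per-process policy in Python. The field names, defaults, process names, and URL are invented for illustration and do not reflect osmoda-watch's real configuration schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RestartPolicy:
    """Hypothetical per-process policy mirroring the options listed above."""

    restart_delay_seconds: float = 1.0
    max_restart_attempts: Optional[int] = None  # None means no limit
    health_check_url: Optional[str] = None
    health_check_interval_seconds: float = 10.0
    pre_restart_hook: Optional[str] = None   # command to run before a restart
    post_restart_hook: Optional[str] = None  # command to run after a restart


DEFAULT_POLICY = RestartPolicy()

# Hypothetical processes matching the two examples in the text.
POLICIES = {
    "cache-heavy-agent": RestartPolicy(
        restart_delay_seconds=15.0,
        health_check_url="http://127.0.0.1:8080/health",
    ),
    "critical-service": RestartPolicy(
        restart_delay_seconds=0.0,
        health_check_interval_seconds=1.0,
    ),
}


def policy_for(name: str, policies: dict) -> RestartPolicy:
    """Look up a process's policy, falling back to the default."""
    return policies.get(name, DEFAULT_POLICY)
```

Falling back to a default keeps every process supervised even when it has no explicit policy entry.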

Never Lose an Agent to a Crash

osmoda-watch supervises every process with a 6-second median restart time. Atomic rollback handles persistent failures. Plans from $29/month.
