An AI agent is not a web server. Web servers receive a request, send a response, and idle. AI agents maintain state, reason over multi-step workflows, call external APIs, execute generated code, and accumulate context over hours or days. Every one of those behaviors creates failure modes that traditional process management was not designed for.
A production lesson that catches every team: “Don't rely on an agent to monitor itself. A dead process can't tell you it's dead. External, mutual monitoring is the way to go.” This principle shapes everything in this guide. We will work from the bottom up: process supervision first, then crash recovery, memory management, log rotation, and finally external monitoring.
Layer 1: Process Supervision
The most basic requirement for running 24/7 is ensuring the process restarts automatically when it crashes. There are several approaches, each with trade-offs.
systemd (Linux Native)
systemd is the init system on most Linux distributions. It manages services (processes) with restart policies, resource limits, dependency ordering, and integrated logging via journald. For AI agents on dedicated servers or VPS, systemd provides the lowest-overhead supervision.
```ini
# /etc/systemd/system/my-agent.service
[Unit]
Description=My AI Agent
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Type=simple
User=agent
WorkingDirectory=/opt/agent
ExecStart=/opt/agent/venv/bin/python main.py
Restart=on-failure
RestartSec=5

# Resource limits
MemoryMax=4G
CPUQuota=200%

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=my-agent

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/opt/agent/data

[Install]
WantedBy=multi-user.target
```
Key directives explained:
- Restart=on-failure restarts only on non-zero exit (not on clean shutdown).
- RestartSec=5 waits 5 seconds between restarts to avoid hammering.
- StartLimitBurst=5 with StartLimitIntervalSec=300 means systemd stops restarting after 5 failures in 5 minutes, preventing infinite restart loops. (On modern systemd these two directives belong in the [Unit] section.)
- MemoryMax=4G kills the process if it exceeds 4 GB, preventing memory leaks from consuming the server.
Enable and start with: sudo systemctl enable --now my-agent. If you run the agent as a user-level service instead (systemctl --user), also run loginctl enable-linger agent so the service starts at boot and keeps running after you log out; a system-level unit with User=agent does not need lingering.
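systemd can also supervise more than exit codes: with Type=notify and a WatchdogSec= setting in the unit, systemd kills and restarts the agent if it stops pinging. A minimal sketch of the notification side in Python, using only the standard library (the WatchdogSec=30 value in the comment is an example; the function silently no-ops when not running under systemd):

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send a state message to systemd over the NOTIFY_SOCKET datagram
    socket. Returns False (a harmless no-op) outside of systemd."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)
    return True


# In the agent's main loop (unit needs Type=notify and e.g. WatchdogSec=30):
#   sd_notify("READY=1")      # once, when startup is complete
#   sd_notify("WATCHDOG=1")   # periodically, more often than WatchdogSec
```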
Docker
Docker provides process supervision through restart policies and environment isolation through containers. The trade-off is container overhead (typically 50–100 MB of additional memory) in exchange for dependency isolation.
```yaml
# docker-compose.yml
services:
  agent:
    build: .
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    volumes:
      - agent-data:/app/data
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"

volumes:
  agent-data:
```

The healthcheck directive is critical. Docker's restart policy alone only detects process exit. The health check sends an HTTP request to the agent every 30 seconds; if 3 consecutive checks fail, Docker marks the container as unhealthy. Note that on plain Docker, a restart policy does not act on health status: restart: unless-stopped covers crashes, but to restart a hung-but-running agent you need an auto-heal mechanism on top (Docker Swarm replaces unhealthy containers automatically; on Docker Compose, a sidecar such as willfarrell/autoheal can restart them).
Kubernetes
For teams already running Kubernetes, liveness and readiness probes provide health-based restart decisions. Set livenessProbe.httpGet.path: /health with failureThreshold: 3 and periodSeconds: 10. Kubernetes will kill and reschedule the pod if liveness checks fail. However, Kubernetes adds significant operational complexity. For most AI agent deployments (1–10 agents), systemd or Docker Compose is sufficient. Kubernetes becomes justified at scale (dozens of agents across multiple nodes).
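A sketch of those probe settings in a container spec (the port and endpoint path are assumptions; match them to your agent's health server):

```yaml
# Fragment of a pod's container spec
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
```

The readiness probe keeps traffic away from an agent that is alive but not yet (or no longer) able to serve; only the liveness probe triggers a restart.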
Approach Comparison
| Feature | systemd | Docker | Kubernetes | osModa |
|---|---|---|---|---|
| Auto-restart | Yes | Yes | Yes | Yes |
| Health checks | Manual (sd_notify) | Built-in | Built-in | Built-in |
| Memory limits | MemoryMax | --memory | limits.memory | Pre-configured |
| Log rotation | journald | json-file driver | Per-node config | Automatic |
| Rollback on failure | No | Image-level | Deployment-level | OS-level (NixOS) |
| Audit logging | No | No | Via add-ons | SHA-256 ledger |
| Complexity | Low | Medium | High | Low (managed) |
Layer 2: Crash Recovery
Restarting the process is necessary but not sufficient. AI agents maintain state — conversation history, workflow progress, partially completed tasks. A bare restart loses all of this.
Checkpointing: Design your agent to periodically save its state to persistent storage (database, file system, or Redis). After restart, the agent loads the latest checkpoint and resumes from where it left off. The checkpointing interval depends on the cost of re-doing work — for expensive LLM calls, checkpoint after every significant step.
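A minimal file-based checkpointing sketch (the path and JSON-serializable state are assumptions; a database or Redis works the same way). The atomic rename matters: a crash mid-write must never corrupt the only copy of the state:

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "/opt/agent/data/checkpoint.json"  # assumed location


def save_checkpoint(state: dict, path: str = CHECKPOINT_PATH) -> None:
    """Write state atomically: write a temp file in the same directory,
    fsync it, then rename over the old checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic on POSIX filesystems


def load_checkpoint(path: str = CHECKPOINT_PATH):
    """Return the last saved state, or None on first start."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

On startup, the agent calls load_checkpoint() and resumes from the recorded step instead of starting the workflow over.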
Graceful shutdown: Handle SIGTERM in your agent code. When systemd or Docker sends SIGTERM before killing the process, use the grace period (typically 30 seconds) to flush state, complete the current task step, and save a final checkpoint. Set TimeoutStopSec=30 in systemd or stop_grace_period: 30s in Docker Compose.
Deployment rollback: Sometimes the crash is caused by a bad deployment, not a transient error. If your agent crashes in a loop after an update, process supervision will keep restarting it until it hits the burst limit and gives up. NixOS atomic rollback solves this by reverting the entire system to the last known-good state. osModa's SafeSwitch mechanism automates this: if the watchdog detects repeated failures after a deployment, it triggers a NixOS generation rollback. Details are on the watchdog auto-restart page.
Layer 3: Memory Management
Memory management is the most underestimated challenge for long-running AI agents. Unlike web servers with short request lifecycles, AI agents accumulate state: growing conversation contexts, cached embeddings, tool results, and intermediate reasoning outputs. Without active management, memory grows monotonically until the agent is killed by the OOM (Out of Memory) killer.
Set hard limits: Always configure memory limits at the process supervisor level. This prevents a leaky agent from consuming all server memory and crashing other services or the OS itself. systemd's MemoryMax and Docker's --memory both enforce cgroup-level limits.
Trim conversation context: Implement a context window strategy. Keep the most recent N messages in memory and summarize or discard older ones. Most agent frameworks support sliding-window context management. For LangGraph agents, use add_messages with a trimming strategy. For custom agents, implement a circular buffer.
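A minimal sliding-window sketch for a custom agent (the message shape and the keep-the-system-prompt rule are assumptions; production code would often summarize the dropped middle rather than discard it):

```python
def trim_context(messages: list, max_messages: int = 20) -> list:
    """Keep the first message (assumed to be the system prompt) plus the
    most recent messages, dropping the middle when the window overflows."""
    if len(messages) <= max_messages:
        return messages
    return [messages[0]] + messages[-(max_messages - 1):]
```

Run this before every LLM call so the in-memory context is bounded regardless of how long the agent has been up.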
Scheduled restarts: Some teams schedule periodic restarts during low-traffic windows (e.g., 4 AM local time) as a pragmatic defense against slow memory leaks that are difficult to diagnose. This is not a substitute for fixing the root cause, but it provides a safety net while you investigate. Configure it with a systemd timer or cron job that runs systemctl restart my-agent on a schedule.
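A systemd timer sketch for the nightly restart (unit names are examples; the timer pairs with a one-shot service that runs systemctl restart my-agent):

```ini
# /etc/systemd/system/agent-restart.timer
[Unit]
Description=Nightly restart of my-agent

[Timer]
OnCalendar=*-*-* 04:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Persistent=true runs a missed restart at the next boot if the server was down at 4 AM, which a plain cron entry would silently skip.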
Layer 4: Log Rotation
AI agents are verbose. An active agent generating debug logs can produce 1–10 GB of log output per day. Without rotation, logs fill the disk, which causes the agent (and potentially the entire server) to crash.
systemd + journald: Configure journal size limits in /etc/systemd/journald.conf. Set SystemMaxUse=2G to cap journal storage. Older entries are automatically pruned. Use journalctl -u my-agent --since "1 hour ago" to query recent logs.
Docker: Configure the json-file logging driver with max-size: "100m" and max-file: "5" to automatically rotate at 100 MB and keep the last 5 files (500 MB total). For production systems, consider shipping logs to a centralized service (Datadog, Loki, or CloudWatch) and keeping only minimal local retention.
Layer 5: External Monitoring
Process supervision handles crashes. Health checks handle hangs. But neither answers the question: “Is my agent actually doing its job correctly?” External monitoring fills this gap.
Infrastructure Monitoring
Track CPU, memory, disk usage, and network throughput. Alert when CPU sustains over 90% for 5+ minutes, memory exceeds 80% of limit, or disk usage passes 85%. Tools: Prometheus + Grafana, Datadog, or simple node_exporter with AlertManager.
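The CPU and disk thresholds above can be expressed as Prometheus alerting rules; a sketch assuming node_exporter metric names (adjust mountpoints and labels to your hosts):

```yaml
# prometheus alerting rules (fragment)
groups:
  - name: agent-host
    rules:
      - alert: HighCPU
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 5m
      - alert: DiskFilling
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 10m
```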
Application Monitoring
Track agent-specific metrics: tasks completed per hour, average task duration, error rate, and queue depth. Alert on task completion rate dropping below baseline or error rate exceeding threshold. These metrics distinguish a running agent from a productive one.
AI-Specific Monitoring
Track token usage per task (cost control), reasoning loop depth (detect infinite loops), API call patterns (detect runaway tool usage), and output quality metrics (if applicable). Alert on unusual API cost spikes or reasoning loops exceeding expected depth. These are the failure modes unique to AI agents that traditional monitoring misses.
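Reasoning-loop depth is cheap to guard inside the agent itself; a sketch (the 25-step budget is an arbitrary example, tune it per workload):

```python
class LoopGuard:
    """Abort a reasoning loop that exceeds an expected depth instead of
    letting it burn tokens indefinitely."""

    def __init__(self, max_depth: int = 25):
        self.max_depth = max_depth
        self.depth = 0

    def step(self) -> None:
        """Call once per reasoning iteration; raises past the budget."""
        self.depth += 1
        if self.depth > self.max_depth:
            raise RuntimeError(
                f"reasoning loop exceeded {self.max_depth} steps"
            )
```

Catching that RuntimeError at the task level lets the agent log the runaway task and move on, while the monitoring layer alerts on the error rate.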
External Health Checking
Run health checks from outside the agent process. A simple cron job or external service (UptimeRobot, Pingdom, or a custom script on a separate machine) that sends HTTP requests to the agent's health endpoint and alerts on failure. The golden rule: the monitor must be independent of the thing it monitors.
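A minimal external checker sketch in Python, run from the monitoring host (the URL and the alert command in the cron comment are placeholders for your own endpoint and paging setup):

```python
import urllib.error
import urllib.request


def check_health(url: str, timeout: float = 10.0) -> bool:
    """Return True if the agent's health endpoint answers 2xx in time.
    Must run on a machine that is NOT hosting the agent."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


# Example cron entry on the monitoring host:
# */5 * * * * python3 /opt/monitor/check.py || /opt/monitor/alert.sh
```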
How osModa Handles All Five Layers
osModa bundles all five layers into its platform so you do not have to configure each one manually:
Process supervision: 9 Rust daemons manage agent processes with automatic restart and resource isolation.
Crash recovery: The watchdog performs health-check validation with sub-6-second recovery. When a deployment causes repeated failures, SafeSwitch triggers a NixOS atomic rollback to the last known-good state.
Memory management: Resource limits are pre-configured per plan with cgroup enforcement, preventing any single agent from consuming the entire server.
Log management: The SHA-256 audit ledger provides tamper-proof logging of every agent action and tool invocation, with automatic rotation and retention policies.
Monitoring: External health checking is built into the platform infrastructure, independent of the agent processes. Plans start at $14.99/month on dedicated Hetzner servers. See the self-healing servers page for the full architecture, the agent crash debugging guide for common failure modes, or the deployment guide to get started.
Frequently Asked Questions
What is the best way to keep an AI agent running 24/7?
Use a process supervisor that automatically restarts the agent on failure. On bare-metal Linux, systemd is the standard choice — it handles restart policies, resource limits, logging, and boot-time startup. In containerized environments, Docker with --restart=unless-stopped or Kubernetes with liveness probes provides equivalent functionality. osModa's built-in watchdog adds health-check validation on top of process supervision, only restarting when the agent is genuinely unhealthy rather than just restarting blindly.
Why do AI agents crash more than traditional web services?
AI agents have more failure modes than typical web services. They make external API calls to LLM providers that can timeout, return errors, or rate-limit. They maintain complex in-memory state that can corrupt. They execute user-provided or generated code that can segfault or enter infinite loops. They consume growing amounts of memory as conversation context accumulates. And they depend on external tools that may become unavailable. Each of these failure modes requires specific monitoring and recovery strategies.
How should I handle memory leaks in a long-running AI agent?
First, set hard memory limits using systemd's MemoryMax directive or Docker's --memory flag to prevent a leaky agent from consuming all system memory and crashing other services. Second, monitor memory growth over time and set alerts for when usage exceeds 80% of the limit. Third, implement periodic graceful restarts — some teams schedule nightly restarts during low-traffic windows to clear accumulated memory. Fourth, investigate the root cause: common sources include growing conversation contexts, cached model outputs, and unreleased HTTP connections.
What is the difference between a process supervisor and a watchdog?
A process supervisor (systemd, supervisord, Docker) monitors whether a process is running and restarts it if it exits. It operates at the process level — if the process is alive, the supervisor considers it healthy. A watchdog goes deeper: it actively checks whether the agent is functioning correctly by sending health-check requests and validating responses. An agent can be 'alive' (process running) but 'unhealthy' (stuck in a loop, deadlocked, or returning errors). A watchdog catches these cases; a process supervisor alone does not.
How do I set up log rotation for a long-running AI agent?
If using systemd, journald handles log rotation automatically with configurable size limits (SystemMaxUse and RuntimeMaxUse in journald.conf). For file-based logging, configure logrotate with daily or size-based rotation, compression (gzip), and retention policies (keep 14-30 days). In Docker, use the json-file logging driver with --log-opt max-size=100m and --log-opt max-file=5 to automatically rotate container logs. Without log rotation, a chatty AI agent can fill a disk within days.
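A logrotate sketch matching those retention numbers (the log path is an assumption; copytruncate avoids having to teach the agent to reopen its log file):

```text
# /etc/logrotate.d/my-agent
/opt/agent/logs/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```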
Should I use Docker or systemd for running AI agents?
Use systemd if your agent runs on a dedicated server or VPS where you control the OS and want minimal overhead. Systemd adds zero abstraction cost, integrates with journald for logging, and provides fine-grained resource controls. Use Docker if you need environment isolation, dependency packaging, or plan to deploy across multiple environments. Docker adds container overhead but ensures your agent runs identically regardless of the host OS. osModa uses NixOS with systemd-based process supervision, combining the isolation benefits of declarative packaging with the minimal overhead of native process management.
How do I monitor an AI agent in production?
Monitor at three levels: infrastructure (CPU, memory, disk, network), application (request latency, error rates, task completion rates), and AI-specific (token usage, API call patterns, reasoning loop depth). Set alerts for: CPU over 90% sustained, memory over 80% of limit, disk over 85% capacity, agent health check failures, and unusual API cost spikes. External monitoring is critical — a dead agent cannot report its own failure. Use an external health checker that runs independently of the agent process.
What happens when an AI agent crashes at 3 AM?
With proper process supervision, the agent restarts automatically — typically within 1-5 seconds for systemd, 5-10 seconds for Docker with health checks. The key question is what happens to in-flight work. Well-designed agents checkpoint their state to persistent storage (database, disk) so they can resume interrupted tasks after restart. Without checkpointing, any work in progress at crash time is lost. osModa's watchdog provides sub-6-second recovery with configurable health checks and NixOS-level rollback for deployment-related crashes.