AI agent crashes in production are not rare exceptions. They are the norm. An analysis of 847 AI agent deployments in 2026 found that 76% failed in production. The leading causes were not exotic edge cases — they were mundane infrastructure problems: memory leaks, dependency conflicts, unhandled API errors, and processes that hung silently with no health check to catch them.
The frustrating part is that most of these crashes are recoverable. The agent just needs to be restarted. Maybe the system state needs to be rolled back to a known-good configuration. But without automated recovery, every crash requires a human: someone to notice, someone to SSH in, someone to diagnose, someone to restart. At 3 AM on a Saturday, that someone is probably not available.
This article breaks down the four most common categories of agent crashes, explains how to diagnose each one, and shows how self-healing infrastructure can handle all of them automatically.
Crash Type 1: Out-of-Memory (OOM) Kills
The out-of-memory killer is the single most common cause of AI agent crashes. When the Linux kernel runs out of available memory, it selects a process to kill based on its memory footprint and OOM score. For AI agents — which often consume the most memory on the server — the OOM killer almost always chooses them.
Why AI Agents Are Especially Vulnerable
AI agents accumulate memory in ways that traditional web services do not. Every conversation turn adds to the context window. Every tool call result is stored in memory. Long-running agents that handle thousands of interactions can grow their memory footprint from 500 MB to 5 GB over the course of a day. Without explicit memory management, this growth is unbounded.
A real-world example from early 2026: a Claude Desktop release shipped with a memory leak that caused memory consumption to balloon from 467 MB to 7.5 GB in just 20 seconds, triggering immediate OOM kills. This was a high-profile case, but subtler memory leaks happen constantly in production agents where context is accumulated over hours or days rather than seconds.
How to Diagnose OOM Kills
```bash
dmesg | grep -i "out of memory"
journalctl -k | grep -i oom
cat /proc/meminfo | head -5
```

If the OOM killer terminated your agent, you will see a kernel message like `Out of memory: Killed process [PID] (agent_name) total-vm:5242880kB`. The total-vm value tells you how much virtual memory the process was using.
How to Fix It
Set memory limits. Use cgroups or container memory limits to constrain your agent. When the agent hits the limit, it gets killed predictably rather than causing the entire server to swap thrash.
Implement context pruning. Truncate conversation history beyond a sliding window. Summarize old context rather than keeping raw messages. Release tool call results after processing.
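A minimal sketch of sliding-window pruning, assuming the common chat-message list format; the `summarize` helper is a hypothetical placeholder for a real summarization call:

```python
# Sliding-window context pruning: keep the system prompt plus the last
# `window` messages; older turns collapse into a single summary stub.

def summarize(messages):
    # Placeholder: a production agent would call a model here (assumption).
    return {"role": "system", "content": f"[summary of {len(messages)} older messages]"}

def prune_context(messages, window=20):
    """Return a bounded message list: system prompt + summary + recent turns."""
    if len(messages) <= window + 1:
        return messages
    system, rest = messages[0], messages[1:]
    old, recent = rest[:-window], rest[-window:]
    return [system, summarize(old)] + recent
```

Calling `history = prune_context(history)` after each turn keeps memory bounded no matter how long the conversation runs.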
Monitor before you crash. Track RSS (Resident Set Size) over time. Alert at 70% of your memory limit so you can take action before the OOM killer does. osModa's watchdog does this automatically, restarting the agent with fresh state when memory exceeds configurable thresholds.
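A small sketch of the "alert at 70%" check, assuming a known memory limit (e.g. the cgroup limit you configured) and a Linux host, where `ru_maxrss` is reported in KiB:

```python
import resource

# Alert when peak resident memory crosses 70% of a configured limit,
# well before the OOM killer acts. `limit_mb` is an assumed cgroup limit.

def memory_fraction(limit_mb):
    """Peak RSS as a fraction of the limit (Linux: ru_maxrss is in KiB)."""
    rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return (rss_kib / 1024) / limit_mb

def should_alert(limit_mb, threshold=0.70):
    return memory_fraction(limit_mb) >= threshold
```

Run this check on a timer and page (or restart proactively) when it trips, rather than waiting for the kernel message.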
Crash Type 2: Dependency Drift
Your agent was working yesterday. You did not change any code. But now it crashes on startup with a cryptic ImportError or segmentation fault. What happened? A background system update ran overnight.
On Ubuntu servers, the unattended-upgrades package runs by default and automatically installs security updates. This is good practice for security, but it means system libraries can change without warning. If your agent's Python extension modules were compiled against libssl 1.1 and the update installed libssl 3.0, your agent will fail with symbol lookup errors.
Dependency drift also occurs at the Python package level. Unpinned dependencies in requirements.txt (e.g., langchain>=0.2 instead of langchain==0.2.14) can pull in breaking changes when you redeploy or recreate the virtual environment.
How to Diagnose It
```bash
cat /var/log/unattended-upgrades/unattended-upgrades.log
apt list --installed 2>/dev/null | grep libssl
pip freeze | diff - requirements.txt
```

How to Fix It
Pin every dependency. Use lock files. Consider disabling unattended-upgrades for production agent servers (accepting the security trade-off) or at least pinning to specific package versions.
Better yet, use NixOS. On NixOS, the entire system state is declaratively defined. Nothing changes unless you explicitly update the configuration. osModa builds on NixOS, which means dependency drift is structurally impossible — the system state is defined by a Nix flake, and any change creates a new generation that can be rolled back in seconds. If a dependency change does cause a crash, SafeSwitch automatically reverts to the previous configuration. See the self-healing agent servers page for the full architecture.
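For illustration, a hypothetical NixOS module defining an agent service (the service name, paths, and limit are assumptions, not osModa's actual configuration):

```nix
# Hypothetical NixOS module: the agent service is defined declaratively,
# so every rebuild produces a new, rollback-able system generation.
{ pkgs, ... }:
{
  systemd.services.my-agent = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      ExecStart = "${pkgs.python3}/bin/python /srv/agent/main.py";
      Restart = "always";
      MemoryMax = "2G";   # cgroup memory limit enforced by systemd
    };
  };
}
```

Because the Python interpreter and system libraries are pinned by the Nix store path, an overnight `apt` upgrade simply has nothing to change.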
Crash Type 3: Silent Hangs
Silent hangs are the most insidious type of agent failure because they are invisible to basic monitoring. The process is running. The PID exists. Systemd reports the service as “active (running).” But the agent has stopped processing tasks.
Common Causes of Silent Hangs
Deadlocked event loop. The agent's async event loop is stuck waiting for a lock that will never be released. This is common in agents that mix synchronous and asynchronous code.
Stuck HTTP connection. An API call to an external service (OpenAI, Anthropic, a database) hangs indefinitely because no timeout was set. The agent waits forever for a response that never comes.
Connection pool exhaustion. The database connection pool is full. Every connection is checked out but none are being returned (perhaps due to a leak or a long-running query). New requests block waiting for a connection that will never become available.
Infinite retry loop. The agent encounters an error and retries endlessly with no backoff limit. It is technically “working” — retrying the same failed operation thousands of times per second — but accomplishing nothing.
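The infinite-retry failure mode has a standard fix: bound the attempts and back off exponentially. A minimal sketch, where `op` stands in for any fallible operation:

```python
import random
import time

# Bounded retry with exponential backoff and jitter. After max_attempts
# the error is surfaced instead of spinning forever.

def retry_with_backoff(op, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: let the supervisor see the failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds
```

Injecting `sleep` as a parameter also makes the backoff behavior trivially testable.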
How to Diagnose It
Silent hangs require active probing, not passive monitoring. Check whether the agent responds to health check endpoints. Verify that the task queue depth is decreasing (tasks are being processed). Look at the agent's thread dump to identify where it is stuck:
```bash
kill -SIGUSR1 $(pidof agent)  # trigger thread dump (if supported)
strace -p $(pidof agent) -e trace=network  # check for stuck syscalls
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health
```

How to Fix It
Set timeouts on every external call. Implement a health check endpoint that your monitoring can probe. Use circuit breakers for API calls that may fail.
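A minimal circuit-breaker sketch, assuming nothing about your HTTP client; after a run of consecutive failures it fails fast instead of letting every request hang on a dead dependency:

```python
import time

# Minimal circuit breaker: after `max_failures` consecutive errors, calls
# fail immediately for `reset_after` seconds, then one probe is allowed.

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, op):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

Usage would look like `breaker.call(lambda: client.get(url, timeout=10))`: the timeout handles a single stuck connection, the breaker handles a persistently failing service.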
osModa's watchdog detects silent hangs through application-level health checks. It does not just check if the PID exists — it probes the agent's HTTP health endpoint and monitors behavioral signals like task processing rate. If the agent is alive but not working, the watchdog restarts it. Learn how on the watchdog auto-restart page.
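The core of a behavioral health check is a freshness test, not a liveness test. A sketch under the assumption that the worker loop records a timestamp after each completed task (the 30-second idle window is illustrative):

```python
import time

# Application-level health: "alive" is not enough. The agent must also
# have finished a task within the idle window.

LAST_TASK_AT = {"ts": time.monotonic()}  # worker loop updates this after each task

def is_healthy(max_idle_seconds=30.0, now=None):
    """True only if the agent processed a task recently."""
    now = time.monotonic() if now is None else now
    return (now - LAST_TASK_AT["ts"]) <= max_idle_seconds
```

Serve the result from your /health endpoint (200 when healthy, 503 otherwise) so a deadlocked-but-running process reads as unhealthy to the watchdog.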
Crash Type 4: Unhandled Exceptions
The classic crash: an input the agent was not programmed to handle triggers an unhandled exception, and the process exits. In Python, this is an uncaught Exception that propagates to the top level. In Node.js, it is an unhandled Promise rejection or an error in a callback without a try/catch.
AI agents are especially prone to unhandled exceptions because they interact with unpredictable external systems. An API returns an unexpected JSON structure. A model generates a response that your parsing logic does not handle. A tool call produces output in a format the agent did not anticipate. State management for AI agents is a known challenge — Nanonets reported in 2026 that variable and state management failures are among the top reasons agents fail in production.
How to Diagnose It
```bash
journalctl -u agent.service --since "1 hour ago" | tail -50
systemctl status agent.service  # check exit code
```

The exit code tells the story. Exit code 1 is a generic application error. Death by signal is reported as 128 plus the signal number: exit status 137 (signal 9, SIGKILL) usually means the OOM killer, and 139 (signal 11, SIGSEGV) means a segmentation fault in a native extension. Check the application logs for the stack trace.
How to Fix It
Wrap your agent's main loop in a top-level exception handler. Log the exception with full stack trace, then either retry the current task or skip it and continue. Never let a single bad input crash the entire process.
```python
# Python example: resilient agent loop
while True:
    task = None
    try:
        task = queue.get()
        result = agent.process(task)
        queue.ack(task)
    except KeyboardInterrupt:
        break
    except Exception as e:
        logger.error(f"Task failed: {e}", exc_info=True)
        if task is not None:
            queue.nack(task)  # retry later; skipped if queue.get() itself failed
        continue
```

Even with perfect exception handling, unexpected crashes will still happen. The question is whether your infrastructure can recover automatically. osModa's watchdog catches the process exit, logs it to the audit ledger, and restarts the agent within 6 seconds.
The Recovery Matrix: Manual vs. Automated
Every crash type has both a manual fix (someone SSHs in and takes action) and an automated fix (the infrastructure handles it). The difference is downtime.
| Crash Type | Manual Recovery | osModa Recovery |
|---|---|---|
| OOM Kill | 15–45 min | 6 seconds |
| Dependency Drift | 30–120 min | 6 seconds (rollback) |
| Silent Hang | Unknown (until noticed) | 30 seconds (detected + restarted) |
| Unhandled Exception | 15–30 min | 6 seconds |
How osModa's Watchdog Auto-Recovers Your Agent
osModa's self-healing stack is designed to handle all four crash types automatically, without any configuration from you. Here is how the recovery pipeline works:
Step 1: Detection (continuous)
The Rust watchdog daemon monitors your agent at three levels: process existence (is the PID alive?), health endpoint response (does /health return 200 within 500ms?), and behavioral signals (is the agent processing tasks?). Checks run every 2 seconds.
Step 2: Triage (500ms)
When a failure is detected, SafeSwitch evaluates the context. First failure with no recent deployment? Simple restart. Repeated crashes after a deployment? Roll back to the previous NixOS generation. Memory-related crash? Restart with fresh state and alert.
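The triage step can be pictured as a small decision function. This is an illustrative sketch in the spirit of the rules above, not osModa's actual SafeSwitch implementation:

```python
# Illustrative triage: map failure context to a recovery action,
# mirroring the decision rules described in the text (assumption).

def triage(consecutive_failures, deployed_recently, oom_killed):
    if oom_killed:
        return "restart_fresh_state_and_alert"
    if consecutive_failures > 1 and deployed_recently:
        return "rollback_previous_generation"
    return "restart"
```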
Step 3: Recovery (1–5 seconds)
The watchdog either restarts the agent process directly or triggers a NixOS generation rollback (an atomic symlink switch that takes milliseconds). The agent starts with the recovered configuration.
Step 4: Validation (1 second)
The watchdog runs health checks against the restarted agent. If the agent passes all checks, it is marked healthy and resumes processing. If it fails again, SafeSwitch escalates to the next recovery strategy.
Step 5: Audit (immediate)
The entire incident — detection timestamp, failure type, recovery action, validation result — is recorded in the tamper-proof SHA-256 audit ledger. You can review every crash and recovery in the morning.
The total time from failure detection to validated recovery is a median of 6 seconds. Your agent is back online before your monitoring alert even fires. See the full watchdog architecture on the watchdog auto-restart page, or explore the AI agent hosting platform.
Production Crash Prevention Checklist
Before deploying your agent to production, run through this checklist. Each item addresses one of the four crash types covered above.
- Set explicit memory limits (cgroup, container, or ulimit)
- Implement context window pruning for conversation-based agents
- Pin all dependencies with exact versions and lock files
- Disable or control unattended system updates
- Set timeouts on every external HTTP call (5–30 seconds)
- Implement a /health endpoint that validates actual agent function
- Wrap the main loop in a top-level exception handler
- Log exceptions with full stack traces before retrying
- Configure process supervision (systemd, watchdog, or osModa)
- Test your recovery mechanism before going to production
If you want all of this handled for you, osModa provides pre-configured self-healing on every plan. Watchdog monitoring, NixOS rollback, audit logging, and health checks are all included starting at $14.99/month. Deploy through spawn.os.moda and stop worrying about 3 AM crashes.
Frequently Asked Questions
Why does my AI agent keep crashing?
The most common causes are: out-of-memory (OOM) kills from unbounded context or conversation history, dependency drift after system updates break libraries your agent relies on, silent hangs where the process is alive but not doing work, and unhandled exceptions from rare inputs or API failures. Each cause requires a different fix, but all can be automatically handled by a self-healing server with watchdog monitoring and atomic rollback.
How do I check if my agent was OOM killed?
Check the kernel log with 'dmesg | grep -i oom' or 'journalctl -k | grep -i oom'. If the OOM killer terminated your agent, you will see a message like 'Out of memory: Killed process [PID] (agent_name)'. You can also check the agent's cgroup memory stats in /sys/fs/cgroup/ to see if it hit its memory limit. On osModa, OOM events are automatically logged to the tamper-proof audit ledger with full memory statistics.
How do I prevent dependency drift from crashing my agent?
On traditional Linux servers, pin all package versions in your requirements.txt or package.json and avoid running unattended apt upgrades. Better yet, use NixOS, where the entire system state is declaratively defined and changes require explicit configuration updates. osModa uses NixOS under the hood, so dependency drift is structurally impossible — your system state is defined by a Nix flake, and any change creates a new rollback-able generation.
What is a silent hang and how do I detect it?
A silent hang is when your agent process is still running (the PID exists) but has stopped doing useful work. Common causes include deadlocked event loops, stuck HTTP connections with no timeout, database connection pool exhaustion, and infinite retry loops. Systemd cannot detect silent hangs because it only checks whether the PID is alive. You need application-level health checks that verify the agent is processing tasks, responding to probes, and maintaining normal memory and CPU patterns.
How does osModa's watchdog detect agent crashes?
osModa's Rust watchdog daemon monitors agents at three levels: process-level (is the PID alive?), health-check level (does the HTTP endpoint respond within 500ms?), and behavioral level (has the agent processed a task in the last 30 seconds?). If any check fails three consecutive times, the watchdog triggers recovery — either a simple restart for process crashes or a SafeSwitch rollback for deployment-related failures. The entire detection-to-recovery cycle completes in a median of 6 seconds.
Can I set up crash recovery without osModa?
Yes. Basic process restart is available through systemd's Restart=always directive. For more sophisticated recovery, you can write custom health check scripts, configure systemd watchdog integration, and build your own rollback mechanism. However, building production-grade self-healing (with atomic rollback, crash-loop detection, audit logging, and application-level health checks) typically takes 2-4 weeks of engineering time. osModa provides all of this pre-configured.
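A minimal sketch of the systemd approach (the unit name and paths are assumptions):

```ini
# /etc/systemd/system/agent.service — basic restart-on-crash supervision
[Unit]
Description=AI agent with basic crash recovery

[Service]
ExecStart=/usr/bin/python3 /srv/agent/main.py
Restart=always
RestartSec=5
# WatchdogSec only helps if the agent pings sd_notify("WATCHDOG=1"):
# WatchdogSec=30

[Install]
WantedBy=multi-user.target
```

Note that `Restart=always` only covers process exits; it cannot detect silent hangs or roll back a bad deployment on its own.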
My agent crashes only at night. Why?
Night-time crashes often indicate resource exhaustion from accumulated state. Common causes include: memory leaks that only become fatal after hours of operation, log files filling the disk, database connection pools leaking over time, and scheduled system updates (unattended-upgrades on Ubuntu) that restart services or change dependencies. Check 'journalctl --since yesterday' and look for timestamps correlating with the crashes.
How much memory does an AI agent typically need?
It depends on the agent architecture. A lightweight API-calling agent (e.g., calling GPT-4 via HTTP) can run on 512 MB. An agent maintaining conversation context across many sessions might need 2-4 GB. An agent running a local model (e.g., Llama 3 8B) needs 16+ GB. The key is to set explicit memory limits via cgroups or container constraints and monitor actual usage over time, rather than guessing. osModa's watchdog tracks memory usage and alerts before the OOM killer intervenes.