AI Agent Monitoring & Observability

An agent that runs but cannot be observed is an agent that will fail silently. This guide covers every observability layer on osModa: dashboard live logs, agentd health endpoints, osmoda-watch crash detection and auto-recovery, the SHA-256 hash-chained audit ledger, and how to set up real-time alerts via Telegram, Discord, Slack, and WhatsApp.

Last updated: May 2026

Monitoring layers

Layer 1: Dashboard — Live log streaming, agent status, file access, model configuration — all from a web interface.
Layer 2: agentd — Health monitoring, event logging, memory management, backups. The central nervous system of observability.
Layer 3: osmoda-watch — Crash detection, silent hang detection, auto-restart with backoff, SafeSwitch deployment tracking.
Layer 4: Audit ledger — Every mutation hash-chained with SHA-256. Tamper-evident forensic trail.

Dashboard: Live Logs and Agent Management

The osModa dashboard at spawn.os.moda is the first place to check your agent's status. It provides a web interface for agent CRUD operations, live log streaming, file browsing, and configuration management — without needing to SSH into your server.

The dashboard supports multi-model configuration. You can switch your agent between Claude Opus, Sonnet, Haiku, GPT-4o, and o3-mini without redeploying. It also manages multi-channel connectivity — connect your agent to Telegram, WhatsApp, Discord, Slack, or embed it as a web chat widget.

Live log streaming

Watch your agent's stdout and stderr in real time from the browser. Filter by severity, search for patterns, and export logs for external analysis. No SSH required — useful for non-technical team members who need to check agent status.

SSH key management

Add or revoke SSH keys for team members through the web interface. Track which keys have been used and when. Every key change is recorded in the audit ledger.

File access

Browse your server's filesystem, download files, and view agent output without SSH. Useful for checking generated reports, logs, and data files your agent produces.

agentd: The Central Nervous System

agentd is the primary osModa daemon. It handles health monitoring, event logging, memory management (vector + keyword), and backups. It exposes endpoints that the dashboard, osmoda-watch, and external tools consume.

Check agentd status:

# Verify agentd is running
systemctl status agentd

# View recent agentd events
journalctl -u agentd --since "1 hour ago"

# agentd tracks:
# - Agent process health (alive, responsive, resource usage)
# - Events (start, stop, crash, recovery, deploy, rollback)
# - Memory operations (vector store writes, searches)
# - Backup status (last backup time, size, integrity)

agentd events flow into the audit ledger automatically. Every health check result, every memory operation, every backup — all recorded with timestamps and hash-chained for tamper evidence.

You can query agentd for current system status from the command line or integrate its endpoints into your own monitoring tools. The dashboard consumes the same data, so everything you see in the web UI is also available programmatically.
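As a rough illustration of consuming that data programmatically, the sketch below fetches a JSON health document and lists agents that are not healthy. The URL, the port, and the payload shape (`agents`, `name`, `status`) are all placeholders invented for this example; this guide does not specify agentd's actual endpoint paths or response fields, so adapt them to what your server returns.

```python
import json
import urllib.request

# ASSUMPTION: placeholder address. The real agentd host, port, and path
# are not documented in this guide.
AGENTD_HEALTH_URL = "http://127.0.0.1:8080/health"


def fetch_health(url: str = AGENTD_HEALTH_URL, timeout: float = 5.0) -> dict:
    """Fetch a JSON document from a health endpoint and decode it."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def unhealthy_agents(payload: dict) -> list[str]:
    """Return the names of agents whose status is not "healthy".

    ASSUMPTION: the payload looks like
    {"agents": [{"name": "...", "status": "..."}]}.
    """
    return [
        agent["name"]
        for agent in payload.get("agents", [])
        if agent.get("status") != "healthy"
    ]
```

Keeping the network call (`fetch_health`) separate from the pure check (`unhealthy_agents`) makes the check easy to reuse against saved responses or in your own monitoring scripts.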

osmoda-watch: How Crash Detection and Auto-Recovery Work

osmoda-watch is the Rust daemon responsible for process supervision, crash detection, and automatic recovery. It goes beyond basic systemd restart in three important ways.

1. Silent hang detection

By default, systemd reacts only when a process exits. Unless the service opts into systemd's watchdog protocol (WatchdogSec= with sd_notify keep-alive pings), it cannot detect a process that is alive but stuck: deadlocked, blocked on I/O, or spinning without making progress. osmoda-watch performs active health checks that verify actual responsiveness. If a process is alive but unresponsive, it is restarted.
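The difference between "alive" and "responsive" can be sketched in a few lines. This is not osmoda-watch's implementation (which is a Rust daemon whose internals are not described here); it is a minimal illustration that assumes the agent records a heartbeat timestamp somewhere the supervisor can read it. The function names and the 30-second threshold are hypothetical.

```python
import os


def process_exists(pid: int) -> bool:
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def classify(pid: int, last_heartbeat: float, now: float,
             max_silence: float = 30.0) -> str:
    """Classify a process as "dead", "hung", or "healthy".

    ASSUMPTION: last_heartbeat is the time (in seconds) of the agent's
    most recent sign of progress; max_silence is an illustrative threshold.
    """
    if not process_exists(pid):
        return "dead"
    if now - last_heartbeat > max_silence:
        return "hung"  # alive, but no progress for too long
    return "healthy"
```

A check based only on `process_exists` would report a deadlocked agent as fine; the heartbeat comparison is what catches the silent hang.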

2. Crash-loop backoff

If an agent crashes repeatedly in quick succession (crash loop), blindly restarting it wastes resources and can amplify the problem. osmoda-watch implements exponential backoff — it waits progressively longer between restarts, giving the system time to stabilize. It also logs the crash pattern so you can diagnose the root cause.
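Exponential backoff itself is simple to express. The sketch below shows the general technique only; the base delay, growth factor, and cap are illustrative values, not osmoda-watch's actual settings.

```python
def backoff_delay(attempt: int, base: float = 1.0, factor: float = 2.0,
                  cap: float = 300.0) -> float:
    """Seconds to wait before restart number `attempt` (1-based).

    The delay grows geometrically and is capped so that a long crash
    loop never waits longer than `cap` seconds between restarts.
    ASSUMPTION: base, factor, and cap are example values.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    try:
        delay = base * factor ** (attempt - 1)
    except OverflowError:
        delay = cap  # the power overflowed, so the cap certainly applies
    return min(cap, delay)
```

With these example defaults the first five restarts wait 1, 2, 4, 8, and 16 seconds, and every later attempt settles at the 300-second cap.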

3. SafeSwitch deployment tracking

When you deploy a new version of your agent (via nixos-rebuild switch), osmoda-watch tracks whether the new deployment is healthy. If the agent crashes immediately after a deploy, it can trigger an automatic rollback to the previous NixOS generation — reverting not just the agent code but the entire system configuration.
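The rollback decision can be pictured as a simple rule: if the agent crashes within a grace period after a deploy, treat the new generation as bad. The sketch below is an assumption-laden illustration of that rule, not SafeSwitch's real logic; the `Deployment` type, the 120-second grace period, and the crash threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    generation: int      # NixOS generation number of the new deploy
    deployed_at: float   # time of the switch, in seconds


def should_roll_back(deployment: Deployment, crash_times: list[float],
                     grace_period: float = 120.0, max_crashes: int = 1) -> bool:
    """Return True if enough crashes happened shortly after the deploy.

    ASSUMPTION: grace_period and max_crashes are illustrative values.
    Crashes outside the grace window are treated as unrelated to the deploy.
    """
    window_end = deployment.deployed_at + grace_period
    early_crashes = [
        t for t in crash_times
        if deployment.deployed_at <= t <= window_end
    ]
    return len(early_crashes) >= max_crashes
```

On NixOS, a manual rollback to the previous generation is done with `nixos-rebuild switch --rollback`; whether SafeSwitch invokes that command or uses another mechanism is not stated in this guide.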

Monitor osmoda-watch in real time:

# Follow osmoda-watch logs
journalctl -u osmoda-watch -f

# Example output during a crash recovery:
# [osmoda-watch] process my-agent (pid 4821) died, exit code 137 (OOM)
# [osmoda-watch] audit: crash logged, hash a3f8c2...
# [osmoda-watch] restarting my-agent (attempt 1)
# [osmoda-watch] process my-agent (pid 4856) started
# [osmoda-watch] health check passed after 3.2s
# [osmoda-watch] recovery complete, resuming normal monitoring

Every osmoda-watch event — detection, restart, health check, rollback — is recorded in the audit ledger. This creates a complete forensic trail of every failure and recovery.

The SHA-256 Hash-Chained Audit Ledger

Every action on an osModa server is recorded in a tamper-evident audit ledger. Each entry contains a timestamp, the action performed, the actor, and a SHA-256 hash that chains to the previous entry. If any entry is modified or deleted, the chain breaks — making tampering detectable.

Audit ledger entry structure:

{
  "sequence": 1847,
  "timestamp": "2026-03-10T14:23:01.442Z",
  "action": "process.crash",
  "actor": "osmoda-watch",
  "target": "my-agent",
  "details": {
    "exit_code": 137,
    "signal": "SIGKILL",
    "reason": "OOM",
    "memory_usage_mb": 3891,
    "uptime_seconds": 14423
  },
  "prev_hash": "a3f8c2e1...d4b7",
  "hash": "7c2d91f3...e8a1"
}
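To make the chaining concrete, here is a minimal sketch of how a hash-chained log can be built and verified. The field names follow the entry above, but the canonical serialization (sorted-key compact JSON), the all-zero genesis hash, and the helper names are assumptions for illustration; the ledger's real on-disk format is not documented here.

```python
import hashlib
import json

# ASSUMPTION: the first entry chains to an all-zero hash.
GENESIS_HASH = "0" * 64


def entry_hash(entry: dict) -> str:
    """SHA-256 over every field except "hash", in a canonical JSON form."""
    body = {key: value for key, value in entry.items() if key != "hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(ledger: list, timestamp: str, action: str, actor: str,
                 target: str, details: dict) -> dict:
    """Append an entry whose prev_hash points at the current last entry."""
    entry = {
        "sequence": len(ledger) + 1,
        "timestamp": timestamp,
        "action": action,
        "actor": actor,
        "target": target,
        "details": details,
        "prev_hash": ledger[-1]["hash"] if ledger else GENESIS_HASH,
    }
    entry["hash"] = entry_hash(entry)
    ledger.append(entry)
    return entry


def verify_chain(ledger: list) -> bool:
    """Recompute every hash and check each link to the previous entry."""
    expected_prev = GENESIS_HASH
    for entry in ledger:
        if entry["prev_hash"] != expected_prev:
            return False  # an entry was removed, reordered, or relinked
        if entry["hash"] != entry_hash(entry):
            return False  # the entry's content changed after it was written
        expected_prev = entry["hash"]
    return True
```

Because each hash covers `prev_hash`, changing one entry invalidates its own hash, and removing an entry from the middle breaks the link held by the entry after it.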

What gets logged:

Process events       → start, stop, crash, recovery, health check
Deployment events    → nixos-rebuild, SafeSwitch, rollback, generation change
Security events      → SSH login, key change, trust tier modification
Memory operations    → vector store writes, keyword index updates
MCP events           → server start, stop, tool registration, connection
Routine execution    → scheduled task runs, event triggers, failures
File mutations       → create, modify, delete (tracked by agentd)
Network events       → mesh connect, disconnect, peer discovery

The audit ledger is essential for compliance (SOC 2, HIPAA) and for post-incident forensics. When something goes wrong, the ledger tells you exactly what happened, when, and in what order — with cryptographic proof that the log has not been tampered with. See the Agent Security guide for how the ledger fits into the broader security model.

Setting Up Alerts via Telegram, Discord, and Slack

Monitoring is useless if nobody sees the alert. osModa supports real-time alerting through the same channels your agent uses for communication. Configure alerts through the dashboard to receive notifications when critical events occur.

The osModa dashboard supports multi-channel connectivity: Telegram, WhatsApp, Discord, Slack, and web chat. Any of these channels can be configured to receive operational alerts alongside regular agent communication.

Telegram alerts

Connect your Telegram account or group through the dashboard. Receive instant notifications for crashes, recoveries, deployment events, and resource warnings. Respond to alerts directly from Telegram — manage your agent from the same chat interface.

Discord and Slack

Route alerts to a dedicated ops channel in your Discord server or Slack workspace. Team members see the same alerts simultaneously. Useful for teams with on-call rotations.

Alert types

Critical: agent crash, OOM kill, deployment failure, audit chain integrity error.
Warning: high memory usage, disk space low, crash-loop backoff activated, health check degraded.
Info: successful deployment, routine task completed, mesh peer connected.
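One way to reason about these levels is as a mapping from event to severity, plus a minimum severity per channel. The sketch below is only an illustration: alerts are configured in the dashboard, and the event identifiers, channel names, and the per-channel threshold idea are assumptions made for this example.

```python
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}

# ASSUMPTION: event identifiers are invented here; the severity groups
# mirror the alert types listed above.
SEVERITY_BY_EVENT = {
    "agent_crash": "critical",
    "oom_kill": "critical",
    "deployment_failure": "critical",
    "audit_chain_integrity_error": "critical",
    "high_memory_usage": "warning",
    "disk_space_low": "warning",
    "crash_loop_backoff": "warning",
    "health_check_degraded": "warning",
    "deployment_succeeded": "info",
    "routine_task_completed": "info",
    "mesh_peer_connected": "info",
}


def severity_of(event: str) -> str:
    """Look up an event's severity; unknown events default to "warning"
    so that they are surfaced rather than silently dropped."""
    return SEVERITY_BY_EVENT.get(event, "warning")


def channels_for(event: str, min_severity_by_channel: dict) -> list[str]:
    """Return the channels (sorted by name) that should receive this event."""
    rank = SEVERITY_RANK[severity_of(event)]
    return sorted(
        channel
        for channel, minimum in min_severity_by_channel.items()
        if rank >= SEVERITY_RANK[minimum]
    )
```

For example, with `{"telegram": "critical", "slack": "info"}`, an OOM kill would go to both channels while a mesh peer connection would go only to Slack.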

Command-Line Monitoring via SSH

For engineers who prefer the terminal, everything is accessible via SSH. Here are the essential monitoring commands for your osModa server.

Essential monitoring commands:

# Check the status of all osModa daemons at once
systemctl status agentd osmoda-watch osmoda-mesh osmoda-mcpd

# Follow your agent logs
journalctl -u my-agent.service -f

# Follow osmoda-watch for crash/recovery events
journalctl -u osmoda-watch -f

# Check resource usage
free -h                    # Memory
df -h                      # Disk
top -bn1 | head -20        # CPU and process overview

# Check NixOS generation (rollback history)
nixos-rebuild list-generations

# View recent audit ledger entries
journalctl -u agentd --since "1 hour ago" | grep audit
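If you want structured output instead of grepping text, journalctl can emit one JSON object per line with `-o json`, where the log text is in the standard `MESSAGE` field. The sketch below filters those records the same way the grep above does. It assumes, as that command does, that audit-related lines contain the word "audit"; the function name is hypothetical.

```python
import json


def audit_messages(journal_lines) -> list[str]:
    """Extract MESSAGE text containing "audit" from `journalctl -o json` lines.

    Each non-empty line is expected to be one JSON object, which is how
    journalctl's json output mode formats entries.
    """
    messages = []
    for line in journal_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        message = record.get("MESSAGE")
        # MESSAGE can be a non-string when the payload is not valid UTF-8.
        if isinstance(message, str) and "audit" in message:
            messages.append(message)
    return messages
```

To use it, pipe `journalctl -u agentd --since "1 hour ago" -o json` into a script that passes `sys.stdin` to `audit_messages`.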

Combine SSH monitoring with the dashboard for full observability. Use SSH for deep debugging and the dashboard for at-a-glance status checks and alerting configuration.

Frequently Asked Questions

What is the difference between osmoda-watch and systemd restart?

A plain systemd restart policy is binary: the process exited, so restart it. osmoda-watch adds health-aware monitoring: it detects silent hangs (process alive but unresponsive), implements crash-loop backoff, performs health checks after restart, and logs every event to the tamper-evident audit ledger. It also handles SafeSwitch atomic deployments and auto-rollback.

Can I send alerts to multiple channels simultaneously?

Yes. You can configure alerts to go to Telegram, Discord, Slack, and WhatsApp simultaneously through the osModa dashboard. Each channel receives the same alert data — crash notifications, recovery confirmations, and health status changes.

How far back does the audit ledger go?

The audit ledger retains all entries for the lifetime of your server. Every entry is SHA-256 hash-chained, making it tamper-evident. You cannot delete or modify past entries without breaking the chain. This makes the ledger suitable for SOC 2 and HIPAA compliance evidence.

Can I access monitoring data via API?

Yes. agentd exposes health and event endpoints that return JSON data. You can integrate these with your own monitoring stack, build custom dashboards, or feed data into third-party observability tools. The dashboard is built on the same API endpoints.

Does monitoring add overhead to my agent?

Minimal. agentd and osmoda-watch are daemons optimized for low resource usage. Health checks are lightweight status queries, not heavy profiling. The audit ledger is append-only with sequential writes. Total overhead is typically under 1% of CPU and less than 50 MB of RAM.

How quickly does osmoda-watch detect a crash?

osmoda-watch detects process death within seconds. For silent hangs (processes that are alive but stuck), detection time depends on the health check interval, because a hang is only noticed when a check fails. Health checks verify actual responsiveness, not just process existence.

Never Miss an Agent Failure Again

osModa servers come with agentd, osmoda-watch, and the audit ledger pre-configured. Set up alerting in the dashboard and your agents are monitored from day one. From $29/month.
