Template — architecture pattern, not a starter kit
1. Cron-scheduled scrape

osmoda-routines triggers scraping jobs on your defined schedule.

2. Parse and store

Extract structured data and persist it to a database or the filesystem.

3. Alert on new data

Sends a Telegram notification when new or changed data is detected.


Web Scraper Agent Template

This template describes the architecture for a scheduled web scraping agent on osModa. The agent runs on a cron schedule via osmoda-routines, scrapes target websites through the osmoda-egress domain allowlisting proxy, parses and stores extracted data, and sends Telegram alerts when new or changed data is detected. osmoda-watch ensures the scraper recovers from crashes automatically.

This is an architecture pattern, not a downloadable scraping tool. It describes which osModa daemons your scraper would use, how data flows from source websites through parsing to storage and alerting, and how to handle untrusted content safely with Tier 2 trust. You bring your own scraping library (Puppeteer, Playwright, Scrapy, or raw HTTP requests) and build the agent following this pattern on your osModa server.

TL;DR

  • Cron-scheduled scraping via osmoda-routines -- define any schedule (hourly, daily, weekly)
  • Domain allowlisting via osmoda-egress -- agent can only reach pre-approved domains
  • Crash recovery via osmoda-watch -- scraper restarts automatically, resumes from checkpoint
  • Tier 2 trust for safe processing of untrusted external content
  • Telegram alerts when new or changed data is detected
  • Recommended plan: Solo ($14.99/mo)

Architecture Diagram

The data flow for a scheduled web scraper agent on osModa.

┌──────────────────────────────────────────┐
│         osmoda-routines (CRON)           │
│  triggers scrape on defined schedule     │
└──────────────────┬───────────────────────┘
                   ▼
┌──────────────────────────────────────────┐
│         SCRAPER PROCESS                  │
│  (your agent code)                       │
│  supervised by osmoda-watch              │
│  runs at Tier 2 trust                    │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│         osmoda-egress                    │
│  domain allowlisting proxy               │
│  only pre-approved domains pass          │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│         TARGET WEBSITES                  │
│  fetch HTML/JSON/API responses           │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│         PARSER                           │
│  extract structured data                 │
│  compare with previous results           │
└──────────┬───────────────┬───────────────┘
           │               │
           ▼               ▼
┌────────────────┐  ┌─────────────────────┐
│  DATABASE /    │  │  DIFF DETECTOR      │
│  FILESYSTEM    │  │  new data found?    │
│  store results │  └─────────┬───────────┘
└────────────────┘            │
                        ┌─────┴─────┐
                        ▼           ▼
                  ┌──────────┐ ┌──────────┐
                  │ NEW DATA │ │ NO DIFF  │
                  │ alert    │ │ skip     │
                  │ Telegram │ │          │
                  └──────────┘ └──────────┘

┌──────────────────────────────────────────┐
│  AUDIT LEDGER (agentd)                   │
│  logs every scrape run, results, errors  │
└──────────────────────────────────────────┘

Components

The building blocks of this scraper architecture.

Cron Scheduler

osmoda-routines triggers the scraper on a cron schedule you define. Supports standard cron expressions. Failed runs are logged and the next scheduled run proceeds normally.

Domain Allowlist Proxy

osmoda-egress ensures the scraper only reaches pre-approved domains. Blocks requests to unauthorized URLs. Prevents exploitation via redirects or embedded content that references untrusted resources.

Content Parser

Your code that extracts structured data from raw HTML, JSON, or API responses. Runs at Tier 2 trust to sandbox untrusted content processing. Compares results with previous scrapes to detect changes.

Data Store

Stores scraped results in a database (SQLite, PostgreSQL) or filesystem. Previous results are kept for diff comparison. The persistent filesystem survives process restarts and server reboots.
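A minimal sketch of the store-and-diff step, using SQLite from Python's standard library. The table layout, key scheme, and item shape here are illustrative assumptions, not part of osModa -- adapt them to whatever your parser extracts:

```python
import hashlib
import json
import sqlite3

def init_db(path: str = "scrapes.db") -> sqlite3.Connection:
    """Open the results store, creating the table on first run."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, hash TEXT, data TEXT)"
    )
    return conn

def upsert_and_diff(conn: sqlite3.Connection, items: dict[str, dict]) -> list[str]:
    """Store the latest scrape and return the keys that are new or changed."""
    changed = []
    for key, item in items.items():
        payload = json.dumps(item, sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        row = conn.execute("SELECT hash FROM items WHERE key = ?", (key,)).fetchone()
        if row is None or row[0] != digest:
            changed.append(key)
            conn.execute(
                "INSERT OR REPLACE INTO items (key, hash, data) VALUES (?, ?, ?)",
                (key, digest, payload),
            )
    conn.commit()
    return changed
```

Hashing a canonical JSON serialization keeps the diff check cheap: one indexed lookup per item, regardless of how large the item payload is.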

Alert System

When the diff detector finds new or changed data, sends a notification via Telegram with the key changes. Can also alert via Slack, Discord, or email depending on your configuration.

Crash Recovery

osmoda-watch supervises the scraper process. If it crashes mid-scrape, the watchdog restarts it. Checkpoint-based recovery lets the scraper resume from where it left off instead of starting over.

osModa Features Used

The specific daemons and platform capabilities this template relies on.

R

osmoda-routines

Cron scheduler and event-driven task runner. Triggers scraping jobs on your defined schedule. Handles job lifecycle, failure logging, and scheduling conflicts.

E

osmoda-egress

Domain allowlisting proxy. Only allows outbound requests to domains you have explicitly approved. Essential for scraping agents that process untrusted content.

W

osmoda-watch

Process supervision with auto-restart. If the scraper crashes due to unexpected HTML structure, network timeouts, or memory issues, osmoda-watch restarts it.

T

Tier 2 Trust

Additional sandboxing for processing untrusted content. Web pages can contain malicious payloads or injection attempts. Tier 2 trust limits what the content parser can do with the data, reducing security risk.

Step-by-Step Setup

How to implement this architecture pattern on your osModa server.

  1. Spawn a Solo server and SSH in


    Go to spawn.os.moda and create a Solo server ($14.99/mo). SSH in with your key. All 9 daemons including osmoda-routines and osmoda-egress are already running.

  2. Configure the domain allowlist

    Add the target domains you want to scrape to the osmoda-egress allowlist. Only requests to these domains will be permitted. Block everything else.

  3. Build the scraper and parser

    Write your scraping code using Puppeteer, Playwright, Scrapy, or raw HTTP. Build the parser to extract structured data from the responses. Implement diff detection to compare with previous results.
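As one example of the parsing half of this step, here is a stand-alone extractor built on Python's stdlib html.parser. It pulls (text, href) pairs from anchor tags -- a placeholder for whatever structured fields your actual target pages contain:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect (text, href) pairs from anchor tags -- a stand-in for
    whatever structured extraction your target pages need."""

    def __init__(self):
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._href: str | None = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

def extract_links(html: str) -> list[tuple[str, str]]:
    """Parse an HTML document and return all extracted links."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links
```

For JavaScript-rendered pages you would swap the fetch side for Playwright or Puppeteer, but the parse-then-diff structure stays the same.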

  4. Register the scraper with osmoda-watch

    Register the scraper process with osmoda-watch for crash recovery. Configure restart policies (immediate restart or exponential backoff).

  5. Schedule the cron job via osmoda-routines

    Define your scraping schedule using a cron expression. osmoda-routines will trigger the scraper at the specified times and log each run.

  6. Connect Telegram for alerts

    Configure a Telegram bot token and chat ID. When the diff detector finds new data, the agent sends a formatted notification with the changes.
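A sketch of the alerting step using the Telegram Bot API's sendMessage endpoint (a real, documented endpoint); the message format and truncation limit are choices of this example, not an osModa convention:

```python
import json
import urllib.parse
import urllib.request

def build_alert(changes: list[str], limit: int = 10) -> str:
    """Format a compact alert message listing the changed item keys."""
    shown = changes[:limit]
    lines = [f"Scraper found {len(changes)} new/changed item(s):"]
    lines += [f"- {key}" for key in shown]
    if len(changes) > limit:
        lines.append(f"...and {len(changes) - limit} more")
    return "\n".join(lines)

def send_telegram(token: str, chat_id: str, text: str) -> dict:
    """POST the alert to the Telegram Bot API sendMessage endpoint."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    with urllib.request.urlopen(url, data=data, timeout=10) as resp:
        return json.loads(resp.read())
```

Remember that api.telegram.org must itself be on the osmoda-egress allowlist, or the alert request will be blocked along with everything else.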

Recommended Plan

Web scraping is primarily I/O-bound. Most time is spent waiting for HTTP responses, not processing data. Solo handles the vast majority of scraping workloads.

Solo — $14.99/mo

2 CPU · 4 GB RAM · 40 GB disk

Sufficient for most scraping workloads including daily multi-site scrapes, data diffing, and Telegram alerting. If running headless browsers for JavaScript-rendered pages, consider Pro ($34.99/mo) for extra memory.

Frequently Asked Questions

Is this template a downloadable scraping tool?

No. This is an architecture pattern describing how to design a scheduled web scraper agent on osModa. It outlines the data flow, the daemons involved (osmoda-routines for scheduling, osmoda-egress for domain allowlisting, osmoda-watch for crash recovery), and the recommended plan. You write the scraping code yourself using any library (Puppeteer, Playwright, Scrapy, etc.) and deploy it on your osModa server following this pattern.

How does domain allowlisting work with osmoda-egress?

osmoda-egress is a proxy daemon that controls which external domains your agent can reach. You define an allowlist of domains the scraper is permitted to access. Any request to a domain not on the list is blocked. This prevents your agent from being exploited to access unauthorized resources, which is especially important when processing untrusted content that might contain redirects or embedded references to other domains.
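If osmoda-egress is exposed to your agent as an ordinary HTTP proxy (an assumption of this sketch -- the address and port below are placeholders, so check your server's egress configuration), routing traffic through it from Python needs only the standard library:

```python
import urllib.request

# Assumption: osmoda-egress listens as a local HTTP proxy. The address
# and port are placeholders, not a documented osModa default.
EGRESS_PROXY = "http://127.0.0.1:3128"

def make_opener(proxy: str = EGRESS_PROXY) -> urllib.request.OpenerDirector:
    """Build an opener that sends all HTTP(S) traffic through the proxy,
    so the egress daemon can enforce the domain allowlist."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)
```

Forcing every request through one opener (rather than calling urlopen directly) means a redirect or embedded reference to an unapproved domain fails at the proxy instead of silently succeeding.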

What is Tier 2 trust and why does this template use it?

osModa's trust model has three tiers: Tier 0 (unrestricted), Tier 1 (standard), and Tier 2 (restricted). Tier 2 is designed for handling untrusted content — data from external websites that could contain malicious payloads, injection attempts, or unexpected formats. Running your scraper at Tier 2 trust applies additional sandboxing to content processing, reducing the risk of scraped content affecting the agent's behavior or compromising the server.

How does the cron scheduling work for scraping jobs?

osmoda-routines supports standard cron expressions and event-driven triggers. You define a cron schedule (e.g., every 6 hours, daily at midnight, every Monday) and osmoda-routines executes your scraping job at the specified times. If a job fails, the failure is logged to the audit ledger and the next scheduled run proceeds normally. You can also trigger scrapes manually or based on external events.

What happens if the scraper crashes mid-run?

osmoda-watch detects the crash and restarts the scraper process. For long-running scrapes that process many pages, you can implement checkpoint-based recovery: the scraper saves its progress (last page processed, items collected so far) to disk, and on restart, resumes from the last checkpoint instead of starting over. The crash and restart are logged to the audit ledger.
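The checkpoint pattern described above can be sketched in a few lines; the file name and state shape are examples, not an osModa convention. The important detail is the atomic write, so a crash mid-save can never leave a half-written checkpoint:

```python
import json
import os
import tempfile

CHECKPOINT = "scrape_checkpoint.json"  # example path, choose your own

def save_checkpoint(state: dict, path: str = CHECKPOINT) -> None:
    """Atomically persist progress: write to a temp file, then rename."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def load_checkpoint(path: str = CHECKPOINT) -> dict:
    """Return saved progress, or a fresh state if this is the first run."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"last_page": 0, "items": []}
```

On restart by osmoda-watch, the scraper calls load_checkpoint first and resumes from last_page instead of page zero.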

What plan is recommended for a web scraper agent?

Solo ($14.99/mo, 2 CPU, 4 GB RAM, 40 GB disk) is sufficient for most scraping workloads. Web scraping is primarily I/O-bound (waiting for HTTP responses), so it does not require heavy compute. If you are running headless browsers (Puppeteer/Playwright) for JavaScript-rendered pages, or scraping at very high volume, consider Pro ($34.99/mo) for the additional memory.

Build Your Scraper on osModa

Spawn a dedicated server with osmoda-routines for scheduling, osmoda-egress for domain control, and osmoda-watch for crash recovery. From $14.99/month.

Last updated: March 2026