SUS - Simple Universal Scraper
Async documentation scraper for converting websites to Markdown format. Built with Python 3.12+, httpx, and asyncio.
What is SUS?
SUS (Simple Universal Scraper) is a config-driven web scraper that converts documentation websites to Markdown and preserves their assets. Crawling is controlled through YAML configuration files, which set URL pattern matching (regex, glob, or prefix), token bucket rate limiting, and two levels of concurrency limits (global and per-domain).
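To make the three pattern styles concrete, here is a minimal sketch of how regex, glob, and prefix rules could be evaluated against a URL path. It uses only the standard library. The `PathRule` class, the `should_crawl` function, and the include/exclude semantics are illustrative assumptions, not SUS's actual schema or API.

```python
import fnmatch
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PathRule:
    """Hypothetical rule type; the real SUS config schema may differ."""

    kind: str     # "regex", "glob", or "prefix"
    pattern: str

    def matches(self, path: str) -> bool:
        if self.kind == "regex":
            # Match anywhere in the path.
            return re.search(self.pattern, path) is not None
        if self.kind == "glob":
            # Shell-style wildcards; note that "*" also matches "/".
            return fnmatch.fnmatchcase(path, self.pattern)
        if self.kind == "prefix":
            return path.startswith(self.pattern)
        raise ValueError(f"unknown rule kind: {self.kind!r}")


def should_crawl(path: str, include: list[PathRule], exclude: list[PathRule]) -> bool:
    """Assumed semantics: exclude rules win, then any include rule admits the path."""
    if any(rule.matches(path) for rule in exclude):
        return False
    return any(rule.matches(path) for rule in include)
```

With `include = [PathRule("prefix", "/doc/")]` and `exclude = [PathRule("regex", r"/v\d+\.\d+/")]`, the path `/doc/api/index.html` would be crawled and `/doc/v1.2/old.html` would be skipped.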
Key features:
- httpx async HTTP client with asyncio for concurrent page fetching
- Pydantic 2.9+ validated YAML configuration files
- Token bucket rate limiting (configurable req/s with burst capacity)
- Dual concurrency: global (10) + per-domain (2) connection limits
- markdownify-based HTML → Markdown with YAML frontmatter
- Link rewriting to relative paths calculated by directory depth
- Concurrent asset downloads (images, CSS, JS) with SHA-256 deduplication
- Rich terminal UI with real-time crawl statistics and progress tracking
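The rate limiting and dual concurrency features listed above can be sketched with asyncio primitives alone. This is a simplified illustration under stated assumptions, not the SUS implementation: the class names `TokenBucket` and `ConcurrencyLimiter`, their parameters, and the choice to take the per-domain slot before the global slot are all hypothetical.

```python
import asyncio
import time
from contextlib import asynccontextmanager
from urllib.parse import urlsplit


class TokenBucket:
    """Permit `rate` acquisitions per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)          # start with a full bucket
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Refill in proportion to elapsed time, capped at capacity.
                elapsed = now - self.updated
                self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
                self.updated = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
                # Sleep roughly until one whole token is available, then recheck.
                await asyncio.sleep((1.0 - self.tokens) / self.rate)


class ConcurrencyLimiter:
    """Hypothetical two-level limiter: one global cap plus one cap per host."""

    def __init__(self, global_limit: int = 10, per_domain_limit: int = 2) -> None:
        self._global = asyncio.Semaphore(global_limit)
        self._per_domain_limit = per_domain_limit
        self._domains: dict[str, asyncio.Semaphore] = {}

    @asynccontextmanager
    async def slot(self, url: str):
        host = urlsplit(url).netloc
        if host not in self._domains:
            self._domains[host] = asyncio.Semaphore(self._per_domain_limit)
        # Take the per-domain slot first so a request waiting on a busy
        # domain does not hold one of the global slots.
        async with self._domains[host], self._global:
            yield
```

A fetch coroutine would call `await bucket.acquire()` and then wrap the HTTP request in `async with limiter.slot(url):`, so that both the request rate and the number of in-flight connections stay bounded.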
Quick Start
Installation
# Clone the repository
git clone <repo-url>
cd sus
# Install dependencies with uv
uv sync
# Verify installation
uv run sus --version
Your First Scrape
# Scrape with example config (limit to 10 pages for testing)
uv run sus scrape --config examples/aptly.yaml --max-pages 10
# Full scrape (no page limit)
uv run sus scrape --config examples/aptly.yaml
Create Your Own Config
# Interactive configuration wizard
uv run sus init my-config.yaml
# Validate your config
uv run sus validate my-config.yaml
# Run the scraper
uv run sus scrape --config my-config.yaml
Documentation Structure
This documentation is organized into three main sections:
User Guide
- Configuration Guide - Learn how to configure scrapers with YAML files
- CLI Reference - Command-line interface documentation
- Crawler Guide - Understanding the crawling engine
API Reference
Complete API documentation auto-generated from source code docstrings. See the API Overview for a full module listing.
Development
For contributors and developers:
- Architecture - System design and implementation phases
- Contributing - How to contribute to the project
- Testing - Running tests and type checking
Use Cases
- Offline documentation mirrors with relative links and preserved assets
- Documentation archival for compliance and auditing
- Legacy HTML documentation conversion to Markdown format
- Custom documentation processing pipelines with configurable output structure
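For the offline mirror use case, the two mechanisms that matter most are relative links and deduplicated assets. The sketch below shows one way each could work, using only the standard library. The names `relative_link` and `AssetStore` are hypothetical, and paths are assumed to be POSIX-style and relative to the output root; SUS's own implementation may differ.

```python
import hashlib
import posixpath


def relative_link(from_page: str, to_target: str) -> str:
    """Return the link to `to_target` as written inside `from_page`.

    Both arguments are paths relative to the output root. The number of
    "../" segments follows from how deep `from_page` sits in the tree.
    """
    start = posixpath.dirname(from_page) or "."
    return posixpath.relpath(to_target, start=start)


class AssetStore:
    """Hypothetical registry that keeps one stored path per distinct content."""

    def __init__(self) -> None:
        self._path_by_digest: dict[str, str] = {}

    def register(self, content: bytes, suggested_path: str) -> str:
        """Return the path to use for `content`.

        The first path seen for a given SHA-256 digest is kept; later
        assets with identical bytes reuse it instead of being saved again.
        """
        digest = hashlib.sha256(content).hexdigest()
        return self._path_by_digest.setdefault(digest, suggested_path)
```

A page saved at `guide/install/index.md` that references `assets/logo.png` would get the link `../../assets/logo.png`, while a page at the output root would link to `assets/logo.png` directly.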
Requirements
- Python 3.12 or higher
- uv (recommended) or pip
License
This project is currently unlicensed. Please contact the maintainer for licensing information.
Core Dependencies
- httpx 0.28+ - Async HTTP client with HTTP/2 support, used for page fetching
- Pydantic 2.9+ - YAML config validation with type coercion
- Typer 0.15+ - CLI argument parsing and command routing
- Rich 14+ - Terminal progress bars and formatted output
- markdownify 0.14+ - HTML to Markdown conversion
- lxml 5.3+ - Fast HTML parsing for link extraction