API Reference
Complete API documentation auto-generated from source code docstrings.
All modules are fully typed, pass mypy --strict, and use Google-style docstrings.
Core Modules
Configuration Layer
- config - Configuration system with Pydantic models and YAML loading
- exceptions - Custom exception hierarchy
Crawling & URL Handling Layer
- crawler - Async web crawler with rate limiting and concurrency control
- rules - URL filtering, normalization, and link extraction
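As a rough illustration of what URL normalization and filtering involve, the sketch below uses only the standard library. The names `normalize_url` and `is_allowed`, and the glob-style include patterns, are assumptions made for this example and are not taken from the rules module.

```python
from fnmatch import fnmatch
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_url(base: str, href: str) -> str:
    """Resolve href against base and return a canonical absolute URL."""
    parts = urlsplit(urljoin(base, href))
    # Lowercase the host, drop the fragment, and default an empty path to "/".
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def is_allowed(url: str, allowed_host: str, include_patterns: list[str]) -> bool:
    """Keep URLs on the allowed host whose path matches any glob pattern."""
    parts = urlsplit(url)
    if parts.netloc != allowed_host:
        return False
    return any(fnmatch(parts.path, pattern) for pattern in include_patterns)


# Hypothetical usage: resolve a relative link, then check it against the filter.
link = normalize_url("https://Example.com/docs/", "../api/index.html#top")
keep = is_allowed(link, "example.com", ["/api/*"])
```

Normalizing before filtering means that two spellings of the same page (different host casing, a trailing fragment) are treated as one URL, so the crawler does not visit it twice.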
Content Processing Layer
- converter - HTML to Markdown conversion with frontmatter
- outputs - File path mapping and link rewriting
- assets - Concurrent asset downloading
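To make the file path mapping and frontmatter steps concrete, here is a minimal standard-library sketch. The function names, the `output` directory, and the `title`/`source` frontmatter keys are assumptions for illustration. The actual converter and outputs modules may behave differently.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def url_to_output_path(url: str, output_dir: str = "output") -> PurePosixPath:
    """Map a page URL to a Markdown file path under output_dir."""
    path = urlsplit(url).path
    # Treat directory-style URLs ("/docs/") as "/docs/index".
    if path == "" or path.endswith("/"):
        path += "index"
    relative = PurePosixPath(path.lstrip("/"))
    if relative.suffix in {".html", ".htm"}:
        # Swap an HTML extension for ".md".
        relative = relative.with_suffix(".md")
    else:
        # Otherwise append ".md" so names like "v1.2" are not truncated.
        relative = relative.with_name(relative.name + ".md")
    return PurePosixPath(output_dir) / relative


def with_frontmatter(markdown: str, title: str, source_url: str) -> str:
    """Prepend a small YAML frontmatter block to converted Markdown."""
    # JSON string quoting is used here because YAML accepts it as a double-quoted scalar.
    lines = [
        "---",
        f"title: {json.dumps(title)}",
        f"source: {json.dumps(source_url)}",
        "---",
    ]
    return "\n".join(lines) + "\n\n" + markdown
```

A deterministic URL-to-path mapping is also what makes link rewriting possible: a link to another crawled page can be replaced by the relative path between the two output files.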
Orchestration Layer
Utilities
- utils - Shared utility functions
Usage Patterns
All modules follow consistent patterns:
- Type Safety: Full type hints with mypy --strict compliance
- Async Support: Async/await patterns using httpx and asyncio
- Documentation: Google-style docstrings with examples and type annotations
- Pydantic Validation: Configuration is validated by Pydantic models, and failures are reported as Pydantic error messages
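The patterns above might look roughly like this in practice: a fully annotated async function with a Google-style docstring and a cap on concurrency. `fetch_all` and its `fetch` parameter are hypothetical names for this sketch. A real fetcher would presumably wrap an httpx call, which is left out here so the example needs only the standard library.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> dict[str, str]:
    """Fetch several URLs with a cap on simultaneous requests.

    Args:
        urls: Absolute URLs to fetch.
        fetch: Coroutine function that returns the body for one URL.
        max_concurrency: Maximum number of fetches in flight at once.

    Returns:
        Mapping of each URL to its fetched body.

    Example:
        >>> async def fake_fetch(url: str) -> str:
        ...     return url.upper()
        >>> asyncio.run(fetch_all(["a", "b"], fake_fetch))
        {'a': 'A', 'b': 'B'}
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        # The semaphore blocks here once max_concurrency fetches are running.
        async with semaphore:
            return await fetch(url)

    bodies = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(zip(urls, bodies))
```

Passing the fetcher in as a parameter keeps the concurrency logic testable without network access.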
Getting Started with the API
For most use cases, you'll interact with:
- config.load_config() - Load YAML configuration
- Crawler - Async web crawler
- run_scraper() - Main orchestration function
Architecture Overview
SUS implements a six-phase pipeline architecture:
- Configuration System (config.py) - Pydantic 2.9+ models with YAML validation
- Crawler Engine (crawler.py) - httpx async client with token bucket rate limiter
- URL Filtering (rules.py) - lxml-based link extraction with pattern matching
- Content Conversion (converter.py) - markdownify HTML parser with frontmatter
- CLI Interface (cli.py) - Typer commands with Rich progress bars
- Testing (tests/) - 139+ pytest tests with mypy --strict compliance
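For readers unfamiliar with the token bucket technique mentioned for the crawler engine, the following is a generic asyncio sketch of the idea. The `TokenBucket` class, its parameters, and its locking strategy are assumptions for illustration and do not describe the limiter in crawler.py.

```python
import asyncio
import time


class TokenBucket:
    """Allow about `rate` acquisitions per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last update, capped at capacity.
        now = time.monotonic()
        self._tokens = min(self.capacity, self._tokens + (now - self._updated) * self.rate)
        self._updated = now

    async def acquire(self) -> None:
        """Wait until one token is available, then consume it."""
        async with self._lock:
            self._refill()
            while self._tokens < 1:
                # Sleep just long enough for the missing fraction of a token to accrue.
                await asyncio.sleep((1 - self._tokens) / self.rate)
                self._refill()
            self._tokens -= 1


async def demo() -> float:
    # Create the bucket inside the running event loop.
    bucket = TokenBucket(rate=50, capacity=2)
    start = time.monotonic()
    for _ in range(4):
        await bucket.acquire()  # a request would be sent after each acquire
    return time.monotonic() - start
```

In this sketch the first two acquisitions use the initial burst, and the remaining two each wait for a token to accrue. Holding the lock while sleeping serializes waiting callers, which keeps the logic simple at the cost of some fairness tuning.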
See individual module documentation for detailed API references.