Scraper¶

scraper ¶

Scraper pipeline orchestration.

Orchestrates the complete scraping workflow: it coordinates the crawler, converter, output manager, and asset downloader, and reports progress through a Rich progress display. Main entry point: run_scraper().

Classes¶

ScraperStats ¶

Bases: TypedDict

Type definition for scraper statistics dictionary.
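The fields of ScraperStats are not listed on this page. As a rough sketch, assuming they match the keys documented for the run_scraper return value, a TypedDict of this shape could look as follows. The class name `ScraperStatsSketch`, the helper `empty_stats`, and all field types are assumptions for illustration, not the library's definition.

```python
from typing import TypedDict


class ScraperStatsSketch(TypedDict):
    """Hypothetical mirror of ScraperStats; key names are taken from the
    run_scraper return value, the value types are assumed."""

    pages_crawled: int
    pages_failed: int
    assets_downloaded: int
    assets_failed: int
    total_bytes: int
    execution_time: float
    errors: dict[str, int]
    files: list[str]


def empty_stats() -> ScraperStatsSketch:
    """Return a zeroed statistics dictionary (illustrative helper)."""
    return ScraperStatsSketch(
        pages_crawled=0,
        pages_failed=0,
        assets_downloaded=0,
        assets_failed=0,
        total_bytes=0,
        execution_time=0.0,
        errors={},
        files=[],
    )
```

A TypedDict is a plain dict at runtime, so values are read with ordinary key access such as `empty_stats()["pages_crawled"]`; the type information only helps static checkers.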

PagePreview `dataclass` ¶

PagePreview(url: str, output_path: str, title: str | None = None)

Preview information for a single page.

AssetPreview `dataclass` ¶

AssetPreview(url: str, output_path: str, asset_type: str)

Preview information for a single asset.

PreviewReport `dataclass` ¶

PreviewReport(pages: list[PagePreview] = list(), assets: list[AssetPreview] = list(), total_pages: int = 0, total_assets: int = 0, estimated_bytes: int = 0, config_name: str = '')

Complete preview report produced by a dry run with the --preview flag.

Contains all pages and assets that would be scraped, plus statistics. Exported as JSON for inspection before actual scraping.

Functions¶

run_scraper `async` ¶

run_scraper(config: SusConfig, dry_run: bool = False, max_pages: int | None = None, preview: bool = False) -> dict[str, Any]

Run the complete scraping pipeline.

Parameters:

Name	Type	Description	Default
`config`	`SusConfig`	Validated SusConfig instance	required
`dry_run`	`bool`	If True, don't write files to disk	`False`
`max_pages`	`int \| None`	Maximum number of pages to crawl (None = unlimited)	`None`
`preview`	`bool`	If True, return summary without writing files	`False`

Returns:

Type	Description
`dict[str, Any]`	Dictionary of scraping statistics with the keys listed below

Key	Description
`pages_crawled`	Number of pages successfully crawled
`pages_failed`	Number of pages that failed
`assets_downloaded`	Number of assets downloaded
`assets_failed`	Number of assets that failed
`total_bytes`	Total bytes downloaded
`execution_time`	Time taken in seconds
`errors`	Dict of error types and their occurrences
`files`	List of file paths that were written
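Because the statistics come back as a plain dictionary, a caller can post-process them directly. The sketch below shows one possible use: deriving a page failure rate and a process exit code from the `pages_crawled` and `pages_failed` keys. The helper names and the 10% threshold are illustrative assumptions, not part of the scraper module.

```python
from typing import Any


def failure_rate(stats: dict[str, Any]) -> float:
    """Fraction of attempted pages that failed (0.0 if nothing was attempted)."""
    attempted = stats["pages_crawled"] + stats["pages_failed"]
    return stats["pages_failed"] / attempted if attempted else 0.0


def exit_code(stats: dict[str, Any], max_failure_rate: float = 0.1) -> int:
    """Return 1 when the failure rate exceeds the (assumed) threshold, else 0."""
    return 1 if failure_rate(stats) > max_failure_rate else 0
```

In a script, the dictionary returned by `await run_scraper(config)` could be passed to `exit_code` and the result handed to `sys.exit`.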

Workflow

1. Initialize components (crawler, converter, output manager, asset downloader).
2. Set up the Rich progress display with two progress bars.
3. Iterate over crawler results:
    - Convert HTML to Markdown
    - Rewrite links to relative paths
    - Save the Markdown file (skipped if dry_run or preview is set)
    - Download assets for the page
    - Update progress bars
4. Display the final summary with statistics and errors.
5. Return the summary dict for programmatic access.
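The shape of this workflow can be sketched with stand-in components. Everything below is hypothetical: `fake_crawl`, `fake_convert`, and `run_pipeline_sketch` are illustrative names, the crawler reads from an in-memory dict instead of the network, and link rewriting, asset downloads, and the Rich progress bars are left out to keep the example standard-library only. It is meant to show the control flow (async iteration, skipping writes on a dry run, collecting statistics), not the actual implementation.

```python
import asyncio
import time
from pathlib import Path
from typing import Any, AsyncIterator


async def fake_crawl(pages: dict[str, str]) -> AsyncIterator[tuple[str, str]]:
    """Stand-in crawler: yields (url, html) pairs from an in-memory dict."""
    for url, html in pages.items():
        await asyncio.sleep(0)  # yield control, as a real fetch would
        yield url, html


def fake_convert(html: str) -> str:
    """Stand-in converter: a real one would do full HTML-to-Markdown."""
    return html.replace("<h1>", "# ").replace("</h1>", "")


async def run_pipeline_sketch(
    pages: dict[str, str], out_dir: Path, dry_run: bool = False
) -> dict[str, Any]:
    start = time.monotonic()
    stats: dict[str, Any] = {
        "pages_crawled": 0, "pages_failed": 0, "errors": {}, "files": [],
    }
    async for url, html in fake_crawl(pages):
        try:
            markdown = fake_convert(html)
            # Derive a file name from the last URL path segment.
            target = out_dir / (url.rstrip("/").rsplit("/", 1)[-1] + ".md")
            if not dry_run:  # a dry run performs the work but writes nothing
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_text(markdown, encoding="utf-8")
                stats["files"].append(str(target))
            stats["pages_crawled"] += 1
        except Exception as exc:
            # Count failures per exception type instead of aborting the run.
            stats["pages_failed"] += 1
            name = type(exc).__name__
            stats["errors"][name] = stats["errors"].get(name, 0) + 1
    stats["execution_time"] = time.monotonic() - start
    return stats
```

Running `asyncio.run(run_pipeline_sketch({"https://example.com/intro": "<h1>Intro</h1>"}, Path("out")))` writes `out/intro.md`; with `dry_run=True` the same call leaves the filesystem untouched and returns an empty `files` list.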
