Scraper

scraper

Scraper pipeline orchestration.

Orchestrates the complete scraping workflow, coordinating crawler, converter, outputs, and assets with Rich progress display. Main entry point: run_scraper().

Classes

ScraperStats

Bases: TypedDict

Type definition for scraper statistics dictionary.

PagePreview dataclass

PagePreview(url: str, output_path: str, title: str | None = None)

Preview information for a single page.

AssetPreview dataclass

AssetPreview(url: str, output_path: str, asset_type: str)

Preview information for a single asset.

PreviewReport dataclass

PreviewReport(pages: list[PagePreview] = list(), assets: list[AssetPreview] = list(), total_pages: int = 0, total_assets: int = 0, estimated_bytes: int = 0, config_name: str = '')

Complete preview report for a dry run with the --preview flag.

Contains all pages and assets that would be scraped, plus statistics. Exported as JSON for inspection before actual scraping.
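
As a rough illustration of the JSON export, the sketch below re-declares the three dataclasses from the signatures above and serializes a report with the standard library. The `list()` defaults in the signature are assumed here to come from `field(default_factory=list)`; `report_to_json` is a hypothetical helper, not part of the documented API, and the example URLs and paths are made up.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class PagePreview:
    url: str
    output_path: str
    title: str | None = None


@dataclass
class AssetPreview:
    url: str
    output_path: str
    asset_type: str


@dataclass
class PreviewReport:
    # Assumption: mutable defaults are created per instance via default_factory.
    pages: list[PagePreview] = field(default_factory=list)
    assets: list[AssetPreview] = field(default_factory=list)
    total_pages: int = 0
    total_assets: int = 0
    estimated_bytes: int = 0
    config_name: str = ""


def report_to_json(report: PreviewReport) -> str:
    """Hypothetical helper: asdict() recurses into the nested dataclasses."""
    return json.dumps(asdict(report), indent=2)


# Illustrative values only.
report = PreviewReport(
    pages=[PagePreview(url="https://example.com/docs/", output_path="docs/index.md", title="Docs")],
    assets=[AssetPreview(url="https://example.com/logo.png", output_path="assets/logo.png", asset_type="image")],
    total_pages=1,
    total_assets=1,
    config_name="example",
)
print(report_to_json(report))
```

The real export path may differ; the point is only that the nested page and asset entries flatten into plain JSON objects that can be inspected before an actual scrape.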

Functions

run_scraper async

run_scraper(config: SusConfig, dry_run: bool = False, max_pages: int | None = None, preview: bool = False) -> dict[str, Any]

Run the complete scraping pipeline.

Parameters:

  • config (SusConfig, required): Validated SusConfig instance
  • dry_run (bool, default False): If True, don't write files to disk
  • max_pages (int | None, default None): Maximum number of pages to crawl (None = unlimited)
  • preview (bool, default False): If True, return summary without writing files

Returns:

dict[str, Any]

Dictionary with scraping statistics:

  • pages_crawled: Number of pages successfully crawled
  • pages_failed: Number of pages that failed
  • assets_downloaded: Number of assets downloaded
  • assets_failed: Number of assets that failed
  • total_bytes: Total bytes downloaded
  • execution_time: Time taken in seconds
  • errors: Dict of error types and their occurrences
  • files: List of file paths that were written
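
The keys listed above could be described with a TypedDict along the following lines. This is a sketch only: the page does not show the fields of ScraperStats or state that run_scraper returns one, so the value types below are assumptions inferred from the descriptions, and `empty_stats` is a hypothetical helper.

```python
from typing import TypedDict


class ScraperStats(TypedDict):
    # Value types are assumptions based on the key descriptions.
    pages_crawled: int
    pages_failed: int
    assets_downloaded: int
    assets_failed: int
    total_bytes: int
    execution_time: float      # seconds
    errors: dict[str, int]     # error type -> number of occurrences
    files: list[str]           # paths of files that were written


def empty_stats() -> ScraperStats:
    """Hypothetical helper: a zeroed statistics dictionary."""
    return {
        "pages_crawled": 0,
        "pages_failed": 0,
        "assets_downloaded": 0,
        "assets_failed": 0,
        "total_bytes": 0,
        "execution_time": 0.0,
        "errors": {},
        "files": [],
    }


stats = empty_stats()
stats["pages_crawled"] += 1
print(stats["pages_crawled"], stats["files"])
```
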
Workflow

  1. Initialize components (crawler, converter, output manager, asset downloader)
  2. Set up Rich progress display with two progress bars
  3. Iterate over crawler results:
     • Convert HTML to Markdown
     • Rewrite links to relative paths
     • Save the Markdown file (skipped if dry_run or preview)
     • Download assets for the page
     • Update progress bars
  4. Display final summary with statistics and errors
  5. Return summary dict for programmatic access