Scraper
scraper
Scraper pipeline orchestration.
Orchestrates the complete scraping workflow, coordinating crawler, converter, outputs, and assets with Rich progress display. Main entry point: run_scraper().
Classes
ScraperStats
Bases: TypedDict
Type definition for scraper statistics dictionary.
PagePreview
dataclass
Preview information for a single page.
AssetPreview
dataclass
Preview information for a single asset.
PreviewReport
dataclass
PreviewReport(pages: list[PagePreview] = list(), assets: list[AssetPreview] = list(), total_pages: int = 0, total_assets: int = 0, estimated_bytes: int = 0, config_name: str = '')
Complete preview report for dry-run with --preview flag.
Contains all pages and assets that would be scraped, plus statistics. Exported as JSON for inspection before actual scraping.
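The JSON export described above can be sketched with standard-library dataclasses. This is only an illustration: the `PreviewReport` fields mirror the documented signature, but the fields of `PagePreview` and `AssetPreview` and the helper `report_to_json` are hypothetical, since this page does not document them.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PagePreview:
    # Hypothetical fields; the real PagePreview attributes are not listed on this page.
    url: str
    output_path: str


@dataclass
class AssetPreview:
    # Hypothetical fields; the real AssetPreview attributes are not listed on this page.
    url: str
    estimated_bytes: int


@dataclass
class PreviewReport:
    # Field names and defaults follow the documented signature.
    pages: list[PagePreview] = field(default_factory=list)
    assets: list[AssetPreview] = field(default_factory=list)
    total_pages: int = 0
    total_assets: int = 0
    estimated_bytes: int = 0
    config_name: str = ""


def report_to_json(report: PreviewReport) -> str:
    """Serialize a report, including its nested dataclasses, to a JSON string."""
    return json.dumps(asdict(report), indent=2)


report = PreviewReport(
    pages=[PagePreview(url="https://example.com/docs/", output_path="docs/index.md")],
    assets=[AssetPreview(url="https://example.com/logo.png", estimated_bytes=2048)],
    total_pages=1,
    total_assets=1,
    estimated_bytes=2048,
    config_name="example",
)
preview_json = report_to_json(report)
```

`asdict` recurses into nested dataclasses and lists, so the page and asset entries serialize without custom encoders.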
Functions
run_scraper
async
run_scraper(config: SusConfig, dry_run: bool = False, max_pages: int | None = None, preview: bool = False) -> dict[str, Any]
Run the complete scraping pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | SusConfig | Validated SusConfig instance | required |
| dry_run | bool | If True, don't write files to disk | False |
| max_pages | int \| None | Maximum number of pages to crawl (None = unlimited) | None |
| preview | bool | If True, return summary without writing files | False |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with scraping statistics |
Workflow
- Initialize components (crawler, converter, output manager, asset downloader)
- Set up the Rich progress display with two progress bars
- Iterate over crawler results:
  - Convert HTML to Markdown
  - Rewrite links to relative paths
  - Save the Markdown file (skipped if dry_run or preview is set)
  - Download assets for the page
  - Update progress bars
- Display the final summary with statistics and errors
- Return the summary dict for programmatic access
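The control flow of the workflow above can be sketched with in-memory stand-ins. This is not the library's implementation: `CrawlResult`, `crawl`, `convert`, `run_pipeline`, and the statistics keys are all hypothetical names chosen for illustration, and link rewriting, asset downloads, and the Rich progress display are left out to keep the example self-contained.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Any, AsyncIterator


@dataclass
class CrawlResult:
    # Hypothetical shape of a single crawler result.
    url: str
    html: str


async def crawl(pages: dict[str, str], max_pages: int | None) -> AsyncIterator[CrawlResult]:
    """Stand-in crawler: yields in-memory pages instead of fetching them."""
    for count, (url, html) in enumerate(pages.items()):
        if max_pages is not None and count >= max_pages:
            return
        await asyncio.sleep(0)  # yield control, as real network I/O would
        yield CrawlResult(url, html)


def convert(html: str) -> str:
    """Stand-in converter: a real one would turn full HTML into Markdown."""
    return html.replace("<h1>", "# ").replace("</h1>", "")


async def run_pipeline(
    pages: dict[str, str],
    dry_run: bool = False,
    max_pages: int | None = None,
    preview: bool = False,
) -> dict[str, Any]:
    written: dict[str, str] = {}  # stands in for files on disk
    # Hypothetical statistics keys; the real dictionary's keys are not listed here.
    stats: dict[str, Any] = {"pages_crawled": 0, "pages_written": 0, "errors": []}
    async for result in crawl(pages, max_pages):
        stats["pages_crawled"] += 1
        try:
            markdown = convert(result.html)
        except Exception as exc:  # record the failure and keep crawling
            stats["errors"].append(f"{result.url}: {exc}")
            continue
        if not (dry_run or preview):  # both flags suppress writes
            written[result.url] = markdown
            stats["pages_written"] += 1
    return stats


pages = {
    "https://example.com/a": "<h1>A</h1>",
    "https://example.com/b": "<h1>B</h1>",
}
stats = asyncio.run(run_pipeline(pages))
dry_stats = asyncio.run(run_pipeline(pages, dry_run=True, max_pages=1))
```

Collecting errors into the returned dictionary instead of raising lets one failed page be reported in the final summary without stopping the crawl.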