Crawler¶

crawler ¶

Async web crawler.

Async HTTP crawling with token bucket rate limiting, concurrency control, and robots.txt compliance. Provides Crawler (queue-based async crawler) and RateLimiter (burst-friendly throttling).

Classes¶

RateLimiter ¶

RateLimiter(rate: float, burst: int = 5)

Token bucket rate limiter for burst-friendly rate limiting.

The token bucket algorithm allows for bursts of requests while maintaining an average rate limit over time. Tokens are added to the bucket at a constant rate, and each request consumes one token.

Example

limiter = RateLimiter(rate=2.0, burst=5) await limiter.acquire() # Consumes 1 token

Initialize rate limiter.

Parameters:

Name	Type	Description	Default
`rate`	`float`	Requests per second (e.g., 2.0 = 0.5s average delay)	required
`burst`	`int`	Maximum burst size (tokens in bucket)	`5`

Functions¶

acquire `async` ¶

acquire() -> None

Acquire a token, waiting if necessary.

Implements the token bucket algorithm: 1. Calculate tokens added since last update (time_passed * rate) 2. Add tokens (cap at burst size) 3. If tokens >= 1, consume token and return 4. Otherwise, sleep until next token available

CrawlResult `dataclass` ¶

CrawlResult(url: str, html: str, status_code: int, content_type: str, links: list[str], assets: list[str])

Result from crawling a single page.

Contains the page content, metadata, and extracted links/assets.

CrawlerStats `dataclass` ¶

CrawlerStats(pages_crawled: int = 0, pages_failed: int = 0, assets_discovered: int = 0, total_bytes: int = 0, start_time: float = (lambda: asyncio.get_event_loop().time())(), error_counts: dict[str, int] = dict())

Statistics collected during crawl.

Tracks pages crawled, failures, bytes downloaded, and errors by type.

RobotsTxtChecker ¶

RobotsTxtChecker(client: AsyncClient, user_agent: str = 'SUS/0.1.0')

Checks robots.txt files to determine if URLs can be crawled.

Caches robots.txt files per domain to avoid re-fetching. On fetch errors, defaults to allowing the URL (graceful degradation).

Example

checker = RobotsTxtChecker(client, user_agent="MyBot/1.0") allowed = await checker.is_allowed("https://example.com/page")

Initialize robots.txt checker.

Parameters:

Name	Type	Description	Default
`client`	`AsyncClient`	HTTP client for fetching robots.txt files	required
`user_agent`	`str`	User agent string to use for checking rules	`'SUS/0.1.0'`

Functions¶

is_allowed `async` ¶

is_allowed(url: str) -> bool

Check if URL is allowed by robots.txt.

Parameters:

Name	Type	Description	Default
`url`	`str`	URL to check	required

Returns:

Type	Description
`bool`	True if allowed (or on fetch error), False if disallowed

Crawler ¶

Crawler(config: SusConfig, client: AsyncClient | None = None)

Async web crawler with rate limiting and concurrency control.

Features: - Token bucket rate limiting for burst-friendly rate control - Global and per-domain concurrency limits - Exponential backoff retry logic - Dependency injection for testability - Content-type aware handling

Example

config = load_config(Path("config.yaml")) crawler = Crawler(config) async for result in crawler.crawl(): ... print(f"Crawled: {result.url}")

Initialize crawler.

Parameters:

Name	Type	Description	Default
`config`	`SusConfig`	Validated configuration	required
`client`	`AsyncClient \| None`	Optional HTTP client (for testing with mocks)	`None`

Functions¶

crawl `async` ¶

crawl() -> AsyncGenerator[CrawlResult, None]

Crawl pages starting from start_urls.

Implements queue-based crawling with concurrency control. Pages are fetched in parallel up to the configured concurrency limits, and new links are added to the queue as they are discovered.

Yields:

Type	Description
`AsyncGenerator[CrawlResult, None]`	CrawlResult for each successfully crawled page

Crawler¶

crawler ¶

Classes¶

RateLimiter ¶

Functions¶

acquire async ¶

CrawlResult dataclass ¶

CrawlerStats dataclass ¶

RobotsTxtChecker ¶

Functions¶

is_allowed async ¶

Crawler ¶

Functions¶

crawl async ¶

acquire `async` ¶

CrawlResult `dataclass` ¶

CrawlerStats `dataclass` ¶

is_allowed `async` ¶

crawl `async` ¶