Crawler
crawler
Async web crawler.
Async HTTP crawling with token bucket rate limiting, concurrency control, and robots.txt compliance. Provides Crawler (queue-based async crawler) and RateLimiter (burst-friendly throttling).
Classes
RateLimiter
Token bucket rate limiter for burst-friendly rate limiting.
The token bucket algorithm allows for bursts of requests while maintaining an average rate limit over time. Tokens are added to the bucket at a constant rate, and each request consumes one token.
Example

    limiter = RateLimiter(rate=2.0, burst=5)
    await limiter.acquire()  # Consumes 1 token

Initialize rate limiter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `rate` | `float` | Requests per second (e.g., 2.0 = 0.5s average delay) | required |
| `burst` | `int` | Maximum burst size (tokens in bucket) | `5` |
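
The token bucket behaviour described above can be sketched as follows. This is a minimal illustration, not the library's `RateLimiter`: the class name `TokenBucket`, its attributes, and the choice to start with a full bucket are all assumptions made for the example.

```python
import asyncio
import time


class TokenBucket:
    """Illustrative token bucket (hypothetical; not the library's RateLimiter)."""

    def __init__(self, rate: float, burst: int = 5) -> None:
        self.rate = rate              # tokens added per second
        self.burst = burst            # bucket capacity
        self.tokens = float(burst)    # assumption: start full so a burst is allowed
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last refill, capped at burst.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    async def acquire(self) -> None:
        async with self._lock:
            self._refill()
            while self.tokens < 1:
                # Sleep roughly long enough for one token to accumulate.
                await asyncio.sleep((1 - self.tokens) / self.rate)
                self._refill()
            self.tokens -= 1


async def demo() -> float:
    bucket = TokenBucket(rate=50.0, burst=3)
    start = time.monotonic()
    for _ in range(5):
        await bucket.acquire()
    return time.monotonic() - start


elapsed = asyncio.run(demo())
```

In this sketch the first three calls return immediately (the burst), while the remaining two wait for tokens to refill at 50 per second.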
CrawlResult (dataclass)
CrawlResult(url: str, html: str, status_code: int, content_type: str, links: list[str], assets: list[str])
Result from crawling a single page.
Contains the page content, metadata, and extracted links/assets.
CrawlerStats (dataclass)
CrawlerStats(pages_crawled: int = 0, pages_failed: int = 0, assets_discovered: int = 0, total_bytes: int = 0, start_time: float = (lambda: asyncio.get_event_loop().time())(), error_counts: dict[str, int] = dict())
Statistics collected during crawl.
Tracks pages crawled, failures, bytes downloaded, and errors by type.
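
As a rough illustration of how such counters might be maintained, here is a sketch that mirrors the field names above. It assumes the mutable and time-based defaults come from `field(default_factory=...)`, which the rendered signature suggests but does not confirm. `StatsSketch`, `record_success`, and `record_failure` are hypothetical names, and `time.monotonic` stands in for the event-loop clock so the example runs outside a running loop.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StatsSketch:
    """Hypothetical stand-in mirroring the CrawlerStats fields."""

    pages_crawled: int = 0
    pages_failed: int = 0
    assets_discovered: int = 0
    total_bytes: int = 0
    # Assumption: a factory gives each instance its own start time.
    start_time: float = field(default_factory=time.monotonic)
    # A factory avoids sharing one dict between instances.
    error_counts: dict[str, int] = field(default_factory=dict)

    def record_success(self, body: bytes, assets: int = 0) -> None:
        # Hypothetical helper: count one page and its size.
        self.pages_crawled += 1
        self.assets_discovered += assets
        self.total_bytes += len(body)

    def record_failure(self, error: Exception) -> None:
        # Hypothetical helper: group failures by exception class name.
        self.pages_failed += 1
        name = type(error).__name__
        self.error_counts[name] = self.error_counts.get(name, 0) + 1


stats = StatsSketch()
stats.record_success(b"<html></html>", assets=2)
stats.record_failure(TimeoutError("timed out"))
stats.record_failure(TimeoutError("timed out again"))
```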
RobotsTxtChecker
Checks robots.txt files to determine if URLs can be crawled.
Caches robots.txt files per domain to avoid re-fetching. On fetch errors, defaults to allowing the URL (graceful degradation).
Example

    checker = RobotsTxtChecker(client, user_agent="MyBot/1.0")
    allowed = await checker.is_allowed("https://example.com/page")

Initialize robots.txt checker.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `client` | `AsyncClient` | HTTP client for fetching robots.txt files | required |
| `user_agent` | `str` | User agent string to use for checking rules | `'SUS/0.1.0'` |
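
The per-domain caching and allow-on-error behaviour described above could look roughly like the sketch below. It uses the standard library's `urllib.robotparser` and an injected `fetch` coroutine instead of a real HTTP client so that it runs offline. `RobotsCacheSketch`, `Fetcher`, and `fake_fetch` are hypothetical names, and caching a failed fetch as "allow everything" is an assumption, not confirmed behaviour of `RobotsTxtChecker`.

```python
import asyncio
from typing import Awaitable, Callable, Optional
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

Fetcher = Callable[[str], Awaitable[str]]  # hypothetical: URL -> robots.txt text


class RobotsCacheSketch:
    def __init__(self, fetch: Fetcher, user_agent: str = "SUS/0.1.0") -> None:
        self._fetch = fetch
        self._user_agent = user_agent
        # One entry per origin; None means "fetch failed, allow everything".
        self._cache: dict[str, Optional[RobotFileParser]] = {}

    async def is_allowed(self, url: str) -> bool:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in self._cache:
            try:
                text = await self._fetch(f"{origin}/robots.txt")
                parser = RobotFileParser()
                parser.parse(text.splitlines())
                self._cache[origin] = parser
            except Exception:
                # Assumption: remember the failure and degrade to "allowed".
                self._cache[origin] = None
        cached = self._cache[origin]
        return True if cached is None else cached.can_fetch(self._user_agent, url)


calls: list[str] = []


async def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP request; records each robots.txt fetch.
    calls.append(url)
    if "broken.example" in url:
        raise ConnectionError("simulated fetch failure")
    return "User-agent: *\nDisallow: /private/\n"


async def demo() -> list[bool]:
    checker = RobotsCacheSketch(fake_fetch)
    return [
        await checker.is_allowed("https://example.com/public"),
        await checker.is_allowed("https://example.com/private/page"),
        await checker.is_allowed("https://broken.example/anything"),
    ]


results = asyncio.run(demo())
```

Here the two `example.com` checks share a single robots.txt fetch, and the simulated failure for `broken.example` results in the URL being allowed.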
Crawler
Async web crawler with rate limiting and concurrency control.
Features:

- Token bucket rate limiting for burst-friendly rate control
- Global and per-domain concurrency limits
- Exponential backoff retry logic
- Dependency injection for testability
- Content-type aware handling

Example

    config = load_config(Path("config.yaml"))
    crawler = Crawler(config)
    async for result in crawler.crawl():
        print(f"Crawled: {result.url}")

Initialize crawler.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `SusConfig` | Validated configuration | required |
| `client` | `AsyncClient \| None` | Optional HTTP client (for testing with mocks) | `None` |
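
The exponential backoff retry logic listed among the features is not specified in detail here. A generic sketch of the technique, under assumed parameters, might look like this. The function name `retry_with_backoff` and the parameters `max_attempts`, `base_delay`, and `max_delay` are hypothetical and do not come from the crawler's configuration.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    op: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 10.0,
) -> T:
    """Retry op, doubling the delay after each failure (hypothetical helper)."""
    for attempt in range(max_attempts):
        try:
            return await op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as base, 2*base, 4*base, ... capped at max_delay.
            await asyncio.sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")


attempts = 0


async def flaky() -> str:
    # Simulated operation that fails twice before succeeding.
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = asyncio.run(retry_with_backoff(flaky, max_attempts=4, base_delay=0.01))
```

Passing the operation in as a callable also reflects the dependency-injection idea: tests can supply a fake operation, much as a mock client can be passed to `Crawler`.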
Functions
crawl (async)
Crawl pages starting from start_urls.
Implements queue-based crawling with concurrency control. Pages are fetched in parallel up to the configured concurrency limits, and new links are added to the queue as they are discovered.
Yields:
| Type | Description |
|---|---|
| `AsyncGenerator[CrawlResult, None]` | A `CrawlResult` for each successfully crawled page |
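
To make the queue-based flow concrete, here is a simplified sketch of an async generator that fetches in parallel and enqueues newly discovered links. It is not the actual `crawl` implementation. `crawl_sketch`, `Fetch`, `fake_fetch`, and the limit parameters are hypothetical, the worker pool size stands in for the global concurrency limit, and results are plain `(url, html)` tuples rather than `CrawlResult` objects.

```python
import asyncio
from collections import defaultdict
from typing import AsyncIterator, Awaitable, Callable
from urllib.parse import urlsplit

# Hypothetical fetcher: returns (html, links) for a URL.
Fetch = Callable[[str], Awaitable[tuple[str, list[str]]]]


async def crawl_sketch(
    start_urls: list[str],
    fetch: Fetch,
    global_limit: int = 4,
    per_domain_limit: int = 2,
) -> AsyncIterator[tuple[str, str]]:
    queue: asyncio.Queue = asyncio.Queue()    # URLs waiting to be fetched
    results: asyncio.Queue = asyncio.Queue()  # finished pages, then a None sentinel
    seen = set(start_urls)
    domain_sems: dict = defaultdict(lambda: asyncio.Semaphore(per_domain_limit))
    for url in start_urls:
        queue.put_nowait(url)

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                # Per-domain limit; the pool size enforces the global limit.
                async with domain_sems[urlsplit(url).netloc]:
                    html, links = await fetch(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        queue.put_nowait(link)
                results.put_nowait((url, html))
            except Exception:
                pass  # failed pages are simply skipped in this sketch
            finally:
                queue.task_done()

    async def finish() -> None:
        await queue.join()        # every queued URL has been processed
        results.put_nowait(None)  # sentinel: tells the consumer to stop

    tasks = [asyncio.create_task(worker()) for _ in range(global_limit)]
    tasks.append(asyncio.create_task(finish()))
    try:
        while (item := await results.get()) is not None:
            yield item
    finally:
        for task in tasks:
            task.cancel()


SITE = {
    "https://a.example/": ["https://a.example/1", "https://b.example/"],
    "https://a.example/1": ["https://a.example/"],
    "https://b.example/": [],
}


async def fake_fetch(url: str) -> tuple[str, list[str]]:
    await asyncio.sleep(0)  # yield control, as a real request would
    return f"<html>{url}</html>", SITE[url]


async def demo() -> list[str]:
    return sorted([url async for url, _ in crawl_sketch(["https://a.example/"], fake_fetch)])


crawled = asyncio.run(demo())
```

Each worker puts its result on the results queue before marking the URL as done, so in this sketch every page is yielded before the sentinel ends the generator.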