Crawler

crawler

Async web crawler.

Async HTTP crawling with token bucket rate limiting, concurrency control, and robots.txt compliance. Provides Crawler (queue-based async crawler) and RateLimiter (burst-friendly throttling).

Classes

RateLimiter

RateLimiter(rate: float, burst: int = 5)

Token bucket rate limiter for burst-friendly rate limiting.

The token bucket algorithm allows for bursts of requests while maintaining an average rate limit over time. Tokens are added to the bucket at a constant rate, and each request consumes one token.

Example

    limiter = RateLimiter(rate=2.0, burst=5)
    await limiter.acquire()  # Consumes 1 token

Initialize rate limiter.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| rate | float | Requests per second (e.g., 2.0 = 0.5s average delay) | required |
| burst | int | Maximum burst size (tokens in bucket) | 5 |

Functions

acquire async
acquire() -> None

Acquire a token, waiting if necessary.

Implements the token bucket algorithm:

1. Calculate tokens added since last update (time_passed * rate)
2. Add tokens (cap at burst size)
3. If tokens >= 1, consume token and return
4. Otherwise, sleep until next token available
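The four steps above can be sketched as follows. This is an illustrative reimplementation, not the library's source: the class name `TokenBucket`, the internal lock, and the choice to start with a full bucket are all assumptions.

```python
import asyncio
import time


class TokenBucket:
    """Illustrative token bucket; a sketch, not the RateLimiter source."""

    def __init__(self, rate: float, burst: int = 5) -> None:
        self.rate = rate              # tokens added per second
        self.burst = burst            # bucket capacity
        self.tokens = float(burst)    # assumption: the bucket starts full
        self.last_update = time.monotonic()
        self._lock = asyncio.Lock()   # assumption: callers are serialized

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Steps 1-2: refill for the elapsed time, capped at burst.
                elapsed = now - self.last_update
                self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
                self.last_update = now
                # Step 3: consume a token if one is available.
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Step 4: sleep until the missing fraction has been refilled.
                await asyncio.sleep((1 - self.tokens) / self.rate)
```

With `rate=2.0` and `burst=5`, the first five calls would return immediately under these assumptions, and later calls would settle to roughly one every 0.5 seconds.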

CrawlResult dataclass

CrawlResult(url: str, html: str, status_code: int, content_type: str, links: list[str], assets: list[str])

Result from crawling a single page.

Contains the page content, metadata, and extracted links/assets.

CrawlerStats dataclass

CrawlerStats(pages_crawled: int = 0, pages_failed: int = 0, assets_discovered: int = 0, total_bytes: int = 0, start_time: float = (lambda: asyncio.get_event_loop().time())(), error_counts: dict[str, int] = dict())

Statistics collected during crawl.

Tracks pages crawled, failures, bytes downloaded, and errors by type.

RobotsTxtChecker

RobotsTxtChecker(client: AsyncClient, user_agent: str = 'SUS/0.1.0')

Checks robots.txt files to determine if URLs can be crawled.

Caches robots.txt files per domain to avoid re-fetching. On fetch errors, defaults to allowing the URL (graceful degradation).

Example

    checker = RobotsTxtChecker(client, user_agent="MyBot/1.0")
    allowed = await checker.is_allowed("https://example.com/page")

Initialize robots.txt checker.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| client | AsyncClient | HTTP client for fetching robots.txt files | required |
| user_agent | str | User agent string to use for checking rules | 'SUS/0.1.0' |

Functions

is_allowed async
is_allowed(url: str) -> bool

Check if URL is allowed by robots.txt.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| url | str | URL to check | required |

Returns:

| Type | Description |
|------|-------------|
| bool | True if allowed (or on fetch error), False if disallowed |
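The caching and allow-on-error behavior described for RobotsTxtChecker could look roughly like the sketch below. It is built on the standard library's `urllib.robotparser`. The class name `CachedRobots` and the injected `fetch` callable are hypothetical stand-ins for the real class and its `AsyncClient`, so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical fetcher: takes a robots.txt URL, returns its text, raises on failure.
Fetch = Callable[[str], Awaitable[str]]


class CachedRobots:
    """Illustrative per-origin robots.txt cache; not the library's source."""

    def __init__(self, fetch: Fetch, user_agent: str = "SUS/0.1.0") -> None:
        self._fetch = fetch
        self._user_agent = user_agent
        # One entry per origin; None records a failed fetch (allow everything).
        self._cache: dict[str, RobotFileParser | None] = {}

    async def is_allowed(self, url: str) -> bool:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in self._cache:
            try:
                text = await self._fetch(f"{origin}/robots.txt")
                parser = RobotFileParser()
                parser.parse(text.splitlines())
                self._cache[origin] = parser
            except Exception:
                # Graceful degradation: a fetch error means "allowed".
                self._cache[origin] = None
        parser = self._cache[origin]
        return True if parser is None else parser.can_fetch(self._user_agent, url)
```

Caching the failure as well as the success keeps an unreachable host from being re-requested for every URL on that origin; whether the real checker does the same is not stated in this reference.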

Crawler

Crawler(config: SusConfig, client: AsyncClient | None = None)

Async web crawler with rate limiting and concurrency control.

Features:

- Token bucket rate limiting for burst-friendly rate control
- Global and per-domain concurrency limits
- Exponential backoff retry logic
- Dependency injection for testability
- Content-type aware handling
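As an illustration of the exponential backoff item in the feature list, a retry helper might be sketched as below. The function name `retry_with_backoff`, its parameters, and their defaults are hypothetical; this reference does not show how the crawler configures or implements its retries.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    op: Callable[[], Awaitable[T]],
    max_attempts: int = 3,     # hypothetical default, not taken from SusConfig
    base_delay: float = 0.5,   # hypothetical default
    factor: float = 2.0,       # hypothetical default
) -> T:
    """Await op(), retrying failures with geometrically growing delays."""
    attempt = 0
    while True:
        try:
            return await op()
        except Exception:
            attempt += 1
            if attempt >= max_attempts:
                raise  # out of attempts: surface the last error
            # Delays run base, base*factor, base*factor**2, ...
            await asyncio.sleep(base_delay * factor ** (attempt - 1))
```

A production version would usually retry only on transient errors (timeouts, 5xx responses) and add random jitter to the delay; both are omitted here to keep the sketch short.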

Example

    config = load_config(Path("config.yaml"))
    crawler = Crawler(config)
    async for result in crawler.crawl():
        print(f"Crawled: {result.url}")

Initialize crawler.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| config | SusConfig | Validated configuration | required |
| client | AsyncClient \| None | Optional HTTP client (for testing with mocks) | None |

Functions

crawl async
crawl() -> AsyncGenerator[CrawlResult, None]

Crawl pages starting from start_urls.

Implements queue-based crawling with concurrency control. Pages are fetched in parallel up to the configured concurrency limits, and new links are added to the queue as they are discovered.

Yields:

| Type | Description |
|------|-------------|
| AsyncGenerator[CrawlResult, None] | CrawlResult for each successfully crawled page |
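To make the crawl loop concrete, here is a minimal sketch of frontier-driven crawling with global and per-domain limits. It is not the Crawler implementation: the function signature, the `fetch_links` callable, and the limit defaults are assumptions, and a set of in-flight tasks stands in for an explicit queue.

```python
import asyncio
from collections import defaultdict
from typing import AsyncIterator, Awaitable, Callable
from urllib.parse import urlsplit

# Hypothetical fetcher: returns the links found on the page at the given URL.
FetchLinks = Callable[[str], Awaitable[list[str]]]


async def crawl(
    start_urls: list[str],
    fetch_links: FetchLinks,
    global_limit: int = 4,       # hypothetical default
    per_domain_limit: int = 2,   # hypothetical default
) -> AsyncIterator[tuple[str, list[str]]]:
    global_sem = asyncio.Semaphore(global_limit)
    domain_sems: dict[str, asyncio.Semaphore] = defaultdict(
        lambda: asyncio.Semaphore(per_domain_limit)
    )
    seen = set(start_urls)  # every URL ever scheduled, to avoid revisits

    async def visit(url: str) -> tuple[str, list[str]]:
        # Hold a global slot and a slot for this URL's domain while fetching.
        async with global_sem, domain_sems[urlsplit(url).netloc]:
            return url, await fetch_links(url)

    pending = {asyncio.create_task(visit(url)) for url in seen}
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                try:
                    url, links = task.result()
                except Exception:
                    continue  # a real crawler would record the failure
                # Schedule newly discovered links exactly once.
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        pending.add(asyncio.create_task(visit(link)))
                yield url, links
    finally:
        for task in pending:
            task.cancel()  # stop outstanding fetches if the consumer exits early
```

Yielding each page as soon as its task completes lets the caller process results while other fetches are still in flight, which matches the `async for` usage shown in the Crawler example.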