URL Rules
rules
URL filtering and crawling rules.
URL normalization, validation, and rule-based filtering for controlling crawl scope. Provides URLNormalizer (URL consistency), RulesEngine (pattern matching), and LinkExtractor (HTML parsing).
Classes
URLNormalizer
Centralized URL normalization and validation.
Utilities for:
- Normalizing URLs (lowercase scheme/hostname, remove default ports, strip fragments)
- Filtering dangerous schemes (javascript:, data:, file:, etc.)
- Handling query parameters (strip or preserve strategies)
Functions
normalize_url (staticmethod)
Normalize URL for consistent handling.
Normalizations applied:
- Convert scheme to lowercase (HTTP → http)
- Convert hostname to lowercase (Example.COM → example.com)
- Remove default ports (http://example.com:80 → http://example.com)
- Normalize percent-encoding (%7E → ~)
- Remove fragments (#section)
- Handle trailing slashes consistently (they are kept)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `url` | `str` | URL to normalize | required |

Returns:
| Type | Description |
|---|---|
| `str` | Normalized URL |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If URL is empty or malformed |
Examples:
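To illustrate the normalizations listed for this method, here is a minimal standalone sketch built only on `urllib.parse`. It is not the library's implementation: the name `normalize_url_sketch` and the helper `_decode_unreserved` are hypothetical, and the sketch simplifies by dropping userinfo and not handling IPv6 literals.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these two default ports are stripped.
DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def _decode_unreserved(text: str) -> str:
    """Decode percent-escapes of unreserved characters (e.g. %7E -> ~)."""
    def repl(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Other escapes are kept, with their hex digits uppercased.
        return char if char in UNRESERVED else match.group(0).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def normalize_url_sketch(url: str) -> str:
    """Hypothetical stand-in for the documented normalization steps."""
    if not url or not url.strip():
        raise ValueError("URL is empty")
    parts = urlsplit(url.strip())
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"Malformed URL: {url!r}")

    scheme = parts.scheme.lower()
    host = parts.hostname.lower()  # simplification: userinfo is dropped
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"

    path = _decode_unreserved(parts.path)    # trailing slash is kept as-is
    query = _decode_unreserved(parts.query)
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment removed


print(normalize_url_sketch("HTTP://Example.COM:80/%7Euser/#section"))
```

Under these assumptions the call above prints `http://example.com/~user/`, and an empty string raises `ValueError`.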
filter_dangerous_schemes (staticmethod)
Return True if URL scheme is safe (http/https), False otherwise.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `url` | `str` | URL to check | required |

Returns:
| Type | Description |
|---|---|
| `bool` | True if scheme is safe, False otherwise |
Examples:
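The scheme check described above can be approximated with an allow-list. This is a sketch under assumptions, not the library's code: `is_safe_scheme` is a hypothetical name, and treating scheme-less (relative) or unparsable input as unsafe is a choice made here for illustration.

```python
from urllib.parse import urlsplit

# The documentation names http and https as the safe schemes.
SAFE_SCHEMES = {"http", "https"}


def is_safe_scheme(url: str) -> bool:
    """Return True only when the URL's scheme is in the allow-list."""
    try:
        scheme = urlsplit(url.strip()).scheme.lower()
    except ValueError:
        # Assumption: input that cannot be parsed is treated as unsafe.
        return False
    return scheme in SAFE_SCHEMES


for candidate in (
    "https://example.com/page",
    "javascript:alert(1)",
    "data:text/html,<b>hi</b>",
    "mailto:user@example.com",
):
    print(candidate, is_safe_scheme(candidate))
```

An allow-list is used rather than a deny-list so that any scheme not explicitly named is rejected.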
handle_query_parameters (staticmethod)
Handle query parameters based on strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `url` | `str` | URL to process | required |
| `strategy` | `Literal['strip', 'preserve']` | "strip" removes all query params, "preserve" keeps them | `'strip'` |

Returns:
| Type | Description |
|---|---|
| `str` | URL with query parameters handled according to strategy |
Examples:
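The two strategies can be sketched with `urllib.parse` as follows. The function name `handle_query_sketch` is hypothetical, and raising `ValueError` for an unknown strategy is an assumption; the documentation does not say how the real method reacts to other values.

```python
from typing import Literal
from urllib.parse import urlsplit, urlunsplit


def handle_query_sketch(
    url: str, strategy: Literal["strip", "preserve"] = "strip"
) -> str:
    """Hypothetical stand-in: drop or keep the query string."""
    if strategy == "preserve":
        return url
    if strategy != "strip":
        # Assumption: unknown strategies are rejected.
        raise ValueError(f"Unknown strategy: {strategy!r}")
    parts = urlsplit(url)
    # SplitResult is a named tuple, so _replace returns a modified copy.
    return urlunsplit(parts._replace(query=""))


print(handle_query_sketch("http://example.com/page?a=1&b=2"))
print(handle_query_sketch("http://example.com/page?a=1&b=2", "preserve"))
```

With "strip" the first call yields `http://example.com/page`; with "preserve" the URL comes back unchanged.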
RulesEngine
Evaluates crawling rules to determine which URLs to follow.
The RulesEngine applies whitelist/blacklist patterns and depth limits to control which URLs should be crawled.
Initialize with configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `SusConfig` | SusConfig containing crawling rules | required |
Functions
should_follow
Determine if URL should be crawled.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `url` | `str` | URL to evaluate (should be normalized before calling) | required |
| `parent_url` | `str \| None` | Parent URL that linked to this URL (None for start URLs) | `None` |

Returns:
| Type | Description |
|---|---|
| `bool` | True if URL should be crawled |
Logic:
1. Check if allowed domain
2. Check depth limit (if configured)
3. Check exclude patterns (blacklist): return False if matched
4. Check include patterns (whitelist): return True if matched, False if no includes
Examples:
>>> config = SusConfig(
...     name="test",
...     site=SiteConfig(
...         start_urls=["http://example.com"],
...         allowed_domains=["example.com"]
...     ),
...     crawling=CrawlingRules(
...         include_patterns=[
...             PathPattern(pattern="/docs/", type="prefix")
...         ],
...         depth_limit=2
...     )
... )
>>> engine = RulesEngine(config)
>>> engine.should_follow("http://example.com/docs/guide", None)
True
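The evaluation order listed under "Logic" can be sketched as a plain function. This is an illustration only: `should_follow_sketch` and all of its parameters are hypothetical, only prefix patterns are modeled, and the crawl depth is passed in explicitly because the documentation does not describe how the engine tracks it.

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit


def should_follow_sketch(
    url: str,
    *,
    allowed_domains: Iterable[str],
    include_prefixes: Iterable[str],
    exclude_prefixes: Iterable[str] = (),
    depth: int = 0,
    depth_limit: Optional[int] = None,
) -> bool:
    """Hypothetical stand-in for the four documented checks, in order."""
    parts = urlsplit(url)

    # 1. Domain check.
    if parts.hostname not in set(allowed_domains):
        return False
    # 2. Depth limit, only when one is configured.
    if depth_limit is not None and depth > depth_limit:
        return False
    # 3. Exclude patterns win over include patterns.
    if any(parts.path.startswith(prefix) for prefix in exclude_prefixes):
        return False
    # 4. Follow only if some include pattern matches.
    return any(parts.path.startswith(prefix) for prefix in include_prefixes)


print(
    should_follow_sketch(
        "http://example.com/docs/guide",
        allowed_domains=["example.com"],
        include_prefixes=["/docs/"],
        depth_limit=2,
    )
)
```

Checking exclusions before inclusions means a URL that matches both lists is rejected, which mirrors the order of steps 3 and 4.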
LinkExtractor
Extracts and normalizes links from HTML.
Handles relative URL resolution, normalization, and filtering of extracted links.
Initialize with CSS selectors for links.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `selectors` | `list[str]` | List of CSS selectors for extracting links (e.g., `["a[href]", "link[href]"]`) | required |
Examples:
>>> extractor = LinkExtractor(["a[href]"])
>>> extractor = LinkExtractor(["a[href]", "link[href]", "area[href]"])
Note
CSS selectors are converted to XPath internally, because lxml can evaluate XPath expressions without the optional cssselect package.
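As a rough illustration of that conversion, the sketch below maps the simple `tag[attr]` selectors used in the examples to XPath strings. The function `simple_css_to_xpath` is hypothetical and covers only this one selector shape; how the library actually performs the conversion is not documented here.

```python
import re

# Matches selectors of the form tag[attr], e.g. "a[href]".
_SIMPLE_SELECTOR = re.compile(
    r"^([A-Za-z][A-Za-z0-9]*)\[([A-Za-z][A-Za-z0-9_-]*)\]$"
)


def simple_css_to_xpath(selector: str) -> str:
    """Translate a tag[attr] CSS selector to an equivalent XPath."""
    match = _SIMPLE_SELECTOR.match(selector.strip())
    if match is None:
        # Assumption: anything more complex is rejected in this sketch.
        raise ValueError(f"Unsupported selector: {selector!r}")
    tag, attr = match.groups()
    # Select every <tag> element, at any depth, that has the attribute.
    return f"//{tag}[@{attr}]"


for css in ("a[href]", "link[href]", "area[href]"):
    print(css, "->", simple_css_to_xpath(css))
```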
Functions
extract_links
Extract all links from HTML.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `html` | `str` | HTML content to extract links from | required |
| `base_url` | `str` | Base URL for resolving relative links | required |

Returns:
| Type | Description |
|---|---|
| `set[str]` | Set of absolute, normalized URLs |
Steps:
1. Parse HTML with lxml
2. Extract links using CSS selectors
3. Convert relative to absolute using urllib.parse.urljoin()
4. Normalize each URL using URLNormalizer
5. Remove fragments
6. Filter dangerous schemes
7. Deduplicate
Examples:
>>> html_content = '''
... <html>
... <a href="/page1">Page 1</a>
... <a href="http://example.com/page2">Page 2</a>
... <a href="mailto:user@example.com">Email</a>
... </html>
... '''
>>> extractor = LinkExtractor(["a[href]"])
>>> links = extractor.extract_links(html_content, "http://example.com/")
>>> "http://example.com/page1" in links
True
>>> "mailto:user@example.com" in links
False
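The extraction steps can also be sketched end to end with the standard library alone. Note the substitutions: this sketch uses `html.parser` in place of lxml, hard-codes `<a href>` instead of taking CSS selectors, and skips the full URLNormalizer pass. The names `_HrefCollector` and `extract_links_sketch` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class _HrefCollector(HTMLParser):
    """Collect raw href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links_sketch(html: str, base_url: str) -> set[str]:
    """Resolve, defragment, filter, and deduplicate links."""
    collector = _HrefCollector()
    collector.feed(html)

    links: set[str] = set()  # a set deduplicates automatically
    for href in collector.hrefs:
        absolute = urljoin(base_url, href.strip())  # relative -> absolute
        absolute, _fragment = urldefrag(absolute)   # drop #fragment
        if urlsplit(absolute).scheme.lower() in {"http", "https"}:
            links.add(absolute)
    return links


sample = """
<a href="/page1">Page 1</a>
<a href="/page1#top">Page 1, top</a>
<a href="http://example.com/page2">Page 2</a>
<a href="mailto:user@example.com">Email</a>
"""
print(sorted(extract_links_sketch(sample, "http://example.com/")))
```

In this sketch `/page1` and `/page1#top` collapse to a single entry once the fragment is removed, and the `mailto:` link is filtered out by the scheme check.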