Skip to content

URL Rules

rules

URL filtering and crawling rules.

URL normalization, validation, and rule-based filtering for controlling crawl scope. Provides URLNormalizer (URL consistency), RulesEngine (pattern matching), and LinkExtractor (HTML parsing).

Classes

URLNormalizer

Centralized URL normalization and validation.

Utilities for: - Normalizing URLs (lowercase scheme/hostname, remove default ports, strip fragments) - Filtering dangerous schemes (javascript:, data:, file:, etc.) - Handling query parameters (strip or preserve strategies)

Functions

normalize_url staticmethod
normalize_url(url: str) -> str

Normalize URL for consistent handling.

Normalizations applied: - Convert scheme to lowercase (HTTP → http) - Convert hostname to lowercase (Example.COM → example.com) - Remove default ports (http://example.com:80 → http://example.com) - Normalize percent-encoding (%7E → ~) - Remove fragments (#section) - Handle trailing slashes consistently (keeps them)

Parameters:

Name Type Description Default
url str

URL to normalize

required

Returns:

Type Description
str

Normalized URL

Raises:

Type Description
ValueError

If URL is empty or malformed

Examples:

>>> URLNormalizer.normalize_url("HTTP://Example.COM:80/Path")
'http://example.com/Path'
>>> URLNormalizer.normalize_url("https://example.com/path#section")
'https://example.com/path'
>>> URLNormalizer.normalize_url("http://example.com:8080/path")
'http://example.com:8080/path'
filter_dangerous_schemes staticmethod
filter_dangerous_schemes(url: str) -> bool

Return True if URL scheme is safe (http/https), False otherwise.

Parameters:

Name Type Description Default
url str

URL to check

required

Returns:

Type Description
bool

True if scheme is safe, False otherwise

Examples:

>>> URLNormalizer.filter_dangerous_schemes("http://example.com")
True
>>> URLNormalizer.filter_dangerous_schemes("mailto:user@example.com")
False
>>> URLNormalizer.filter_dangerous_schemes("javascript:alert('xss')")
False
handle_query_parameters staticmethod
handle_query_parameters(url: str, strategy: Literal['strip', 'preserve'] = 'strip') -> str

Handle query parameters based on strategy.

Parameters:

Name Type Description Default
url str

URL to process

required
strategy Literal['strip', 'preserve']

"strip" removes all query params, "preserve" keeps them

'strip'

Returns:

Type Description
str

URL with query parameters handled according to strategy

Examples:

>>> URLNormalizer.handle_query_parameters(
...     "http://example.com/path?foo=bar&baz=qux",
...     strategy="strip"
... )
'http://example.com/path'
>>> URLNormalizer.handle_query_parameters(
...     "http://example.com/path?foo=bar",
...     strategy="preserve"
... )
'http://example.com/path?foo=bar'

RulesEngine

RulesEngine(config: SusConfig)

Evaluates crawling rules to determine which URLs to follow.

The RulesEngine applies whitelist/blacklist patterns and depth limits to control which URLs should be crawled.

Initialize with configuration.

Parameters:

Name Type Description Default
config SusConfig

SusConfig containing crawling rules

required

Functions

should_follow
should_follow(url: str, parent_url: str | None = None) -> bool

Determine if URL should be crawled.

Parameters:

Name Type Description Default
url str

URL to evaluate (should be normalized before calling)

required
parent_url str | None

Parent URL that linked to this URL (None for start URLs)

None

Returns:

Type Description
bool

True if URL should be crawled

Logic: 1. Check if allowed domain 2. Check depth limit (if configured) 3. Check exclude patterns (blacklist) - return False if matched 4. Check include patterns (whitelist) - return True if matched, False if no includes

Examples:

>>> config = SusConfig(
...     name="test",
...     site=SiteConfig(
...         start_urls=["http://example.com"],
...         allowed_domains=["example.com"]
...     ),
...     crawling=CrawlingRules(
...         include_patterns=[
...             PathPattern(pattern="/docs/", type="prefix")
...         ],
...         depth_limit=2
...     )
... )
>>> engine = RulesEngine(config)
>>> engine.should_follow("http://example.com/docs/guide", None)
True

LinkExtractor

LinkExtractor(selectors: list[str])

Extracts and normalizes links from HTML.

Handles relative URL resolution, normalization, and filtering of extracted links.

Initialize with CSS selectors for links.

Parameters:

Name Type Description Default
selectors list[str]

List of CSS selectors for extracting links (e.g., ["a[href]", "link[href]"])

required

Examples:

>>> extractor = LinkExtractor(["a[href]"])
>>> extractor = LinkExtractor(["a[href]", "link[href]", "area[href]"])
Note

CSS selectors are converted to XPath internally since lxml doesn't require the cssselect package for XPath.

Functions

extract_links(html: str, base_url: str) -> set[str]

Extract all links from HTML.

Parameters:

Name Type Description Default
html str

HTML content to extract links from

required
base_url str

Base URL for resolving relative links

required

Returns:

Type Description
set[str]

Set of absolute, normalized URLs

Steps: 1. Parse HTML with lxml 2. Extract links using CSS selectors 3. Convert relative to absolute using urllib.parse.urljoin() 4. Normalize each URL using URLNormalizer 5. Remove fragments 6. Filter dangerous schemes 7. Deduplicate

Examples:

>>> html_content = '''
... <html>
...   <a href="/page1">Page 1</a>
...   <a href="http://example.com/page2">Page 2</a>
...   <a href="mailto:user@example.com">Email</a>
... </html>
... '''
>>> extractor = LinkExtractor(["a[href]"])
>>> links = extractor.extract_links(html_content, "http://example.com/")
>>> "http://example.com/page1" in links
True
>>> "mailto:user@example.com" in links
False