URL Rules¶

rules ¶

URL filtering and crawling rules.

URL normalization, validation, and rule-based filtering for controlling crawl scope. Provides URLNormalizer (URL consistency), RulesEngine (pattern matching), and LinkExtractor (HTML parsing).

Classes¶

URLNormalizer ¶

Centralized URL normalization and validation.

Utilities for:

- Normalizing URLs (lowercase scheme/hostname, remove default ports, strip fragments)
- Filtering dangerous schemes (javascript:, data:, file:, etc.)
- Handling query parameters (strip or preserve strategies)

Functions¶

normalize_url `staticmethod` ¶

normalize_url(url: str) -> str

Normalize URL for consistent handling.

Normalizations applied:

- Convert scheme to lowercase (HTTP → http)
- Convert hostname to lowercase (Example.COM → example.com)
- Remove default ports (http://example.com:80 → http://example.com)
- Normalize percent-encoding (%7E → ~)
- Remove fragments (#section)
- Handle trailing slashes consistently (keeps them)

Parameters:

Name	Type	Description	Default
`url`	`str`	URL to normalize	required

Returns:

Type	Description
`str`	Normalized URL

Raises:

Type	Description
`ValueError`	If URL is empty or malformed

Examples:

>>> URLNormalizer.normalize_url("HTTP://Example.COM:80/Path")
'http://example.com/Path'

>>> URLNormalizer.normalize_url("https://example.com/path#section")
'https://example.com/path'

>>> URLNormalizer.normalize_url("http://example.com:8080/path")
'http://example.com:8080/path'
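
The normalizations listed above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: the name `normalize_url_sketch` and the `DEFAULT_PORTS` table are assumptions made for illustration, and percent-encoding normalization, userinfo, and IPv6 literals are deliberately left out.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that carry no information and can be dropped (assumed table).
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url_sketch(url: str) -> str:
    """Hypothetical stand-in for URLNormalizer.normalize_url."""
    if not url or not url.strip():
        raise ValueError("URL is empty")
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if not scheme or not host:
        raise ValueError(f"URL is malformed: {url!r}")
    # Keep the port only when it differs from the scheme's default.
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Drop the fragment; the path (with any trailing slash) and query are kept.
    return urlunsplit((scheme, netloc, parts.path, parts.query, ""))


print(normalize_url_sketch("HTTP://Example.COM:80/Path"))
```

The sketch reproduces the three documented examples; how the real method treats edge cases such as userinfo or non-HTTP schemes is not stated in this reference.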

filter_dangerous_schemes `staticmethod` ¶

filter_dangerous_schemes(url: str) -> bool

Return True if URL scheme is safe (http/https), False otherwise.

Parameters:

Name	Type	Description	Default
`url`	`str`	URL to check	required

Returns:

Type	Description
`bool`	True if scheme is safe, False otherwise

Examples:

>>> URLNormalizer.filter_dangerous_schemes("http://example.com")
True

>>> URLNormalizer.filter_dangerous_schemes("mailto:user@example.com")
False

>>> URLNormalizer.filter_dangerous_schemes("javascript:alert('xss')")
False
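
A scheme allowlist of this kind is usually easier to reason about than a blocklist of dangerous schemes. The sketch below assumes that "safe" means exactly `http` and `https`, as the description states; the function name `is_safe_scheme` and the choice to treat unparsable input as unsafe are illustrative assumptions.

```python
from urllib.parse import urlsplit

# Only these schemes are treated as crawlable (per the description above).
SAFE_SCHEMES = frozenset({"http", "https"})


def is_safe_scheme(url: str) -> bool:
    """Hypothetical stand-in for URLNormalizer.filter_dangerous_schemes."""
    try:
        scheme = urlsplit(url.strip()).scheme.lower()
    except ValueError:
        # Assumption: input that cannot be parsed is rejected rather than raised.
        return False
    return scheme in SAFE_SCHEMES


for candidate in ("http://example.com", "mailto:user@example.com", "javascript:alert('xss')"):
    print(candidate, is_safe_scheme(candidate))
```

Note that in this sketch a scheme-less relative URL such as `/page` is also rejected, so relative links would need to be resolved to absolute URLs first.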

handle_query_parameters `staticmethod` ¶

handle_query_parameters(url: str, strategy: Literal['strip', 'preserve'] = 'strip') -> str

Handle query parameters based on strategy.

Parameters:

Name	Type	Description	Default
`url`	`str`	URL to process	required
`strategy`	`Literal['strip', 'preserve']`	"strip" removes all query params, "preserve" keeps them	`'strip'`

Returns:

Type	Description
`str`	URL with query parameters handled according to strategy

Examples:

>>> URLNormalizer.handle_query_parameters(
...     "http://example.com/path?foo=bar&baz=qux",
...     strategy="strip"
... )
'http://example.com/path'

>>> URLNormalizer.handle_query_parameters(
...     "http://example.com/path?foo=bar",
...     strategy="preserve"
... )
'http://example.com/path?foo=bar'
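
The two strategies can be illustrated with `urllib.parse`. This is a sketch under stated assumptions: `handle_query_sketch` is a hypothetical name, and raising `ValueError` for an unknown strategy is a guess, since the reference does not say how invalid values are handled.

```python
from typing import Literal
from urllib.parse import urlsplit, urlunsplit


def handle_query_sketch(
    url: str, strategy: Literal["strip", "preserve"] = "strip"
) -> str:
    """Hypothetical stand-in for URLNormalizer.handle_query_parameters."""
    if strategy == "preserve":
        # Nothing to do: the URL is returned unchanged.
        return url
    if strategy != "strip":
        # Assumption: unknown strategies are rejected.
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Rebuild the URL with an empty query component.
    return urlunsplit(urlsplit(url)._replace(query=""))


print(handle_query_sketch("http://example.com/path?foo=bar&baz=qux"))
print(handle_query_sketch("http://example.com/path?foo=bar", strategy="preserve"))
```

Stripping query parameters reduces duplicate crawls of the same page under different tracking or sorting parameters, at the cost of merging pages whose content genuinely depends on the query string.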

RulesEngine ¶

RulesEngine(config: SusConfig)

Evaluates crawling rules to determine which URLs to follow.

The RulesEngine applies whitelist/blacklist patterns and depth limits to control which URLs should be crawled.

Initialize with configuration.

Parameters:

Name	Type	Description	Default
`config`	`SusConfig`	SusConfig containing crawling rules	required

Functions¶

should_follow ¶

should_follow(url: str, parent_url: str | None = None) -> bool

Determine if URL should be crawled.

Parameters:

Name	Type	Description	Default
`url`	`str`	URL to evaluate (should be normalized before calling)	required
`parent_url`	`str \| None`	Parent URL that linked to this URL (None for start URLs)	`None`

Returns:

Type	Description
`bool`	True if URL should be crawled

Logic:

1. Check if allowed domain
2. Check depth limit (if configured)
3. Check exclude patterns (blacklist) - return False if matched
4. Check include patterns (whitelist) - return True if matched, False if no includes

Examples:

>>> config = SusConfig(
...     name="test",
...     site=SiteConfig(
...         start_urls=["http://example.com"],
...         allowed_domains=["example.com"]
...     ),
...     crawling=CrawlingRules(
...         include_patterns=[
...             PathPattern(pattern="/docs/", type="prefix")
...         ],
...         depth_limit=2
...     )
... )
>>> engine = RulesEngine(config)
>>> engine.should_follow("http://example.com/docs/guide", None)
True
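
The four-step evaluation order can be sketched as a standalone function. Everything here is an assumption made for illustration: the function name, the plain-list parameters that replace `SusConfig`, prefix-only matching, and the convention that the caller supplies the URL's depth and that a depth greater than the limit is rejected.

```python
from typing import Optional
from urllib.parse import urlsplit


def should_follow_sketch(
    url: str,
    allowed_domains: list[str],
    include_prefixes: list[str],
    exclude_prefixes: list[str],
    depth: int = 0,
    depth_limit: Optional[int] = None,
) -> bool:
    """Hypothetical stand-in for RulesEngine.should_follow (prefix patterns only)."""
    parts = urlsplit(url)
    # 1. The URL's host must be one of the allowed domains.
    if parts.hostname not in allowed_domains:
        return False
    # 2. Enforce the depth limit only when one is configured.
    if depth_limit is not None and depth > depth_limit:
        return False
    # 3. An exclude match always wins over any include match.
    if any(parts.path.startswith(prefix) for prefix in exclude_prefixes):
        return False
    # 4. Follow only when an include pattern matches; no includes means False.
    return any(parts.path.startswith(prefix) for prefix in include_prefixes)


print(should_follow_sketch(
    "http://example.com/docs/guide",
    allowed_domains=["example.com"],
    include_prefixes=["/docs/"],
    exclude_prefixes=[],
    depth_limit=2,
))
```

Checking excludes before includes means a blacklist entry such as `/docs/private/` cannot be overridden by a broader whitelist entry such as `/docs/`.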

LinkExtractor ¶

LinkExtractor(selectors: list[str])

Extracts and normalizes links from HTML.

Handles relative URL resolution, normalization, and filtering of extracted links.

Initialize with CSS selectors for links.

Parameters:

Name	Type	Description	Default
`selectors`	`list[str]`	List of CSS selectors for extracting links (e.g., ["a[href]", "link[href]"])	required

Examples:

>>> extractor = LinkExtractor(["a[href]"])
>>> extractor = LinkExtractor(["a[href]", "link[href]", "area[href]"])

Note

CSS selectors are converted to XPath internally, because lxml can evaluate XPath expressions without the cssselect package.
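
To give a sense of what such a conversion involves, here is a minimal sketch that handles only selectors of the form `tag[attr]`, such as those in the examples above. The function name `simple_css_to_xpath` and the decision to reject anything more complex are assumptions; the library's actual conversion is not documented here and may support more selector syntax.

```python
import re

# Matches only "tag[attr]" selectors, e.g. "a[href]" or "link[href]".
_SIMPLE_SELECTOR = re.compile(r"^([a-zA-Z][\w-]*)\[([a-zA-Z][\w-]*)\]$")


def simple_css_to_xpath(selector: str) -> str:
    """Translate a "tag[attr]" CSS selector into an equivalent XPath expression."""
    match = _SIMPLE_SELECTOR.match(selector.strip())
    if match is None:
        # Assumption: unsupported selectors are rejected instead of guessed at.
        raise ValueError(f"unsupported selector: {selector!r}")
    tag, attr = match.groups()
    # "//tag[@attr]" selects every <tag> element that carries the attribute.
    return f"//{tag}[@{attr}]"


print(simple_css_to_xpath("a[href]"))
```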

Functions¶

extract_links ¶

extract_links(html: str, base_url: str) -> set[str]

Extract all links from HTML.

Parameters:

Name	Type	Description	Default
`html`	`str`	HTML content to extract links from	required
`base_url`	`str`	Base URL for resolving relative links	required

Returns:

Type	Description
`set[str]`	Set of absolute, normalized URLs

Steps:

1. Parse HTML with lxml
2. Extract links using CSS selectors
3. Convert relative to absolute using urllib.parse.urljoin()
4. Normalize each URL using URLNormalizer
5. Remove fragments
6. Filter dangerous schemes
7. Deduplicate

Examples:

>>> html_content = '''
... <html>
...   <a href="/page1">Page 1</a>
...   <a href="http://example.com/page2">Page 2</a>
...   <a href="mailto:user@example.com">Email</a>
... </html>
... '''
>>> extractor = LinkExtractor(["a[href]"])
>>> links = extractor.extract_links(html_content, "http://example.com/")
>>> "http://example.com/page1" in links
True
>>> "mailto:user@example.com" in links
False
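
The extraction pipeline can be sketched without third-party dependencies. Note the substitutions: this version uses the standard library's `html.parser` in place of lxml, hard-codes the equivalent of the `a[href]` selector, and skips full normalization apart from fragment removal. The names `_HrefCollector` and `extract_links_sketch` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class _HrefCollector(HTMLParser):
    """Collects href values from <a> tags (stand-in for the "a[href]" selector)."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


def extract_links_sketch(html: str, base_url: str) -> set[str]:
    """Hypothetical stand-in for LinkExtractor.extract_links."""
    collector = _HrefCollector()
    collector.feed(html)
    links: set[str] = set()  # a set deduplicates as links are added
    for href in collector.hrefs:
        # Resolve relative links against the page URL.
        absolute = urljoin(base_url, href.strip())
        # Drop any "#fragment" suffix.
        absolute, _fragment = urldefrag(absolute)
        # Keep only http/https links.
        if urlsplit(absolute).scheme.lower() in {"http", "https"}:
            links.add(absolute)
    return links


page = '<a href="/page1">1</a> <a href="/page1#top">1 again</a> <a href="mailto:user@example.com">Email</a>'
print(extract_links_sketch(page, "http://example.com/"))
```

Because fragments are removed before the URL is added to the set, `/page1` and `/page1#top` collapse into a single entry.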
