Performance

Benchmarks were measured on Python 3.12 on a single thread. Run them with `uv run pytest tests/test_benchmark.py -v`.

Benchmark Results

`clean()` --- Per-String Cost

Scenario	Mean	Ops/sec	Description
Short, clean text (no-op)	2.8 us	358K	~38 chars, no stages fire
Short, hostile (all stages)	67 us	15K	~27 chars with homoglyphs, null bytes, zero-width, template syntax
13KB clean text	810 us	1.2K	Large clean input throughput
10KB hostile text	449 us	2.2K	Large hostile input with repeated attack patterns
100KB hostile payload	5.7 ms	176	Stress test payload

`walk()` --- Recursive Structure Cost

Scenario	Mean	Ops/sec	Description
100-item nested dict, clean	537 us	1.9K	`deepcopy` + traversal overhead, no stages fire
100-item nested dict, hostile	6.9 ms	144	`deepcopy` + full pipeline on every string

When to Use `clean()` vs `walk()`

Situation	Use
Single user input field	`clean()`
JSON request body	`walk()`
Individual form fields already extracted	`clean()` on each
Nested config from untrusted source	`walk()`
Hot path, single known string	`clean()`

`walk()` adds `deepcopy` overhead so that the original data is never modified. If you are already working with a copy, or you do not need the original left untouched, calling `clean()` on individual strings is faster.
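The trade-off above can be sketched as follows. This is a minimal illustration of the "deepcopy, then traverse" pattern, not the library's actual `walk()` implementation; `walk_sketch`, `_clean_in_place`, and the `clean_fn` parameter are hypothetical names introduced here.

```python
import copy
from typing import Any, Callable


def _clean_in_place(node: Any, clean_fn: Callable[[str], str]) -> Any:
    """Recursively apply clean_fn to every string value in dicts and lists."""
    if isinstance(node, str):
        return clean_fn(node)
    if isinstance(node, dict):
        for key in list(node):
            node[key] = _clean_in_place(node[key], clean_fn)
        return node
    if isinstance(node, list):
        for index, value in enumerate(node):
            node[index] = _clean_in_place(value, clean_fn)
        return node
    return node  # numbers, None, and other types pass through unchanged


def walk_sketch(data: Any, clean_fn: Callable[[str], str]) -> Any:
    """Hypothetical stand-in for walk(): copy first, then clean the copy."""
    return _clean_in_place(copy.deepcopy(data), clean_fn)
```

When only one or two fields are untrusted, passing those strings to the cleaning function directly avoids the copy and the traversal of everything else.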

Performance Characteristics by Stage

Stage	Cost Profile	Notes
Null bytes	O(n)	`str.replace` --- very fast
Invisible chars	O(n)	Single compiled regex --- fast
NFKC normalization	O(n)	`unicodedata.normalize` --- C implementation
Homoglyphs	O(n)	Character-by-character dict lookup --- cost grows linearly with string length
Escaper	Varies	Depends on escaper implementation

The four universal stages are each O(n) in string length, and the pipeline makes a single pass per stage (5 passes total, including the escaper, whose cost depends on its implementation). For clean text the dominant cost is the invisible character regex and NFKC normalization, because the regex `findall` check and `unicodedata.normalize` still scan the full string even when nothing changes.
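The stage order in the table can be sketched as a chain of single-pass transformations. This is an illustration under stated assumptions, not the library's code: `clean_sketch`, `INVISIBLE_RE`, and `HOMOGLYPHS` are hypothetical names, the character tables are deliberately tiny, and the escaper stage is omitted.

```python
import re
import unicodedata

# Illustrative tables only; a real implementation would cover far more characters.
INVISIBLE_RE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
HOMOGLYPHS = {"\u0430": "a", "\u0435": "e", "\u043e": "o"}  # Cyrillic look-alikes


def clean_sketch(text: str) -> str:
    """Run the four universal stages in order, one pass over the string each."""
    text = text.replace("\x00", "")                         # null bytes
    text = INVISIBLE_RE.sub("", text)                       # invisible characters
    text = unicodedata.normalize("NFKC", text)              # NFKC normalization
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)   # homoglyph mapping
    return text
```

Every stage reads the whole string even when it changes nothing, which is why clean input still costs time proportional to its length.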

Tips for Hot Paths

Batch at the boundary: Sanitize input once when it enters your system, not on every use. Store the sanitized version.

Skip `walk()` when possible: If you know the structure of your data, calling `clean()` on specific fields avoids `deepcopy` overhead.

Pre-check with `str.isascii()`: If your input is pure ASCII and contains no null bytes, you can skip the universal stages --- none of them modify ASCII text (except null bytes, which are rare in text input). The shortcut below also bypasses the escaper, so use it only where escaping is not required.

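def sanitize_if_needed(text: str, **kwargs) -> str:
    # Pure ASCII without null bytes passes through the universal stages unchanged.
    if text.isascii() and "\x00" not in text:
        return text
    # Anything else goes through the full pipeline.
    return clean(text, **kwargs)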
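The ASCII shortcut relies on NFKC normalization leaving ASCII text untouched. That part of the claim can be checked with the standard library alone; `nfkc_is_noop` is a hypothetical helper introduced for this check.

```python
import unicodedata


def nfkc_is_noop(text: str) -> bool:
    """Return True if NFKC normalization leaves the text unchanged."""
    return unicodedata.normalize("NFKC", text) == text


# Every ASCII code point, including control characters.
ALL_ASCII = "".join(chr(code) for code in range(128))
```

`nfkc_is_noop(ALL_ASCII)` holds, while a fullwidth letter such as `"\uff21"` is rewritten by NFKC and so fails the check.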

Escaper cost: The cost of the universal stages is set by the library and scales only with input length. If your custom escaper is expensive, that's where optimization efforts should focus.
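One way to find out whether an escaper dominates is to time it in isolation. The sketch below assumes an escaper is a plain callable that takes a string and returns a string, which may not match the library's actual escaper interface; `example_escaper` and `mean_seconds` are hypothetical names.

```python
import timeit
from typing import Callable


def example_escaper(text: str) -> str:
    """Hypothetical escaper that neutralizes template delimiters."""
    return text.replace("{{", "&#123;&#123;").replace("}}", "&#125;&#125;")


def mean_seconds(fn: Callable[[str], str], arg: str, number: int = 1000) -> float:
    """Average wall-clock time of one fn(arg) call over `number` runs."""
    return timeit.timeit(lambda: fn(arg), number=number) / number
```

Comparing `mean_seconds(example_escaper, sample)` with the per-string figures in the benchmark tables gives a rough sense of how much of the total the escaper accounts for.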

Running Benchmarks

# Run all benchmarks
uv run pytest tests/test_benchmark.py -v

# Run only clean() benchmarks
uv run pytest tests/test_benchmark.py -v -k "clean"

# Run only walk() benchmarks
uv run pytest tests/test_benchmark.py -v -k "walk"

Benchmarks use pytest-benchmark. The 100KB payload test uses `pedantic()` mode (50 rounds, 5 warmup rounds) to avoid excessive iterations.