Performance

Benchmarks were measured on Python 3.12 on a single thread. To reproduce them, run uv run pytest tests/test_benchmark.py -v (see Running Benchmarks).

Benchmark Results

clean(): Per-String Cost

| Scenario | Mean | Ops/sec | Description |
|---|---|---|---|
| Short, clean text (no-op) | 2.8 us | 358K | ~38 chars, no stages fire |
| Short, hostile (all stages) | 67 us | 15K | ~27 chars with homoglyphs, null bytes, zero-width, template syntax |
| 13KB clean text | 810 us | 1.2K | Large clean input throughput |
| 10KB hostile text | 449 us | 2.2K | Large hostile input with repeated attack patterns |
| 100KB hostile payload | 5.7 ms | 176 | Stress test payload |

walk(): Recursive Structure Cost

| Scenario | Mean | Ops/sec | Description |
|---|---|---|---|
| 100-item nested dict, clean | 537 us | 1.9K | deepcopy + traversal overhead, no stages fire |
| 100-item nested dict, hostile | 6.9 ms | 144 | deepcopy + full pipeline on every string |

When to Use clean() vs walk()

| Situation | Use |
|---|---|
| Single user input field | clean() |
| JSON request body | walk() |
| Individual form fields already extracted | clean() on each |
| Nested config from untrusted source | walk() |
| Hot path, single known string | clean() |

walk() adds deepcopy overhead to ensure the original data is never modified. If you're already working with a copy or don't need immutability, you can call clean() on individual strings for better performance.
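To make the deepcopy trade-off concrete, here is a minimal sketch of a copy-then-traverse walker. It is not the library's implementation: sketch_walk and its clean_fn parameter are hypothetical names, and the choice to visit only dicts and lists (leaving keys and other containers alone) is an assumption made for brevity.

```python
import copy
from typing import Any, Callable


def sketch_walk(data: Any, clean_fn: Callable[[str], str]) -> Any:
    """Deep-copy `data`, then clean every string value on the copy.

    Hypothetical stand-in for walk(); the real function may treat keys,
    tuples, or other containers differently.
    """

    def _visit(node: Any) -> Any:
        if isinstance(node, str):
            return clean_fn(node)
        if isinstance(node, dict):
            for key in node:  # values are replaced, keys are left untouched
                node[key] = _visit(node[key])
            return node
        if isinstance(node, list):
            for index, item in enumerate(node):
                node[index] = _visit(item)
            return node
        return node  # numbers, None, etc. pass through unchanged

    # The copy is what keeps the caller's data unmodified, and it is also
    # the overhead that shows up in the "clean" walk() benchmark row.
    return _visit(copy.deepcopy(data))


original = {"user": {"name": "al\x00ice"}, "tags": ["a\x00", "b"], "age": 30}
cleaned = sketch_walk(original, lambda s: s.replace("\x00", ""))
```

After this runs, cleaned holds the sanitized strings while original still contains the null bytes, which is the guarantee the copy pays for.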

Performance Characteristics by Stage

| Stage | Cost Profile | Notes |
|---|---|---|
| Null bytes | O(n) | str.replace; very fast |
| Invisible chars | O(n) | Single compiled regex; fast |
| NFKC normalization | O(n) | unicodedata.normalize; C implementation |
| Homoglyphs | O(n) | Character-by-character dict lookup; cost grows linearly with string length |
| Escaper | Varies | Depends on escaper implementation |

Every universal stage is O(n) in string length; the escaper's cost depends on its implementation. The pipeline makes one pass per stage, five passes in total. For clean text, the dominant costs are the invisible-character regex and NFKC normalization, because the regex findall check and unicodedata.normalize still scan the full string even when they change nothing.
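The five passes can be sketched as follows. This is an illustration under stated assumptions, not the library's code: sketch_clean, INVISIBLE_RE, and HOMOGLYPHS are hypothetical names, and the character tables are deliberately tiny stand-ins for whatever the real stages cover.

```python
import re
import unicodedata
from typing import Callable, Optional

# Hypothetical, deliberately small tables for illustration only.
INVISIBLE_RE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
HOMOGLYPHS = {"\u0430": "a", "\u0435": "e", "\u043e": "o"}  # Cyrillic look-alikes


def sketch_clean(text: str, escaper: Optional[Callable[[str], str]] = None) -> str:
    """Run five linear passes over `text`, one per stage."""
    text = text.replace("\x00", "")                         # 1. null bytes
    text = INVISIBLE_RE.sub("", text)                       # 2. invisible characters
    text = unicodedata.normalize("NFKC", text)              # 3. NFKC normalization
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)   # 4. homoglyph mapping
    if escaper is not None:                                 # 5. optional escaper
        text = escaper(text)
    return text


hostile = "p\u0430ss\x00w\u200bord"   # Cyrillic "а", a null byte, a zero-width space
result = sketch_clean(hostile)
```

Each stage touches every character once, which is why total cost grows linearly with input length and why clean input still pays for the scans in stages 2 and 3.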

Tips for Hot Paths

Batch at the boundary: Sanitize input once when it enters your system, not on every use. Store the sanitized version.

Skip walk() when possible: If you know the structure of your data, calling clean() on specific fields avoids deepcopy overhead.

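Pre-check with str.isascii(): If you know your input is pure ASCII, you can skip the universal stages entirely. None of them modify ASCII text, except null-byte removal, and null bytes are rare in text input. Note that returning early also bypasses any escaper passed to clean(), so use this shortcut only where no escaper is needed.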

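def sanitize_if_needed(text: str, **kwargs) -> str:
    # Pure-ASCII text without null bytes is left unchanged by the universal stages.
    if text.isascii() and "\x00" not in text:
        return text
    # Anything else goes through the full pipeline.
    return clean(text, **kwargs)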

Escaper cost: The universal stages have a predictable O(n) cost that you cannot tune. If your custom escaper is expensive, focus your optimization efforts there.
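The "skip walk()" tip above might look like this in practice. It is a sketch only: sanitize_fields is a hypothetical helper, and clean_stub stands in for the library's clean() so the example runs on its own.

```python
from typing import Callable, Iterable


def clean_stub(text: str) -> str:
    """Stand-in for the library's clean(); it only strips null bytes."""
    return text.replace("\x00", "")


def sanitize_fields(
    record: dict,
    fields: Iterable[str],
    clean_fn: Callable[[str], str] = clean_stub,
) -> dict:
    """Clean only the named top-level string fields of `record`."""
    cleaned = dict(record)  # shallow copy: far cheaper than deepcopy
    for name in fields:
        value = cleaned.get(name)
        if isinstance(value, str):
            cleaned[name] = clean_fn(value)
    return cleaned


request = {"username": "al\x00ice", "bio": "hi\x00", "age": 30}
safe = sanitize_fields(request, ("username", "bio"))
```

Calling this once where the request enters the system and storing safe also covers the "batch at the boundary" tip. The shallow copy is only sufficient when the fields of interest are top-level strings; nested untrusted data is still a case for walk().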

Running Benchmarks

# Run all benchmarks
uv run pytest tests/test_benchmark.py -v

# Run only clean() benchmarks
uv run pytest tests/test_benchmark.py -v -k "clean"

# Run only walk() benchmarks
uv run pytest tests/test_benchmark.py -v -k "walk"

Benchmarks use pytest-benchmark. The 100KB payload test uses pedantic() mode (50 rounds, 5 warmup) to avoid excessive iterations.