Skip to content

API Reference

All public symbols are exported from navi_sanitize:

from navi_sanitize import (
    clean, walk, jinja2_escaper, path_escaper, Escaper,
    decode_evasion, detect_scripts, is_mixed_script,
)

clean(text, *, escaper=None)

Sanitize a single string through the universal pipeline.

Parameters:

Parameter Type Default Description
text str (required) The string to sanitize
escaper Escaper \| None None Optional escaper function applied as the final stage

Returns: str --- the sanitized string.

Raises: TypeError --- if text is not a str, or if the escaper returns a non-str.

Stages (in order): 1. Null byte removal 2. Invisible character stripping 3. NFKC normalization 4. Homoglyph replacement 5. Re-NFKC (if homoglyphs were replaced --- ensures idempotency) 6. Escaper (if provided)

Always returns output. Logs warnings when input is modified.

Examples:

from navi_sanitize import clean, jinja2_escaper

# No-op for clean text
clean("hello world")  # "hello world"

# All stages fire
clean("n\u0430vi\x00\u200b")  # "navi"

# With escaper
clean("{{ config }}", escaper=jinja2_escaper)  # "\\{\\{ config \\}\\}"

# TypeError on non-string
clean(42)  # TypeError: clean() requires str, got int

walk(data, *, escaper=None)

Recursively sanitize every string in a dict/list/nested structure.

Uses PEP 695 generic syntax: def walk[T](data: T, *, escaper=None) -> T

Parameters:

Parameter Type Default Description
data T (required) Any Python object; strings within dicts/lists are sanitized
escaper Escaper \| None None Optional escaper function applied to each string

Returns: T --- a deep copy of the input with all strings sanitized.

Behavior by type:

Type Behavior
str Passed through clean()
dict Both keys and values sanitized recursively
list Elements sanitized recursively
tuple, set, bytes, int, float, bool, None Passed through unchanged

The original data is never modified --- walk() operates on a deepcopy.

Examples:

from navi_sanitize import walk

# Nested structure
result = walk({
    "user": "pаypal",       # Cyrillic а → Latin a
    "tags": ["te\u200bst"], # zero-width space removed
    "count": 42             # int passes through
})
# {"user": "paypal", "tags": ["test"], "count": 42}

# Dict keys are also sanitized
walk({"\u0430dmin": "value"})  # {"admin": "value"}

# Non-dict/list input
walk("he\x00llo")  # "hello"
walk(42)            # 42

Escaper

Type alias for escaper functions.

Escaper = Callable[[str], str]

Any function that accepts a str and returns a str can be used as an escaper. The escaper runs as the final pipeline stage, after all universal stages have completed. The escaper's output is not re-sanitized.


jinja2_escaper(text)

Escape Jinja2 template delimiters in a string.

Parameters:

Parameter Type Description
text str The string to escape

Returns: str --- the string with Jinja2 delimiters backslash-escaped.

What it escapes: - {{ and }} --- expression delimiters - {% and %} --- statement delimiters - {# and #} --- comment delimiters - Runs of 2+ braces ({{{, }}}) --- handles triple-brace edge cases

Uses a single-pass regex: \{{2,}|\}{2,}|\{%|%\}|\{#|#\}

Each character in a matched delimiter is individually backslash-escaped.

Examples:

from navi_sanitize import jinja2_escaper

jinja2_escaper("{{ config }}")     # "\\{\\{ config \\}\\}"
jinja2_escaper("{% import os %}")  # "\\{\\% import os \\%\\}"
jinja2_escaper("{# comment #}")    # "\\{\\# comment \\#\\}"
jinja2_escaper("{{{ triple }}}")   # "\\{\\{\\{ triple \\}\\}\\}"
jinja2_escaper("no delimiters")    # "no delimiters"

path_escaper(text)

Remove path traversal sequences from a string.

Parameters:

Parameter Type Description
text str The path string to sanitize

Returns: str --- the path with traversal sequences removed.

Algorithm: 1. Replace backslashes with forward slashes 2. Strip leading / 3. Split on / 4. Remove .. and . segments 5. Remove embedded .. within segments (handles null-byte concatenation artifacts) 6. Rejoin non-empty segments

Examples:

from navi_sanitize import path_escaper

path_escaper("../../../etc/passwd")    # "etc/passwd"
path_escaper("/etc/passwd")            # "etc/passwd"
path_escaper("foo/../../../bar")       # "foo/bar"
path_escaper("..\\..\\windows\\cmd")   # "windows/cmd"
path_escaper("safe/path/file.txt")     # "safe/path/file.txt"

Opt-in Utilities

These functions are not part of clean() and are never run automatically. They are standalone primitives you compose with the pipeline yourself.


decode_evasion(text, *, max_layers=3)

Iteratively decode nested URL, HTML entity, and hex escape encodings from a string.

Parameters:

Parameter Type Default Description
text str (required) The string to decode
max_layers int 3 Maximum decoding passes before stopping

Returns: str --- the decoded string.

Behavior: - Runs URL decoding → HTML entity unescaping → hex escape decoding (\xHH) per pass - A pass counts as one layer if the output differs from the input - Stops when a pass produces no change or max_layers is reached - max_layers <= 0 is a no-op (returns text unchanged) - Invalid or partial encodings do not raise --- they pass through unchanged - Logs a warning with the layer count when decoding occurs; never includes decoded content in log messages

Examples:

from navi_sanitize import decode_evasion, clean, path_escaper

# Single layer of URL encoding
decode_evasion("%2e%2e%2fetc%2fpasswd")       # "../etc/passwd"

# Double-encoded (two layers)
decode_evasion("%252e%252e%252fetc%252fpasswd")  # "../../etc/passwd"

# HTML entities
decode_evasion("&lt;script&gt;")              # "<script>"

# Hex escapes
decode_evasion("\\x41\\x42\\x43")             # "ABC"

# Compose with clean()
raw = "%252e%252e%252fetc%252fpasswd"
clean(decode_evasion(raw), escaper=path_escaper)  # "etc/passwd"

# No-op when max_layers <= 0
decode_evasion("%41", max_layers=0)            # "%41"

detect_scripts(text)

Return the set of script buckets present in a string.

Parameters:

Parameter Type Description
text str The string to analyze

Returns: set[str] --- script bucket names found in the text.

Buckets:

Bucket Covers
latin Latin script characters
cyrillic Cyrillic script characters
greek Greek script characters
arabic Arabic script characters
hebrew Hebrew script characters
armenian Armenian script characters
cherokee Cherokee script characters
cjk CJK Unified, Hiragana, Katakana, and Hangul

Only the listed buckets are returned. Characters whose Unicode name doesn't match any known prefix are silently ignored. Non-alphabetic characters (digits, punctuation, emoji) are skipped.

Examples:

from navi_sanitize import detect_scripts

detect_scripts("hello world")   # {"latin"}
detect_scripts("Привет")        # {"cyrillic"}
detect_scripts("pаypal.com")   # {"latin", "cyrillic"} — Cyrillic а
detect_scripts("12345!@#")      # set() — no alphabetic chars
detect_scripts("")              # set()

is_mixed_script(text)

Return True if the text contains characters from two or more scripts.

Parameters:

Parameter Type Description
text str The string to check

Returns: bool --- True when 2+ script buckets are detected.

Non-alphabetic characters (digits, punctuation, emoji) are not counted, so "hello 123" is not considered mixed.

Examples:

from navi_sanitize import is_mixed_script

is_mixed_script("hello world")   # False — Latin only
is_mixed_script("pаypal.com")   # True — Latin + Cyrillic
is_mixed_script("Ꭺdmin")        # True — Cherokee + Latin
is_mixed_script("12345")         # False — no alphabetic chars