Why This Matters

Untrusted text carries invisible attacks. Homoglyph substitution, zero-width characters, null bytes, fullwidth encoding, and template injection delimiters bypass validation, poison templates, fool humans, and evade detection. navi-sanitize removes these vectors before text reaches your application.

Use Cases

LLM Prompt Pipelines

User input flows into system prompts, RAG context, and tool calls. The attack surface is broad:

  • Tag block smuggling --- Unicode Tag characters (U+E0001-U+E007F) encode invisible ASCII that tokenizers read but humans can't see. An attacker can embed "ignore previous instructions" in text that appears blank.
  • Homoglyph filter bypass --- Keyword blocklists checking for "system" miss "ѕуѕtеm" (Cyrillic ѕ, у, е).
  • Zero-width injection --- Invisible characters inserted between tokens alter how models parse input without visible change.

navi-sanitize strips all of these before text reaches the model. The pluggable escaper lets you add vendor-specific prompt escaping on top.

from navi_sanitize import clean

# Strip invisible prompt injection before sending to LLM
sanitized_input = clean(user_message)

# With a custom prompt escaper
clean(user_message, escaper=my_prompt_escaper)
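
To make tag block smuggling concrete, the following standard-library sketch hides an instruction in Tag characters and then recovers it. It is an illustration only, not navi-sanitize's implementation; the helper names hide_in_tags, reveal_tags, and strip_tags are hypothetical.

```python
# Stdlib-only illustration; helper names are hypothetical, not part of navi-sanitize.
TAG_OFFSET = 0xE0000  # a Tag character is its ASCII counterpart plus 0xE0000


def hide_in_tags(payload: str) -> str:
    """Encode ASCII text as invisible Unicode Tag characters."""
    return "".join(chr(TAG_OFFSET + ord(ch)) for ch in payload)


def reveal_tags(text: str) -> str:
    """Decode Tag characters that mirror printable ASCII (U+E0020-U+E007E)."""
    return "".join(
        chr(ord(ch) - TAG_OFFSET) for ch in text if 0xE0020 <= ord(ch) <= 0xE007E
    )


def strip_tags(text: str) -> str:
    """Remove every code point in the range U+E0001-U+E007F."""
    return "".join(ch for ch in text if not 0xE0001 <= ord(ch) <= 0xE007F)


visible = "What is the weather today?"
smuggled = visible + hide_in_tags("ignore previous instructions")

print(len(visible), len(smuggled))  # the smuggled string is longer but looks the same
print(reveal_tags(smuggled))        # the hidden instruction
print(strip_tags(smuggled) == visible)
```

Most renderers draw nothing for the appended characters, so the two strings look the same on screen even though their lengths differ.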

Web Applications

Jinja2 SSTI, path traversal, and fullwidth encoding bypasses are well-known but tedious to cover manually:

  • Homoglyph-disguised SSTI --- {{ cоnfig }} uses Cyrillic о (U+043E). A naive Jinja2 escaper sees the ASCII {{ delimiters, but a check for the ASCII name config does not match the identifier inside them. navi-sanitize normalizes before escaping.
  • Fullwidth bypass --- \uff53\uff43\uff52\uff49\uff50\uff54 renders as fullwidth Latin letters (ｓｃｒｉｐｔ), but NFKC normalizes it to the ASCII string script, which may match dangerous patterns.
  • Path traversal + null byte --- "../\x00../../etc/passwd" combines null byte truncation with directory traversal. The pipeline removes null bytes first, then the path escaper strips traversal sequences.

from navi_sanitize import clean, jinja2_escaper, path_escaper

# Single call handles homoglyphs + template escaping
clean("{{ cоnfig }}", escaper=jinja2_escaper)

# Path sanitization after null byte removal
clean("../\x00../../etc/passwd", escaper=path_escaper)
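
Two of the points above can be checked with the standard library alone. The sketch below first shows NFKC folding fullwidth characters to ASCII, then shows why null bytes have to go before traversal stripping. strip_traversal and strip_nulls are naive, hypothetical stand-ins rather than the library's path_escaper, and the payload is a variant chosen for the demonstration.

```python
import unicodedata

# Fullwidth braces and letters fold to their ASCII counterparts under NFKC.
fullwidth = "\uff5b\uff5b \uff43\uff4f\uff4e\uff46\uff49\uff47 \uff5d\uff5d"
folded = unicodedata.normalize("NFKC", fullwidth)
print(folded)


def strip_traversal(path: str) -> str:
    """Naive stand-in for a path escaper: drop '../' until none remain."""
    while "../" in path:
        path = path.replace("../", "")
    return path


def strip_nulls(path: str) -> str:
    """Remove null bytes."""
    return path.replace("\x00", "")


# The null byte splits '../' so a traversal filter that runs first sees nothing.
payload = "..\x00/etc/passwd"
wrong_order = strip_nulls(strip_traversal(payload))
right_order = strip_traversal(strip_nulls(payload))
print(repr(wrong_order), repr(right_order))
```

Running the traversal filter first leaves a working ../ behind once the null byte is removed, which is the reason for the stage ordering described above.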

Config and Data Ingestion

YAML, TOML, and JSON parsed from untrusted sources can carry:

  • Null bytes that truncate C-extension string processing, causing key mismatches
  • Zero-width characters that break key matching ("api\u200b_key" looks like "api_key" but doesn't match)
  • Homoglyphs that create near-duplicate keys ("аpi_key" with Cyrillic а coexists with "api_key")

walk() sanitizes every string in a nested structure in one call:

import yaml  # PyYAML

from navi_sanitize import walk, jinja2_escaper

# Sanitize entire parsed config
config = walk(yaml.safe_load(untrusted_yaml))

# With escaper for template configs
config = walk(parsed_json, escaper=jinja2_escaper)
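
For intuition about what a recursive pass over parsed config involves, here is a minimal standard-library sketch. walk_strings and strip_invisible are hypothetical helpers, not the library's walk() or clean(); in particular, rewriting dictionary keys is an assumption of this sketch, and the set of characters removed is deliberately small.

```python
# Hypothetical helpers for illustration; not navi-sanitize's walk() or clean().
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def strip_invisible(text: str) -> str:
    """Remove a few zero-width characters and null bytes."""
    return text.translate(ZERO_WIDTH).replace("\x00", "")


def walk_strings(obj, fn):
    """Apply fn to every string in nested dicts, lists, and tuples."""
    if isinstance(obj, str):
        return fn(obj)
    if isinstance(obj, dict):
        # Assumption of this sketch: keys are rewritten as well as values.
        return {walk_strings(k, fn): walk_strings(v, fn) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(walk_strings(item, fn) for item in obj)
    return obj  # numbers, booleans, None pass through unchanged


parsed = {"api\u200b_key": "secret\x00", "servers": ["a\u200bb", 3]}
cleaned = walk_strings(parsed, strip_invisible)
print("api_key" in parsed, "api_key" in cleaned)
```

Before the pass, a lookup for "api_key" fails even though the key prints that way; afterwards it succeeds.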

Log Analysis and SIEM

Attackers embed invisible characters in log entries to evade detection:

  • Bidi overrides reorder displayed text, hiding malicious content from analysts reviewing logs visually
  • Zero-width characters break grep patterns ("admin" won't match "adm\u200bin")
  • Tag block characters encode invisible metadata that log aggregation tools may index but analysts can't see

Sanitizing log data on ingest ensures what you search is what's actually there:

from navi_sanitize import clean

def sanitize_log_entry(entry: str) -> str:
    return clean(entry)
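
The grep failure described above is easy to reproduce. The sketch below uses only the standard library and a blunt rule, dropping every code point in Unicode category Cf (which includes zero-width characters, bidi overrides, and Tag characters). That rule is an assumption made for the demonstration, not a description of what clean() removes, and strip_format_chars is a hypothetical name.

```python
import re
import unicodedata


def strip_format_chars(text: str) -> str:
    """Drop every code point in Unicode category Cf (format characters)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# A zero-width space inside "admin" and a right-to-left override before the file name.
entry = "login ok user=adm\u200bin file=\u202egpj.exe"

before = re.search(r"user=admin\b", entry)
after = re.search(r"user=admin\b", strip_format_chars(entry))
print(before is None, after is not None)
```

Dropping all of Cf also removes characters with legitimate uses, such as the soft hyphen, so a production rule would likely be narrower.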

Identity and Anti-Phishing

pаypal.com (Cyrillic а) renders identically to paypal.com in most fonts. This applies to:

  • Display names in messaging and email
  • URLs in link previews and bookmarks
  • Email addresses in sender fields
  • Usernames in registration systems

Homoglyph replacement normalizes these to catch spoofing that visual inspection misses:

from navi_sanitize import clean

# Normalize display name before showing to user
display_name = clean(untrusted_name)

# Flag URLs that change under normalization
if clean(submitted_url) != submitted_url:
    flag_as_suspicious(submitted_url)
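
To see why the spoofed domain defeats a plain string comparison, the standard-library sketch below reports which scripts the letters of a string come from. letter_scripts is a hypothetical helper that reads the first word of each character's Unicode name; it is a rough heuristic for illustration, not how navi-sanitize detects homoglyphs.

```python
import unicodedata


def letter_scripts(text: str) -> set[str]:
    """Return the first word of each letter's Unicode name, e.g. 'LATIN'."""
    return {
        unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        for ch in text
        if ch.isalpha()
    }


genuine = "paypal.com"
spoofed = "p\u0430ypal.com"  # U+0430 is CYRILLIC SMALL LETTER A

print(genuine == spoofed)
print(letter_scripts(genuine), letter_scripts(spoofed))
```

A mix of scripts inside a single label is a common signal worth flagging, although legitimate multilingual names can trigger it too.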

User-Generated Content and Search

When users can create content that others search:

  • Homoglyphs create duplicate entries that appear identical but don't match queries
  • Invisible characters cause full-text search misses
  • NFKC normalization ensures that "café" written with a precomposed é (U+00E9) and "café" written as e plus a combining acute accent (U+0301) are stored as the same string

Sanitize at the write boundary:

from navi_sanitize import clean

# Normalize before storing
record["title"] = clean(user_input)
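
The effect of normalizing at the write boundary can be seen with unicodedata alone. In this sketch normalize_title is a hypothetical stand-in that applies only NFKC, which is one part of what the text above attributes to clean().

```python
import unicodedata

precomposed = "caf\u00e9"  # é as a single code point
decomposed = "cafe\u0301"  # e followed by a combining acute accent


def normalize_title(title: str) -> str:
    """Hypothetical stand-in: apply NFKC only."""
    return unicodedata.normalize("NFKC", title)


# Index titles by their normalized form, as a write path might.
index: dict[str, list[str]] = {}
for title in (precomposed, decomposed):
    index.setdefault(normalize_title(title), []).append(title)

print(precomposed == decomposed, len(index))
```

Both spellings land under one key, so a query normalized the same way finds either record.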

Integration Points

navi-sanitize is designed to compose with existing tools:

| Layer | Tool | How They Compose |
|---|---|---|
| Validation | pydantic, cerberus | Call clean() in an AfterValidator or coercion chain |
| HTML escaping | MarkupSafe, nh3 | navi-sanitize normalizes characters; MarkupSafe escapes HTML special characters |
| Encoding repair | ftfy | Run ftfy first to fix mojibake, then navi-sanitize to strip evasion vectors |
| Template engines | Jinja2 | Use jinja2_escaper and Jinja2's autoescape=True for defense in depth |
| Web frameworks | Django, Flask, FastAPI | Sanitize at the request boundary in middleware or dependency injection |
| LLM frameworks | LangChain, LlamaIndex | Sanitize user input before it enters the prompt pipeline |
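
The order of composition matters for the HTML escaping row. The sketch below uses the standard library's html.escape as a stand-in for MarkupSafe and NFKC as a stand-in for navi-sanitize's normalization; both substitutions are assumptions made so the example runs without third-party packages.

```python
import html
import unicodedata

# Fullwidth less-than and greater-than signs around an ASCII tag name.
payload = "\uff1cscript\uff1e"

# Escaping first: nothing looks special yet, then normalization creates real markup.
escaped_first = unicodedata.normalize("NFKC", html.escape(payload))

# Normalizing first: the angle brackets exist by the time the escaper runs.
normalized_first = html.escape(unicodedata.normalize("NFKC", payload))

print(escaped_first, normalized_first)
```

Normalizing before escaping means the escaper sees the characters the browser would eventually see.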