Comparison with Other Tools¶
navi-sanitize is the only library that combines invisible character stripping, homoglyph replacement, NFKC normalization, and pluggable escaping in a single zero-dependency pipeline. This page explains how it relates to existing tools.
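To make the four stages concrete, here is an illustrative stdlib-only sketch of the same pipeline shape. The `sketch_clean` name, the one-pair homoglyph map, and the small invisible-character set are hypothetical simplifications for this page, not the library's actual implementation or ordering:

```python
import re
import unicodedata

# Hypothetical one-pair map for illustration; the library curates 66 pairs.
HOMOGLYPHS = str.maketrans({"\u0430": "a"})  # Cyrillic 'а' -> Latin 'a'

# A small subset of zero-width and bidi control characters.
INVISIBLE = re.compile(r"[\u200b-\u200d\u202a-\u202e\ufeff]")

def sketch_clean(text: str) -> str:
    text = text.replace("\x00", "")             # drop null bytes
    text = INVISIBLE.sub("", text)              # strip invisible characters
    text = unicodedata.normalize("NFKC", text)  # compatibility normalization
    return text.translate(HOMOGLYPHS)           # replace homoglyphs

print(sketch_clean("p\u0430ypal\u200b.com"))  # -> paypal.com
```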
Overview¶
| | navi-sanitize | Unidecode / anyascii | confusable_homoglyphs | ftfy | MarkupSafe / nh3 | pydantic |
|---|---|---|---|---|---|---|
| Purpose | Security sanitization | ASCII transliteration | Homoglyph detection | Encoding repair | HTML escaping | Schema validation |
| Invisible chars | Strips 492 (bidi, tag block, ZW, VS, C0/C1) | Incidental | No | Partial (preserves bidi, ZW, VS) | No | No |
| Homoglyphs | Replaces 66 curated pairs | Transliterates all non-ASCII | Detects only | No | No | No |
| NFKC | Yes | No | No | NFC (NFKC optional) | No | No |
| Null bytes | Yes | No | No | No | No | No |
| Preserves Unicode | Yes | No | Yes | Yes | Yes | Yes |
| Pluggable escaper | Yes | No | No | No | N/A | N/A |
| Dependencies | Zero | Zero | Zero | wcwidth | C ext / Rust ext | Rust ext |
Detailed Comparisons¶
vs. Unidecode / anyascii¶
What they do: Transliterate all Unicode to ASCII via lookup tables. Every non-ASCII code point maps to an ASCII string.
Why they're different: Transliteration destroys content. Unidecode turns Chinese characters into pinyin, Cyrillic sentences into romanized gibberish, and Arabic into Latin approximations. It's designed for slug generation, not security.
navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, emoji, and non-confusable Cyrillic pass through unchanged.
| Input | navi-sanitize | Unidecode |
|---|---|---|
| pаypal.com (Cyrillic а) | paypal.com | paypal.com |
| “hello” (smart quotes) | "hello" | "hello" |
| 안녕하세요 (Korean) | 안녕하세요 | annyeonghaseyo |
| 漢字 (CJK) | 漢字 | Han Zi |
| hello 🎉 (emoji) | hello 🎉 | hello |
vs. confusable_homoglyphs¶
What it does: Detects confusable characters using the Unicode Consortium's official Confusables.txt dataset. Returns analysis of which characters are confusable and from which scripts.
Why it's different: Detection-only API --- it tells you what's confusable but doesn't replace anything. You'd need to build your own replacement layer on top. The library is also archived upstream.
navi-sanitize's 66-pair map is intentionally curated for the highest-risk Latin lookalikes (Cyrillic, Greek, Armenian, Cherokee, Latin Extended, and typographic) rather than the full Unicode confusables set. This keeps the pipeline fast and avoids false positives from low-risk script pairs.
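A detection-style result of the kind confusable_homoglyphs returns can be roughly approximated with the standard library. This is a crude sketch: real detection uses the Unicode Scripts.txt data, whereas the hypothetical `scripts_in` below just takes the first word of each character's Unicode name:

```python
import unicodedata

def scripts_in(text: str) -> set[str]:
    # Crude script tagging via Unicode character names; real detection
    # (as in confusable_homoglyphs) uses the Unicode Scripts.txt data.
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in text if ch.isalpha()}

print(scripts_in("p\u0430ypal"))  # mixed: {'LATIN', 'CYRILLIC'}
```

A mixed-script result like this flags a likely homoglyph attack, but the application still has to decide what to do about it; navi-sanitize replaces the character instead.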
vs. ftfy¶
What it does: Fixes encoding corruption (mojibake), HTML entities in wrong contexts, curly-quote issues, and C1 control characters. Applies NFC normalization by default.
Why it's different: ftfy and navi-sanitize solve different problems. ftfy repairs accidental encoding damage; navi-sanitize strips intentional evasion vectors.
Critically, ftfy explicitly preserves bidi overrides (U+202A-U+202E), zero-width characters (U+200B-U+200D), variation selectors, and Tag block characters. Its design philosophy is "don't remove characters that might be intentional." navi-sanitize's design philosophy is "remove characters that can be weaponized."
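The practical consequence of that difference can be sketched with the stdlib. The character class below (bidi overrides plus the related isolates and directional marks) and the `strip_bidi` helper are illustrative, not the library's internal code:

```python
import re

# Bidi overrides (U+202A-U+202E) plus isolates and directional marks.
BIDI_CONTROLS = re.compile(r"[\u202a-\u202e\u2066-\u2069\u200e\u200f\u061c]")

def strip_bidi(text: str) -> str:
    return BIDI_CONTROLS.sub("", text)

# A classic right-to-left-override trick: this filename *displays* roughly
# as "photo_sj.png" in bidi-aware UIs, but actually ends in ".js".
spoofed = "photo_\u202egnp.js"
print(strip_bidi(spoofed))  # -> photo_gnp.js (override removed)
```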
They compose well: Run ftfy first to fix mojibake, then navi-sanitize to strip evasion vectors.
```python
import ftfy
from navi_sanitize import clean

text = ftfy.fix_text(raw_input)  # Fix encoding corruption
text = clean(text)               # Strip evasion vectors
```
vs. MarkupSafe / nh3 / bleach¶
What they do: HTML escaping (MarkupSafe) and HTML tag/attribute sanitization (nh3, bleach).
Why they're different: These operate at the HTML structure layer --- escaping <, >, & and stripping dangerous tags/attributes. navi-sanitize operates at the character content layer --- normalizing the text inside those HTML elements.
They address orthogonal attack surfaces and compose naturally:
```python
from markupsafe import escape
from navi_sanitize import clean

# navi-sanitize normalizes characters, MarkupSafe escapes HTML
safe_html = escape(clean(user_input))
```
Note: bleach is deprecated (html5lib dependency unmaintained). Use nh3 for HTML sanitization.
vs. pydantic / cerberus¶
What they do: Schema validation and data coercion for Python data structures.
Why they're different: Validation frameworks check format constraints (types, lengths, patterns) but don't sanitize character-level content. Pydantic's strip_whitespace strips leading/trailing whitespace but doesn't touch invisible Unicode, homoglyphs, or null bytes.
navi-sanitize is designed to plug into these frameworks:
```python
from typing import Annotated

from pydantic import AfterValidator, BaseModel

from navi_sanitize import clean

SanitizedStr = Annotated[str, AfterValidator(clean)]

class UserInput(BaseModel):
    name: SanitizedStr
    bio: SanitizedStr
```
vs. python-slugify¶
What it does: Converts Unicode text to URL-safe ASCII slugs via transliteration, lowercasing, and non-alphanumeric stripping.
Why it's different: Destructive by design --- produces ASCII slugs, not sanitized original text. Its purpose is URL generation, not security sanitization. Also depends on text-unidecode.
What navi-sanitize Does Not Replace¶
- HTML sanitizers (nh3, DOMPurify) --- use these for HTML tag/attribute filtering
- SQL parameterization --- use prepared statements, not string escaping
- URL validation --- use `validators` or framework URL parsing
- Encoding repair --- use ftfy for mojibake and encoding corruption
- Full transliteration --- use Unidecode/anyascii when you need ASCII-only output
Design Choices¶
Curated homoglyph map vs. full Unicode confusables¶
The Unicode Consortium's Confusables.txt contains thousands of pairs across many scripts. navi-sanitize uses a curated 66-pair subset focused on:
- Highest visual similarity --- characters that are pixel-identical in common fonts
- Most commonly weaponized --- Cyrillic/Greek-to-Latin pairs used in phishing and filter bypass
- Typographic normalization --- smart quotes, em/en dashes, minus signs
This preserves legitimate Unicode while covering the attack surface that matters in practice.
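The typographic entries exist because NFKC alone does not cover them: smart quotes, em dashes, and the minus sign have no compatibility decompositions. The three-entry `TYPOGRAPHIC` map below is a hypothetical subset for illustration:

```python
import unicodedata

# NFKC leaves smart quotes untouched; they have no compatibility mapping.
assert unicodedata.normalize("NFKC", "\u201chello\u201d") == "\u201chello\u201d"

# So typographic pairs need an explicit map (illustrative three-entry subset).
TYPOGRAPHIC = str.maketrans({
    "\u201c": '"', "\u201d": '"',  # left/right double quotation marks
    "\u2014": "-",                 # em dash
    "\u2212": "-",                 # minus sign
})

print("\u201chello\u201d \u2014 ok".translate(TYPOGRAPHIC))  # -> "hello" - ok
```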
Strip vs. detect¶
Libraries like confusable_homoglyphs provide detection APIs that return analysis of what's confusable. This is useful for research and alerting but requires application code to decide what to do.
navi-sanitize takes an opinionated approach: confusable characters are replaced, invisible characters are stripped, and the pipeline always produces clean output. The application never needs to handle "this input might be dangerous" --- it's already been fixed.
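The two API shapes can be contrasted in a few lines. Both functions here are hypothetical, using a deliberately tiny two-character invisible set, purely to show the difference in contract:

```python
ZERO_WIDTH = {"\u200b", "\u200c"}  # tiny illustrative set

def detect_invisible(text: str) -> list[tuple[int, str]]:
    # Detection-style: report findings, leave the decision to the caller.
    return [(i, ch) for i, ch in enumerate(text) if ch in ZERO_WIDTH]

def clean_invisible(text: str) -> str:
    # Opinionated-style: always return cleaned output, no decision needed.
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

print(detect_invisible("a\u200bb"))  # -> [(1, '\u200b')]
print(clean_invisible("a\u200bb"))   # -> ab
```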
Zero dependencies¶
navi-sanitize uses only the Python standard library. This means:
- No supply chain risk from transitive dependencies
- No C extensions to compile (pure Python, runs anywhere CPython does)
- No version conflicts with other packages
- Trivial to vendor into locked-down environments