Character Reference
Complete tables of all characters handled by navi-sanitize's pipeline.
Invisible Characters (492 total)
Zero-Width Characters (8)
| Codepoint |
Name |
Why Stripped |
| U+200B |
Zero-width space |
Invisible word boundary; breaks token matching |
| U+200C |
Zero-width non-joiner |
Invisible modifier; evasion in keyword filters |
| U+200D |
Zero-width joiner |
Invisible modifier; used in emoji but exploited in text |
| U+200E |
Left-to-right mark |
Invisible directional control |
| U+200F |
Right-to-left mark |
Invisible directional control |
| U+2060 |
Word joiner |
Invisible no-break hint; evasion vector |
| U+FEFF |
BOM / zero-width no-break space |
Byte order mark; invisible at start or middle of string |
| U+180E |
Mongolian vowel separator |
Invisible separator; historically used for evasion |
| Codepoint |
Name |
Why Stripped |
| U+00AD |
Soft hyphen |
Invisible in most renderers; breaks word matching |
| U+034F |
Combining grapheme joiner |
Invisible modifier between characters |
| U+2009 |
Thin space |
Near-invisible space variant |
| U+200A |
Hair space |
Near-invisible space variant (thinnest) |
| U+2028 |
Line separator |
Invisible line break; XSS vector in JavaScript |
| U+2029 |
Paragraph separator |
Invisible paragraph break; XSS vector in JavaScript |
| U+FFF9 |
Interlinear annotation anchor |
Invisible annotation control |
| U+FFFA |
Interlinear annotation separator |
Invisible annotation control |
| U+FFFB |
Interlinear annotation terminator |
Invisible annotation control |
| U+FFFC |
Object replacement character |
Placeholder for embedded object; invisible in text |
| U+2061 |
Function application |
Invisible math operator |
| U+2062 |
Invisible times |
Invisible math operator |
| U+2063 |
Invisible separator |
Invisible math operator; hides keywords |
| U+2064 |
Invisible plus |
Invisible math operator |
| U+206A |
Inhibit symmetric swapping |
Deprecated format control |
| U+206B |
Activate symmetric swapping |
Deprecated format control |
| U+206C |
Inhibit Arabic form shaping |
Deprecated format control |
| U+206D |
Activate Arabic form shaping |
Deprecated format control |
| U+206E |
National digit shapes |
Deprecated format control |
| U+206F |
Nominal digit shapes |
Deprecated format control |
| U+2800 |
Braille pattern blank |
Renders as empty space |
| U+1680 |
Ogham space mark |
Invisible space variant |
| U+115F |
Hangul Choseong filler |
Invisible hangul filler |
| U+1160 |
Hangul Jungseong filler |
Invisible hangul filler |
| U+3164 |
Hangul filler |
NFKC-normalizes to U+1160; pre-stripped |
| U+FFA0 |
Halfwidth hangul filler |
NFKC-normalizes to U+1160; pre-stripped |
| U+061C |
Arabic letter mark |
Invisible directional mark |
Mongolian Free Variation Selectors (4)
| Codepoint |
Name |
Why Stripped |
| U+180B |
Mongolian FVS one |
Invisible glyph modifier; functionally identical to VS1--VS16 |
| U+180C |
Mongolian FVS two |
Invisible glyph modifier |
| U+180D |
Mongolian FVS three |
Invisible glyph modifier |
| U+180F |
Mongolian FVS four |
Invisible glyph modifier |
Variation Selectors (16)
| Range |
Name |
Why Stripped |
| U+FE00--U+FE0F |
VS1--VS16 |
Invisible glyph presentation modifiers; alter character display without visible change in most contexts |
Unicode Tag Block (128)
| Range |
Name |
Why Stripped |
| U+E0000 |
Language tag (deprecated) |
Deprecated; invisible |
| U+E0001--U+E007F |
Tag characters |
Invisible ASCII encoding; used in tag smuggling attacks where tokenizers read content that humans cannot see |
Variation Selector Supplement (240)
| Range |
Name |
Why Stripped |
| U+E0100--U+E01EF |
VS17--VS256 |
Extended invisible glyph modifiers for CJK ideographic variation |
Bidirectional Control Characters (9)
| Codepoint |
Name |
Why Stripped |
| U+202A |
Left-to-right embedding |
Reorders text display to hide content |
| U+202B |
Right-to-left embedding |
Reorders text display to hide content |
| U+202C |
Pop directional formatting |
Terminates bidi override |
| U+202D |
Left-to-right override |
Forces display direction; hides malicious content |
| U+202E |
Right-to-left override |
Reverses displayed text order |
| U+2066 |
Left-to-right isolate |
Isolates text direction; evasion in mixed content |
| U+2067 |
Right-to-left isolate |
Isolates text direction; evasion in mixed content |
| U+2068 |
First strong isolate |
Auto-detects direction; unpredictable display |
| U+2069 |
Pop directional isolate |
Terminates bidi isolate |
C0 Control Characters (28)
| Range |
Why Stripped |
| U+0001--U+0008 |
Includes BEL (U+0007), BS (U+0008, terminal overwrite attack) |
| U+000B--U+000C |
Vertical tab, form feed |
| U+000E--U+001F |
Includes ESC (U+001B, ANSI injection); excludes TAB (U+0009), LF (U+000A), CR (U+000D) |
C1 Control Characters (32)
| Range |
Why Stripped |
| U+0080--U+009F |
Includes CSI (U+009B, equivalent to ESC+[) and NEL (U+0085); invisible in all modern contexts |
Homoglyph Map (66 pairs)
Cyrillic to Latin --- Lowercase (11 pairs)
| Cyrillic |
Codepoint |
Latin |
Visual |
| а |
U+0430 |
a |
а → a |
| е |
U+0435 |
e |
е → e |
| о |
U+043E |
o |
о → o |
| р |
U+0440 |
p |
р → p |
| с |
U+0441 |
c |
с → c |
| у |
U+0443 |
y |
у → y |
| х |
U+0445 |
x |
х → x |
| і |
U+0456 |
i |
і → i |
| ѕ |
U+0455 |
s |
ѕ → s |
| ј |
U+0458 |
j |
ј → j |
| һ |
U+04BB |
h |
һ → h |
Cyrillic to Latin --- Uppercase (14 pairs)
| Cyrillic |
Codepoint |
Latin |
Visual |
| А |
U+0410 |
A |
А → A |
| В |
U+0412 |
B |
В → B |
| Е |
U+0415 |
E |
Е → E |
| К |
U+041A |
K |
К → K |
| М |
U+041C |
M |
М → M |
| Н |
U+041D |
H |
Н → H |
| О |
U+041E |
O |
О → O |
| Р |
U+0420 |
P |
Р → P |
| С |
U+0421 |
C |
С → C |
| Т |
U+0422 |
T |
Т → T |
| Х |
U+0425 |
X |
Х → X |
| І |
U+0406 |
I |
І → I |
| Ѕ |
U+0405 |
S |
Ѕ → S |
| Ј |
U+0408 |
J |
Ј → J |
Greek to Latin --- Uppercase (14 pairs)
| Greek |
Codepoint |
Latin |
Visual |
| Α |
U+0391 |
A |
Α → A |
| Β |
U+0392 |
B |
Β → B |
| Ε |
U+0395 |
E |
Ε → E |
| Ζ |
U+0396 |
Z |
Ζ → Z |
| Η |
U+0397 |
H |
Η → H |
| Ι |
U+0399 |
I |
Ι → I |
| Κ |
U+039A |
K |
Κ → K |
| Μ |
U+039C |
M |
Μ → M |
| Ν |
U+039D |
N |
Ν → N |
| Ο |
U+039F |
O |
Ο → O |
| Ρ |
U+03A1 |
P |
Ρ → P |
| Τ |
U+03A4 |
T |
Τ → T |
| Υ |
U+03A5 |
Y |
Υ → Y |
| Χ |
U+03A7 |
X |
Χ → X |
Greek to Latin --- Lowercase (11 pairs)
| Greek |
Codepoint |
Latin |
Visual |
| α |
U+03B1 |
a |
α → a |
| γ |
U+03B3 |
y |
γ → y |
| ι |
U+03B9 |
i |
ι → i |
| κ |
U+03BA |
k |
κ → k |
| μ |
U+03BC |
u |
μ → u |
| ν |
U+03BD |
v |
ν → v |
| ο |
U+03BF |
o |
ο → o |
| ρ |
U+03C1 |
p |
ρ → p |
| τ |
U+03C4 |
t |
τ → t |
| υ |
U+03C5 |
u |
υ → u |
| ω |
U+03C9 |
w |
ω → w |
Armenian to Latin (2 pairs)
| Armenian |
Codepoint |
Latin |
Visual |
| Օ |
U+0555 |
O |
Օ → O |
| Ս |
U+054D |
S |
Ս → S |
Cherokee to Latin (1 pair)
| Cherokee |
Codepoint |
Latin |
Visual |
| Ꭺ |
U+13AA |
A |
Ꭺ → A |
Cyrillic Extended to Latin (5 pairs)
| Cyrillic |
Codepoint |
Latin |
Visual |
| Ӏ |
U+04C0 |
I |
Ӏ → I |
| ӏ |
U+04CF |
l |
ӏ → l |
| ԁ |
U+0501 |
d |
ԁ → d |
| ԛ |
U+051B |
q |
ԛ → q |
| ԝ |
U+051D |
w |
ԝ → w |
Latin Extended to Latin (1 pair)
| Character |
Codepoint |
Latin |
Visual |
| ı |
U+0131 |
i |
ı → i |
Typographic Replacements (7 pairs)
| Character |
Codepoint |
Name |
Replacement |
| − |
U+2212 |
Minus sign |
- (hyphen) |
| – |
U+2013 |
En dash |
- (hyphen) |
| — |
U+2014 |
Em dash |
- (hyphen) |
| ' |
U+2018 |
Left single quote |
' (apostrophe) |
| ' |
U+2019 |
Right single quote |
' (apostrophe) |
| " |
U+201C |
Left double quote |
" (straight quote) |
| " |
U+201D |
Right double quote |
" (straight quote) |
NFKC Normalization Examples
NFKC (Stage 3) normalizes compatibility forms before homoglyph replacement. Common transformations:
| Input |
Codepoint(s) |
Output |
Category |
| a |
U+FF41 |
a |
Fullwidth lowercase |
| A |
U+FF21 |
A |
Fullwidth uppercase |
| 0 |
U+FF10 |
0 |
Fullwidth digit |
| fi |
U+FB01 |
fi |
Ligature |
| ff |
U+FB00 |
ff |
Ligature |
| ² |
U+00B2 |
2 |
Superscript |
| ⁿ |
U+207F |
n |
Superscript |
| ① |
U+2460 |
1 |
Circled |
| 𝑎 |
U+1D44E |
a |
Math italic |
| 𝐀 |
U+1D400 |
A |
Math bold |