Skip to content

Character Reference

Complete tables of all characters handled by navi-sanitize's pipeline.

Invisible Characters (492 total)

Zero-Width Characters (8)

Codepoint Name Why Stripped
U+200B Zero-width space Invisible word boundary; breaks token matching
U+200C Zero-width non-joiner Invisible modifier; evasion in keyword filters
U+200D Zero-width joiner Invisible modifier; used in emoji but exploited in text
U+200E Left-to-right mark Invisible directional control
U+200F Right-to-left mark Invisible directional control
U+2060 Word joiner Invisible no-break hint; evasion vector
U+FEFF BOM / zero-width no-break space Byte order mark; invisible at start or middle of string
U+180E Mongolian vowel separator Invisible separator; historically used for evasion

Format and Control Characters (27)

Codepoint Name Why Stripped
U+00AD Soft hyphen Invisible in most renderers; breaks word matching
U+034F Combining grapheme joiner Invisible modifier between characters
U+2009 Thin space Near-invisible space variant
U+200A Hair space Near-invisible space variant (thinnest)
U+2028 Line separator Invisible line break; XSS vector in JavaScript
U+2029 Paragraph separator Invisible paragraph break; XSS vector in JavaScript
U+FFF9 Interlinear annotation anchor Invisible annotation control
U+FFFA Interlinear annotation separator Invisible annotation control
U+FFFB Interlinear annotation terminator Invisible annotation control
U+FFFC Object replacement character Placeholder for embedded object; invisible in text
U+2061 Function application Invisible math operator
U+2062 Invisible times Invisible math operator
U+2063 Invisible separator Invisible math operator; hides keywords
U+2064 Invisible plus Invisible math operator
U+206A Inhibit symmetric swapping Deprecated format control
U+206B Activate symmetric swapping Deprecated format control
U+206C Inhibit Arabic form shaping Deprecated format control
U+206D Activate Arabic form shaping Deprecated format control
U+206E National digit shapes Deprecated format control
U+206F Nominal digit shapes Deprecated format control
U+2800 Braille pattern blank Renders as empty space
U+1680 Ogham space mark Invisible space variant
U+115F Hangul Choseong filler Invisible hangul filler
U+1160 Hangul Jungseong filler Invisible hangul filler
U+3164 Hangul filler NFKC-normalizes to U+1160; pre-stripped
U+FFA0 Halfwidth hangul filler NFKC-normalizes to U+1160; pre-stripped
U+061C Arabic letter mark Invisible directional mark

Mongolian Free Variation Selectors (4)

Codepoint Name Why Stripped
U+180B Mongolian FVS one Invisible glyph modifier; functionally identical to VS1--VS16
U+180C Mongolian FVS two Invisible glyph modifier
U+180D Mongolian FVS three Invisible glyph modifier
U+180F Mongolian FVS four Invisible glyph modifier

Variation Selectors (16)

Range Name Why Stripped
U+FE00--U+FE0F VS1--VS16 Invisible glyph presentation modifiers; alter character display without visible change in most contexts

Unicode Tag Block (128)

Range Name Why Stripped
U+E0000 Language tag (deprecated) Deprecated; invisible
U+E0001--U+E007F Tag characters Invisible ASCII encoding; used in tag smuggling attacks where tokenizers read content that humans cannot see

Variation Selector Supplement (240)

Range Name Why Stripped
U+E0100--U+E01EF VS17--VS256 Extended invisible glyph modifiers for CJK ideographic variation

Bidirectional Control Characters (9)

Codepoint Name Why Stripped
U+202A Left-to-right embedding Reorders text display to hide content
U+202B Right-to-left embedding Reorders text display to hide content
U+202C Pop directional formatting Terminates bidi override
U+202D Left-to-right override Forces display direction; hides malicious content
U+202E Right-to-left override Reverses displayed text order
U+2066 Left-to-right isolate Isolates text direction; evasion in mixed content
U+2067 Right-to-left isolate Isolates text direction; evasion in mixed content
U+2068 First strong isolate Auto-detects direction; unpredictable display
U+2069 Pop directional isolate Terminates bidi isolate

C0 Control Characters (28)

Range Why Stripped
U+0001--U+0008 Includes BEL (U+0007), BS (U+0008, terminal overwrite attack)
U+000B--U+000C Vertical tab, form feed
U+000E--U+001F Includes ESC (U+001B, ANSI injection); excludes TAB (U+0009), LF (U+000A), CR (U+000D)

C1 Control Characters (32)

Range Why Stripped
U+0080--U+009F Includes CSI (U+009B, equivalent to ESC+[) and NEL (U+0085); invisible in all modern contexts

Homoglyph Map (66 pairs)

Cyrillic to Latin --- Lowercase (11 pairs)

Cyrillic Codepoint Latin Visual
а U+0430 a а → a
е U+0435 e е → e
о U+043E o о → o
р U+0440 p р → p
с U+0441 c с → c
у U+0443 y у → y
х U+0445 x х → x
і U+0456 i і → i
ѕ U+0455 s ѕ → s
ј U+0458 j ј → j
һ U+04BB h һ → h

Cyrillic to Latin --- Uppercase (14 pairs)

Cyrillic Codepoint Latin Visual
А U+0410 A А → A
В U+0412 B В → B
Е U+0415 E Е → E
К U+041A K К → K
М U+041C M М → M
Н U+041D H Н → H
О U+041E O О → O
Р U+0420 P Р → P
С U+0421 C С → C
Т U+0422 T Т → T
Х U+0425 X Х → X
І U+0406 I І → I
Ѕ U+0405 S Ѕ → S
Ј U+0408 J Ј → J

Greek to Latin --- Uppercase (14 pairs)

Greek Codepoint Latin Visual
Α U+0391 A Α → A
Β U+0392 B Β → B
Ε U+0395 E Ε → E
Ζ U+0396 Z Ζ → Z
Η U+0397 H Η → H
Ι U+0399 I Ι → I
Κ U+039A K Κ → K
Μ U+039C M Μ → M
Ν U+039D N Ν → N
Ο U+039F O Ο → O
Ρ U+03A1 P Ρ → P
Τ U+03A4 T Τ → T
Υ U+03A5 Y Υ → Y
Χ U+03A7 X Χ → X

Greek to Latin --- Lowercase (11 pairs)

Greek Codepoint Latin Visual
α U+03B1 a α → a
γ U+03B3 y γ → y
ι U+03B9 i ι → i
κ U+03BA k κ → k
μ U+03BC u μ → u
ν U+03BD v ν → v
ο U+03BF o ο → o
ρ U+03C1 p ρ → p
τ U+03C4 t τ → t
υ U+03C5 u υ → u
ω U+03C9 w ω → w

Armenian to Latin (2 pairs)

Armenian Codepoint Latin Visual
Օ U+0555 O Օ → O
Ս U+054D S Ս → S

Cherokee to Latin (1 pair)

Cherokee Codepoint Latin Visual
U+13AA A Ꭺ → A

Cyrillic Extended to Latin (5 pairs)

Cyrillic Codepoint Latin Visual
Ӏ U+04C0 I Ӏ → I
ӏ U+04CF l ӏ → l
ԁ U+0501 d ԁ → d
ԛ U+051B q ԛ → q
ԝ U+051D w ԝ → w

Latin Extended to Latin (1 pair)

Character Codepoint Latin Visual
ı U+0131 i ı → i

Typographic Replacements (7 pairs)

Character Codepoint Name Replacement
U+2212 Minus sign - (hyphen)
U+2013 En dash - (hyphen)
U+2014 Em dash - (hyphen)
' U+2018 Left single quote ' (apostrophe)
' U+2019 Right single quote ' (apostrophe)
" U+201C Left double quote " (straight quote)
" U+201D Right double quote " (straight quote)

NFKC Normalization Examples

NFKC (Stage 3) normalizes compatibility forms before homoglyph replacement. Common transformations:

Input Codepoint(s) Output Category
U+FF41 a Fullwidth lowercase
U+FF21 A Fullwidth uppercase
U+FF10 0 Fullwidth digit
U+FB01 fi Ligature
U+FB00 ff Ligature
² U+00B2 2 Superscript
U+207F n Superscript
U+2460 1 Circled
𝑎 U+1D44E a Math italic
𝐀 U+1D400 A Math bold