Question 1

What is Unicode normalization?

Accepted Answer

Unicode normalization converts text to a standard form so that equivalent text always uses the same sequence of code points. The same character can be represented in more than one way: for example, é can be a single precomposed code point (U+00E9, the form NFC produces) or a base letter followed by a combining accent (U+0065 U+0301, the form NFD produces). Normalization resolves these inconsistencies.
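
As a minimal sketch of the two representations, using Python's standard `unicodedata` module (the variable names are only illustrative):

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Convert each representation into the other form.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))  # 1 2
print(nfc == composed)                 # True
print(nfd == decomposed)               # True
```

Both strings render as the same glyph; only the underlying code point sequence differs.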

Question 2

Which normalization form should I use?

Accepted Answer

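NFC is the most common and is recommended for web content and APIs. NFD is useful for text processing that needs to inspect combining marks. NFKC and NFKD also normalize compatibility characters (e.g. ligatures, superscripts). That mapping is lossy (a superscript ² becomes a plain 2), so the compatibility forms are better suited to search keys and comparison than to stored text.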
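
To illustrate why NFD helps when inspecting combining marks, here is a small sketch. `describe` is a hypothetical helper, not part of any library; it relies only on `unicodedata.name` and `unicodedata.combining` from the Python standard library.

```python
import unicodedata

def describe(text):
    """Return (character name, combining class) for each code point of the NFD form."""
    decomposed = unicodedata.normalize("NFD", text)
    return [
        (unicodedata.name(ch, "<unnamed>"), unicodedata.combining(ch))
        for ch in decomposed
    ]

for name, combining_class in describe("\u00e9"):
    print(name, combining_class)
# LATIN SMALL LETTER E 0
# COMBINING ACUTE ACCENT 230
```

A combining class of 0 marks a base character; non-zero values mark combining marks attached to it.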

Question 3

Why would the same text look identical but fail a string equality check?

Accepted Answer

The same visible character can be encoded as different Unicode code point sequences. For example, 'é' can be stored as a single precomposed character (U+00E9, NFC form) or as 'e' plus a combining acute accent (U+0065 U+0301, NFD form). These two sequences are visually identical but not byte-equal. This causes string comparison failures in databases, file systems, and search indexes when inputs come from different systems. Normalizing both strings to the same form (NFC is the recommended standard) before comparison or storage prevents these invisible mismatches.
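
A short sketch of the mismatch and the fix, assuming NFC as the shared form. `nfc_equal` is a hypothetical helper name chosen for this example.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

s1 = "caf\u00e9"    # precomposed é
s2 = "cafe\u0301"   # e + combining acute accent

print(s1 == s2)            # False
print(s1.encode("utf-8"))  # b'caf\xc3\xa9'
print(s2.encode("utf-8"))  # b'cafe\xcc\x81'
print(nfc_equal(s1, s2))   # True
```

Normalizing once at the point where text enters the system is usually cheaper than normalizing on every comparison.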

Question 4

What is the difference between NFC, NFD, NFKC, and NFKD normalization?

Accepted Answer

NFC (canonical decomposition followed by canonical composition) is the most common form and produces precomposed characters where they exist. NFD (canonical decomposition) splits characters into a base character plus combining marks, which is useful for accent stripping. NFKC (compatibility decomposition followed by canonical composition) additionally normalizes compatibility variants: ﬁ ligature → fi, ½ → 1⁄2 (with U+2044 FRACTION SLASH, not the ASCII slash), ² → 2, ™ → TM, fullwidth Ａ → A. NFKD is compatibility decomposition without recomposition. For general text storage and APIs, use NFC. For search and comparison where font variants should match, use NFKC.
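
The differences can be checked with a short sketch. The sample characters mirror the ones listed above; `strip_accents` is a hypothetical helper showing one common way to do NFD-based accent stripping, not the only one.

```python
import unicodedata

samples = {
    "fi ligature": "\ufb01",
    "one half": "\u00bd",
    "superscript two": "\u00b2",
    "trade mark": "\u2122",
    "fullwidth A": "\uff21",
}

for label, ch in samples.items():
    nfc = unicodedata.normalize("NFC", ch)    # canonical only: left unchanged
    nfkc = unicodedata.normalize("NFKC", ch)  # compatibility mapping applied
    print(label, nfc == ch, [f"U+{ord(c):04X}" for c in nfkc])
# fi ligature True ['U+0066', 'U+0069']
# one half True ['U+0031', 'U+2044', 'U+0032']
# superscript two True ['U+0032']
# trade mark True ['U+0054', 'U+004D']
# fullwidth A True ['U+0041']

def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

print(strip_accents("caf\u00e9"))  # cafe
```

Note that NFC leaves every sample untouched, which is why it is safe for storage, while NFKC rewrites all of them.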

Question 5

Which normalization form should I use for database storage and API inputs?

Accepted Answer

NFC is the recommended default for most web and database contexts. The W3C recommends NFC for HTML, XML, and web content. The macOS HFS+ file system normalizes file names to a variant of NFD, while Windows NTFS and Linux ext4 store names as-is (whatever form the application provides). MySQL and PostgreSQL store text as provided without normalizing it, so two entries that differ only in normalization form can coexist as duplicates. A best practice for any application that accepts text input is to normalize all strings to NFC before storing them in the database.
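
One way to apply this at the input boundary is sketched below. `normalize_for_storage` is a hypothetical function name, and the example assumes Python 3.8 or later, where `unicodedata.is_normalized` is available.

```python
import unicodedata

def normalize_for_storage(text: str) -> str:
    """Return the NFC form, skipping the conversion when text is already NFC."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

# The same name submitted by two clients in different forms.
incoming = ["Ame\u0301lie", "Am\u00e9lie"]
stored = {normalize_for_storage(name) for name in incoming}

print(len(set(incoming)))  # 2 distinct strings before normalization
print(len(stored))         # 1 after normalization
```

Calling a function like this wherever text enters the application (form handlers, API deserializers, import jobs) keeps unnormalized variants out of unique indexes.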
