UTF-8 Validator
Check if a byte sequence (in hex) is valid UTF-8 — for debugging file encoding. Decodes if valid.
UTF-8: the variable-length encoding that became the default of the web
UTF-8 is the dominant character encoding on the internet: by 2024 it was used by roughly 98% of all web pages. Designed by Ken Thompson and Rob Pike in 1992 and standardised as RFC 3629 (2003), it solves a hard compatibility problem: encode the entire Unicode catalogue (154,998 assigned characters as of Unicode 16.0, in a code space that runs up to U+10FFFF) while remaining byte-compatible with 7-bit ASCII. That backward compatibility is a large part of why most modern operating systems, browsers, databases and protocols treat UTF-8 as the safe default.
How UTF-8 encodes a code point in 1 to 4 bytes
UTF-8 is a variable-length encoding. A leading byte announces how many continuation bytes follow, then each continuation byte contributes 6 extra payload bits:
- 1 byte: 0xxxxxxx, code points U+0000 to U+007F (pure ASCII).
- 2 bytes: 110xxxxx 10xxxxxx, U+0080 to U+07FF (Latin extended, Greek, Cyrillic, Hebrew, Arabic).
- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx, U+0800 to U+FFFF (BMP: Chinese, Japanese, Korean ideographs, most living scripts).
- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, U+10000 to U+10FFFF (supplementary planes, emoji, historic scripts).
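The four bit layouts can be sketched as a small hand-written encoder. This is an illustration only: the function names encode_code_point and encode_text are made up for this example, and in real code Python's built-in str.encode("utf-8") does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative helper, not a library API)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def encode_text(text: str) -> bytes:
    """Encode a whole string by encoding each code point in turn."""
    return b"".join(encode_code_point(ord(ch)) for ch in text)


print(encode_text("Olá").hex(" ").upper())  # 4F 6C C3 A1
```

Each continuation byte carries 6 payload bits (the & 0x3F mask), and the lead byte carries whatever bits are left over after its length prefix.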
Continuation bytes always start with the bit pattern 10, which makes the byte stream self-synchronising: a parser can land anywhere and find the next start byte in at most three steps. Brazilian Portuguese sits in the 2-byte range — á = C3 A1, é = C3 A9, ç = C3 A7, ã = C3 A3.
"Olá" -> 4F 6C C3 A1 (3 chars, 4 bytes)
"日本" -> E6 97 A5 E6 9C AC (2 chars, 6 bytes)
"😀" -> F0 9F 98 80 (1 char, 4 bytes)
What makes a byte sequence invalid
A strict UTF-8 validator rejects six families of malformed sequences:
- Lone continuation byte: any 10xxxxxx without a preceding lead byte.
- Truncated sequence: a lead byte that promises 2, 3 or 4 bytes but is not followed by enough continuation bytes.
- Overlong encoding: encoding a code point with more bytes than the minimum required (e.g. C0 80 for NUL is forbidden; some Java APIs use exactly that form in what they call "modified UTF-8").
- Surrogate code points: U+D800 to U+DFFF are reserved for UTF-16 surrogate pairs and must never appear in UTF-8.
- Out-of-range code point: anything above U+10FFFF, whether written as a 4-byte sequence beyond F4 8F BF BF or as one of the old 5- and 6-byte forms (banned by RFC 3629).
- Invalid lead bytes: 0xC0, 0xC1 and 0xF5–0xFF can never start a legitimate UTF-8 sequence.
Security tools rely on strict validation: an overlong encoding has been used in the past to smuggle ../ path traversals past naive filters (Nimda worm, IIS).
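The six rejection rules can be sketched as a single pass over the bytes. This is a minimal sketch, not this tool's actual implementation: the function name validate_utf8 and the wording of the error messages are assumptions made for the example.

```python
def validate_utf8(data: bytes) -> tuple[bool, str]:
    """Return (True, "ok") or (False, reason) for the first problem found."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:  # plain ASCII
            i += 1
            continue
        if b <= 0xBF:  # 10xxxxxx where a lead byte was expected
            return False, f"lone continuation byte at offset {i}"
        if b in (0xC0, 0xC1) or b >= 0xF5:
            return False, f"invalid lead byte at offset {i}"
        # Sequence length, payload bits of the lead byte, smallest legal code point.
        if b <= 0xDF:
            length, cp, minimum = 2, b & 0x1F, 0x80
        elif b <= 0xEF:
            length, cp, minimum = 3, b & 0x0F, 0x800
        else:
            length, cp, minimum = 4, b & 0x07, 0x10000
        tail = data[i + 1 : i + length]
        if len(tail) < length - 1 or any((t & 0xC0) != 0x80 for t in tail):
            return False, f"truncated sequence at offset {i}"
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        if cp < minimum:
            return False, f"overlong encoding at offset {i}"
        if 0xD800 <= cp <= 0xDFFF:
            return False, f"surrogate code point U+{cp:04X} at offset {i}"
        if cp > 0x10FFFF:
            return False, f"code point out of range at offset {i}"
        i += length
    return True, "ok"


print(validate_utf8(bytes.fromhex("4F 6C C3 A1")))  # (True, 'ok')
print(validate_utf8(bytes.fromhex("E0 80 AF")))     # (False, 'overlong encoding at offset 0')
```

In this sketch C0 80 is reported as an invalid lead byte rather than as overlong, because C0 and C1 can only ever begin overlong 2-byte forms. For production use, Python's own bytes.decode("utf-8") applies the same strict rules.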
BOM, declarations and the charset header
The Byte Order Mark for UTF-8 is the three-byte prefix EF BB BF (the encoded form of U+FEFF). Unlike UTF-16, byte order has no meaning in UTF-8, so the BOM is purely a signature, and an optional one. The Unicode Standard neither requires nor recommends it for UTF-8, while many Windows tools (Notepad, older Excel CSV exports) insist on writing it. A BOM at the top of a PHP source file is emitted as output before any code runs, which breaks later header() calls; that long-standing pitfall is one of the main reasons "no BOM" is the safer default.
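Detecting and stripping the three-byte signature is straightforward. The sketch below uses the standard-library constant codecs.BOM_UTF8; the helper name split_bom is an assumption made for this example.

```python
import codecs


def split_bom(data: bytes) -> tuple[bool, bytes]:
    """Return (had_bom, payload) with a leading EF BB BF removed if present."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


raw = bytes.fromhex("EF BB BF 4F 6C C3 A1")
print(split_bom(raw))           # (True, b'Ol\xc3\xa1')
print(raw.decode("utf-8-sig"))  # Olá
```

The "utf-8-sig" codec is Python's built-in shortcut: on decoding it drops a leading BOM if there is one and otherwise behaves like plain "utf-8".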
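Declaration channels, in the order they are usually consulted:
- HTTP header: Content-Type: text/html; charset=UTF-8, which wins over declarations inside the document.
- HTML5 <meta charset="UTF-8">, which must appear within the first 1024 bytes of the document.
- XML prolog: <?xml version="1.0" encoding="UTF-8"?>.
- BOM, as a detection signal when nothing else is declared.
One important exception: browsers that follow the WHATWG HTML standard sniff for a BOM first, and when one is present it overrides both the HTTP header and the meta tag.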
UTF-8 vs UTF-16 vs UTF-32 and the legacy of ISO-8859-1 / Windows-1252
UTF-16 uses 2 or 4 bytes per code point and is the native string format of the Windows API, Java and JavaScript (whose strings are defined as sequences of 16-bit code units). Outside those runtimes UTF-16 is rare because it is not ASCII-compatible and is endian-sensitive. UTF-32 uses a fixed 4 bytes per code point: easy to index but wasteful for text that is mostly ASCII. ISO-8859-1 (Latin-1) and Windows-1252 are single-byte legacy encodings still found in very old Brazilian databases; converting them to UTF-8 requires an explicit transcoding step, typically with iconv or ICU.
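The size trade-off and the legacy conversion can both be checked with Python's built-in codecs. This is a small sketch; the helper names encoded_sizes and transcode_to_utf8 are assumptions made for the example, and the "-le" codec variants are used so that no BOM is counted in the sizes.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte length of the same text in the three Unicode encoding forms."""
    return {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}


def transcode_to_utf8(data: bytes, source: str) -> bytes:
    """Decode bytes from a legacy single-byte charset and re-encode them as UTF-8."""
    return data.decode(source).encode("utf-8")


print(encoded_sizes("Olá"))   # {'utf-8': 4, 'utf-16-le': 6, 'utf-32-le': 12}
print(encoded_sizes("日本"))  # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
print(transcode_to_utf8(b"ol\xe1", "latin-1").hex(" ").upper())  # 6F 6C C3 A1
```

The source charset has to be known in advance: Latin-1 and Windows-1252 disagree on bytes 80 to 9F, so passing the wrong one yields different characters without raising an error.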
MySQL utf8 versus utf8mb4 — the trap that ate emoji
In MySQL the charset name utf8 has historically been an alias for utf8mb3: sequences of at most 3 bytes, so it cannot store emoji or any other supplementary code point. MySQL 8.0 made utf8mb4 the default character set, but utf8 still maps to utf8mb3 there (now deprecated). The fix is the real 4-byte charset utf8mb4, with a collation such as utf8mb4_0900_ai_ci or utf8mb4_unicode_520_ci. Declare CHARACTER SET utf8mb4 in CREATE DATABASE and CREATE TABLE, set it in my.cnf, and make sure the connection (JDBC URL, SET NAMES) uses it too; otherwise one hop in the chain falls back to utf8mb3 and corrupts the data.
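Whether a given string would be affected by the 3-byte limit can be checked before it reaches the database. The sketch below is an application-side pre-flight check, not a MySQL API; the helper name needs_utf8mb4 is an assumption made for the example.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any code point lies above U+FFFF, i.e. needs a 4-byte UTF-8 sequence."""
    return any(ord(ch) > 0xFFFF for ch in text)


for sample in ("Olá", "日本", "😀"):
    print(sample, needs_utf8mb4(sample))
# Olá False
# 日本 False
# 😀 True
```

Anything for which this returns True does not fit in a utf8mb3 column.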
Mojibake, double encoding and how to detect them
Mojibake is the visible garbage that appears when bytes are decoded with the wrong charset. Classic Brazilian Portuguese symptom: "olá" rendered as "olÃ¡" because the UTF-8 bytes for "á" (C3 A1) were read as Latin-1. If that misread text is then re-encoded as UTF-8, the result is double encoding: "á" ends up stored as C3 83 C2 A1. Detection heuristics include checking for telltale byte sequences (C3 83 C2 A1 is valid UTF-8, but it decodes to "Ã¡" where "á" was expected), running chardet / ICU detectors and comparing byte length with character count. This tool validates strictly and shows you the byte breakdown so you can spot the issue at the source.
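One common heuristic for the UTF-8-read-as-Latin-1 case is a round trip: encode the suspicious text back to Latin-1 and see whether the resulting bytes decode cleanly as UTF-8. The sketch below assumes Latin-1 was the wrong charset involved (Windows-1252 needs "cp1252" instead), and the helper names are made up for this example. It is a heuristic, so short or unusual strings can produce false positives.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8-read-as-Latin-1, or return the text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or the bytes are not valid UTF-8:
        # the text was probably fine to begin with.
        return text


def looks_double_encoded(text: str) -> bool:
    """True if the round trip succeeds and actually changes the text."""
    return repair_mojibake(text) != text


garbled = bytes.fromhex("6F 6C C3 83 C2 A1").decode("utf-8")
print(garbled)                        # olÃ¡
print(looks_double_encoded(garbled))  # True
print(repair_mojibake(garbled))       # olá
print(looks_double_encoded("olá"))    # False
```

Correctly decoded text such as "olá" fails the round trip because its Latin-1 byte E1 is a UTF-8 lead byte with nothing after it, so the helper leaves it alone.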
FAQ
Is the BOM required for UTF-8 files? No. The Unicode Standard neither requires nor recommends it. Use it only if the consumer explicitly expects it (some Excel CSV workflows on Windows).
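Do emojis really need 4 bytes? Most of them, yes. The bulk of emoji live in the supplementary planes from U+1F000 upwards, which UTF-8 encodes with the 4-byte form; a few older symbols such as ❤ (U+2764) sit in the BMP and take 3 bytes. Skin-tone modifiers and ZWJ sequences chain several code points together, so a single visible emoji can occupy well over 4 bytes.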
Should I pick utf8 or utf8mb4 in MySQL? Always utf8mb4. The plain "utf8" alias is a historic mistake and silently breaks emoji and supplementary characters.
Why does my Brazilian text show "?" or "Ã©" in the database? A charset mismatch on insert. Either the connection charset is wrong (set charset=utf8mb4 in the DSN) or the data was already mojibake at the source. Fix the pipeline first, then, if the stored bytes are still Latin-1, convert them with iconv -f LATIN1 -t UTF-8.
Can a single byte be valid UTF-8 by itself? Only if it is in the range 00–7F. Anything between 80 and FF alone is invalid — it has to be part of a multi-byte sequence.