Introduction to Unicode and UTF-8

Unicode is an industry standard for consistent encoding of written text. Learn the basics and the most important parts of it, with a focus on UTF-8.


Computers use many character sets, but Unicode is the first that aims to support every single written language on earth (and beyond!).

Its aim is to provide a unique number to identify every character for every language, on any platform.

Unicode maps every character to a specific code, called a code point. A code point takes the form of U+<hex-code>, ranging from U+0000 to U+10FFFF.

An example code point looks like this: U+004F. The code point itself always identifies the same character; how it is stored as bytes depends on the character encoding used.
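As a minimal sketch, here is how a character and its code point relate in Python (the language is chosen here only for illustration, and the helper name `code_point_label` is hypothetical). It relies on the built-in `ord` and `chr` functions:

```python
def code_point_label(char: str) -> str:
    """Return the U+XXXX label of a single character."""
    # ord() gives the code point as an integer; the U+ notation writes it
    # with at least four uppercase hexadecimal digits.
    return f"U+{ord(char):04X}"


print(code_point_label("O"))  # U+004F
print(chr(0x004F))            # O
```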

Unicode defines different character encodings, the most used ones being UTF-8, UTF-16 and UTF-32.

UTF-8 is definitely the most popular encoding in the Unicode family, especially on the Web. This document is written in UTF-8, for example.

Currently there are more than 135,000 different characters defined, with space for more than 1.1 million.

Scripts

All the Unicode supported characters are grouped into sections called scripts.

There is a script for every writing system, for example Latin, Cyrillic or Greek.

The full list is defined in the ISO 15924 standard.

See more on scripts: https://en.wikipedia.org/wiki/Script_(Unicode)

Planes

In addition to scripts, there is another way that Unicode organizes its characters: planes.

Instead of grouping them by type, it checks the code point value:

Plane   Range
0       U+0000 - U+FFFF
1       U+10000 - U+1FFFF
2       U+20000 - U+2FFFF
14      U+E0000 - U+EFFFF
15      U+F0000 - U+FFFFF
16      U+100000 - U+10FFFF

There are 17 planes.

The first is special: it’s called the Basic Multilingual Plane, or BMP, and contains most modern characters and symbols, including the Latin, Cyrillic and Greek scripts.

The other 16 planes are called astral planes (the official name is supplementary planes). It’s worth noting that planes 3 to 13 were empty at the time of writing.

The code points contained in astral planes are called astral code points.

Astral code points are all code points from U+10000 upwards, that is, everything above U+FFFF.
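Since every plane spans 0x10000 (65,536) code points, the plane number can be computed from the code point value. The following Python sketch illustrates this; the function names `plane_of` and `is_astral` are hypothetical helpers, not part of any library:

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane holds 0x10000 code points, so dropping the lowest
    # 16 bits leaves the plane number.
    return code_point >> 16


def is_astral(code_point: int) -> bool:
    """True for every code point outside the Basic Multilingual Plane."""
    return plane_of(code_point) > 0


print(plane_of(0x004F))     # 0
print(plane_of(0x1F436))    # 1
print(plane_of(0x10FFFF))   # 16
print(is_astral(0xFFFF))    # False
print(is_astral(0x10000))   # True
```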

Code units

Code points are internally stored as code units. A code unit is the smallest group of bits an encoding uses to represent a character, and its length varies depending on the character encoding.

UTF-32 uses a 32-bit code unit.

UTF-8 uses an 8-bit code unit, and UTF-16 uses a 16-bit code unit. If a code point needs a larger size, it will be represented by 2 (or more, in UTF-8) code units.
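To make this concrete, the Python sketch below counts how many code units a few characters need in each encoding. The helper `code_unit_count` is a hypothetical name, and the sample characters (A, é, € and 🐶) are just convenient examples from different code point ranges:

```python
def code_unit_count(text: str, encoding: str, unit_bytes: int) -> int:
    """Number of code units needed to store text in the given encoding."""
    # The "-le" codec variants do not prepend a byte order mark, so the
    # byte length divided by the unit size is the number of code units.
    return len(text.encode(encoding)) // unit_bytes


# A (U+0041), é (U+00E9), € (U+20AC), 🐶 (U+1F436)
for char in ["A", "\u00e9", "\u20ac", "\U0001F436"]:
    print(
        f"U+{ord(char):04X}",
        code_unit_count(char, "utf-8", 1),      # 8-bit code units
        code_unit_count(char, "utf-16-le", 2),  # 16-bit code units
        code_unit_count(char, "utf-32-le", 4),  # 32-bit code units
    )
# U+0041 1 1 1
# U+00E9 2 1 1
# U+20AC 3 1 1
# U+1F436 4 2 1
```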

Graphemes

A grapheme is a symbol that represents a unit of a writing system. It’s basically your idea of a character and what it should look like.

Glyphs

A glyph is a graphic representation of a grapheme: how it is visually displayed on screen, the actual appearance on the display.

Sequences

Unicode lets you combine different characters to form a grapheme.

This is the case for accented characters: the letter é can be expressed by combining the letter e (U+0065) with the Unicode character named “COMBINING ACUTE ACCENT” (U+0301):

"U+0065U+0301" ➡️ "é"

U+0301 in this case is what is described as a combining mark, one character that applies to the previous one to form a different grapheme.
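The same sequence can be inspected in Python using the standard `unicodedata` module. This is only an illustrative sketch; how the result is drawn depends on the font and terminal in use:

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT
sequence = "\u0065\u0301"

print(sequence)                       # rendered as a single é
print(len(sequence))                  # 2, because it holds two code points
print(unicodedata.name(sequence[1]))  # COMBINING ACUTE ACCENT

# combining() returns a non-zero class for combining marks
# and 0 for ordinary characters such as "e".
print(unicodedata.combining(sequence[0]))       # 0
print(unicodedata.combining(sequence[1]) != 0)  # True
```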

Normalization

A character can sometimes be represented using different combinations of code points.

For example, the letter é can be expressed both as the single code point U+00E9 and as the letter e (U+0065) followed by the Unicode character named “COMBINING ACUTE ACCENT” (U+0301):

U+00E9       ➡️ "é"
U+0065U+0301 ➡️ "é"

The normalization process analyzes a string for this kind of ambiguity, and generates a string with the canonical representation of every character.

Without normalization, strings that look perfectly equal to the eye will be considered different because their internal representations differ.
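Here is a short Python sketch of the problem and of one way to solve it, using `unicodedata.normalize` from the standard library. The form names NFC (composed) and NFD (decomposed) come from the Unicode standard and are not discussed elsewhere in this article:

```python
import unicodedata

precomposed = "\u00e9"       # é as a single code point
decomposed = "\u0065\u0301"  # e + COMBINING ACUTE ACCENT

# Both look identical on screen, but they are different strings.
print(precomposed == decomposed)  # False

# Normalizing both sides to the same form makes the comparison reliable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)  # True
print(nfd == decomposed)   # True
```

Which form to normalize to is a design choice. What matters is that both strings go through the same form before they are compared.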

Emojis

Most emojis are Unicode astral plane characters, and they provide a way to have images on your screen without actually having real images, just font glyphs.

As an example, the 🐶 symbol is encoded as U+1F436.
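As a small Python sketch, the official character name can be looked up with `unicodedata.name`. The second sample character, ❤ (U+2764), is an extra example chosen here to show an emoji-like symbol that sits in the BMP rather than in an astral plane:

```python
import unicodedata

# 🐶 (U+1F436) and ❤ (U+2764)
for char in ["\U0001F436", "\u2764"]:
    code_point = ord(char)
    location = "astral" if code_point > 0xFFFF else "BMP"
    print(f"U+{code_point:04X}", unicodedata.name(char), location)
# U+1F436 DOG FACE astral
# U+2764 HEAVY BLACK HEART BMP
```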

The first 128 characters

The first 128 characters of Unicode are the same as the ASCII character set.

The first 32 characters, U+0000-U+001F (0-31) are called Control Codes.

They are an inheritance from the past and most of them are now obsolete. They were used for teletype machines, something that existed before the fax.

Characters from U+0020 (32) to U+007E (126) contain digits, letters and some symbols:

Unicode   ASCII code   Glyph
U+0020    32           (space)
U+0021    33           !
U+0022    34           "
U+0023    35           #
U+0024    36           $
U+0025    37           %
U+0026    38           &
U+0027    39           '
U+0028    40           (
U+0029    41           )
U+002A    42           *
U+002B    43           +
U+002C    44           ,
U+002D    45           -
U+002E    46           .
U+002F    47           /
U+0030    48           0
U+0031    49           1
U+0032    50           2
U+0033    51           3
U+0034    52           4
U+0035    53           5
U+0036    54           6
U+0037    55           7
U+0038    56           8
U+0039    57           9
U+003A    58           :
U+003B    59           ;
U+003C    60           <
U+003D    61           =
U+003E    62           >
U+003F    63           ?
U+0040    64           @
U+0041    65           A
U+0042    66           B
U+0043    67           C
U+0044    68           D
U+0045    69           E
U+0046    70           F
U+0047    71           G
U+0048    72           H
U+0049    73           I
U+004A    74           J
U+004B    75           K
U+004C    76           L
U+004D    77           M
U+004E    78           N
U+004F    79           O
U+0050    80           P
U+0051    81           Q
U+0052    82           R
U+0053    83           S
U+0054    84           T
U+0055    85           U
U+0056    86           V
U+0057    87           W
U+0058    88           X
U+0059    89           Y
U+005A    90           Z
U+005B    91           [
U+005C    92           \
U+005D    93           ]
U+005E    94           ^
U+005F    95           _
U+0060    96           `
U+0061    97           a
U+0062    98           b
U+0063    99           c
U+0064    100          d
U+0065    101          e
U+0066    102          f
U+0067    103          g
U+0068    104          h
U+0069    105          i
U+006A    106          j
U+006B    107          k
U+006C    108          l
U+006D    109          m
U+006E    110          n
U+006F    111          o
U+0070    112          p
U+0071    113          q
U+0072    114          r
U+0073    115          s
U+0074    116          t
U+0075    117          u
U+0076    118          v
U+0077    119          w
U+0078    120          x
U+0079    121          y
U+007A    122          z
U+007B    123          {
U+007C    124          |
U+007D    125          }
U+007E    126          ~

U+007F (127) is the delete character.

Everything going forward is outside the realm of ASCII, and is part of Unicode exclusively.

You can find the whole list on Wikipedia: https://en.wikipedia.org/wiki/List_of_Unicode_characters

Unicode encodings

UTF-8

UTF-8 is a variable-width character encoding that can encode every character covered by Unicode, using 1 to 4 bytes.

It was originally designed by Ken Thompson and Rob Pike in 1992. Those names are familiar to those with any interest in the Go programming language, as they were two of the original creators of that as well.

It’s recommended by the W3C as the default encoding for HTML files, and stats indicate that it was used on 91.3% of all web pages as of April 2018.

At the time of its introduction, ASCII was the most popular character encoding in the western world. In ASCII every letter, digit and symbol was assigned a number. Since that number was limited to 7 bits, ASCII could represent at most 128 characters, and at the time that was enough.

UTF-8 was designed to be backward compatible with ASCII. This was very important for its adoption, as ASCII was much older (1963) and widespread, and moving to UTF-8 came almost transparently.

The first 128 characters of UTF-8 map exactly to ASCII. Why 128? Because ASCII uses a 7-bit encoding, which allows up to 128 combinations. Why 7 bits? We now take 8 bits for granted, but back in the day when ASCII was conceived, 7-bit systems were popular as well.

Being 100% compatible with ASCII makes UTF-8 also very efficient, because the most frequently used characters in the western languages are encoded with 1 byte only.
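This compatibility is easy to check. The Python sketch below (the sample string is arbitrary) encodes the same text with both the `ascii` and `utf-8` codecs and compares the resulting bytes:

```python
ascii_text = "Hello, UTF-8!"

# For pure ASCII text, both encodings produce exactly the same bytes.
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))  # True

# Each of the 128 ASCII code points is stored in UTF-8 as one byte
# whose value equals the code point itself.
same_for_all = all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(128))
print(same_for_all)  # True

# The first code point outside ASCII already needs two bytes.
print(len(chr(128).encode("utf-8")))  # 2
```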

Here is how many bytes each range of code points needs:

Number of bytes   Start     End
1                 U+0000    U+007F
2                 U+0080    U+07FF
3                 U+0800    U+FFFF
4                 U+10000   U+10FFFF
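The ranges in the table can be turned into a small hand-written encoder. This Python sketch is for illustration only (in practice you would call `str.encode("utf-8")`), and the function name `utf8_encode` is hypothetical. The bit layouts in the comments are the standard UTF-8 byte patterns, which this article does not cover in detail:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point to UTF-8 by hand."""
    # Code points U+D800 to U+DFFF are reserved for UTF-16 surrogates
    # and cannot be encoded on their own.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:    # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:   # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for cp in (0x41, 0xE9, 0x20AC, 0x1F436):
    encoded = utf8_encode(cp)
    # Compare against Python's built-in encoder as a sanity check.
    print(f"U+{cp:04X}", encoded.hex(), encoded == chr(cp).encode("utf-8"))
# U+0041 41 True
# U+00E9 c3a9 True
# U+20AC e282ac True
# U+1F436 f09f90b6 True
```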

Remember that in ASCII the characters were encoded as numbers? The letter A is number 65 in ASCII. In Unicode its code point is U+0041, and UTF-8 stores it as the single byte 0x41, which is 65 in decimal.

Why not U+0065, you ask? Because Unicode code points are written in hexadecimal: you have a set of 16 digits (0-9 and A-F) instead of 10, so decimal 10 is U+000A and decimal 65 is U+0041.
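The decimal and hexadecimal notations can be converted back and forth with Python built-ins, as in this quick sketch:

```python
# Decimal 65 written in hexadecimal is 41.
print(hex(65))        # 0x41
print(int("41", 16))  # 65

# Formatting decimal numbers in the U+ notation.
print(f"U+{65:04X}")  # U+0041
print(f"U+{10:04X}")  # U+000A

# U+0065 is a different character: decimal 101, the lowercase letter e.
print(int("0065", 16))  # 101
print(chr(0x0065))      # e
```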

Take a look at this video, which brilliantly explains this UTF-8 and ASCII compatibility.

UTF-16

UTF-16 is another very popular Unicode encoding. For example, it’s how Java internally represents any character. It’s also one of the 2 encodings JavaScript uses internally, along with UCS-2. It’s used by many other systems as well, like Windows.

UTF-16 is a variable-length encoding, like UTF-8, but it uses 2 bytes (16 bits) as the minimum for any character. As such, it is not backward compatible with ASCII.

Code points in the Basic Multilingual Plane (BMP) are stored using 2 bytes. Code points in astral planes are stored using 4 bytes.
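The 4 bytes used for an astral code point are two 16-bit code units, known as a surrogate pair. That term and the constants below come from the UTF-16 definition, not from this article. A minimal Python sketch of the calculation, with the hypothetical helper name `utf16_code_units`:

```python
def utf16_code_units(cp: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0xFFFF:
        # BMP code points fit in a single code unit.
        return [cp]
    # Astral code points: subtract 0x10000 to get a 20-bit value, then
    # split it into a high and a low surrogate of 10 bits each.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


print([f"{unit:04X}" for unit in utf16_code_units(0x0041)])   # ['0041']
print([f"{unit:04X}" for unit in utf16_code_units(0x1F436)])  # ['D83D', 'DC36']
```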

UTF-32

UTF-8 uses a minimum of 1 byte, UTF-16 uses a minimum of 2 bytes.

UTF-32 always uses 4 bytes, without optimizing for space usage, and as such it wastes a lot of bandwidth.

This constraint makes it faster to operate on because you have less to check: you can assume 4 bytes for every character.
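Because every character has the same width, the position of the n-th character can be computed directly instead of scanning from the start. The Python sketch below illustrates the idea with a hypothetical helper, `utf32_char_at`, working on little-endian UTF-32 bytes without a byte order mark:

```python
def utf32_char_at(data: bytes, index: int) -> str:
    """Return the character at the given position in UTF-32-LE bytes."""
    # Character number `index` always occupies bytes 4*index to 4*index + 3.
    chunk = data[4 * index : 4 * index + 4]
    if len(chunk) != 4:
        raise IndexError("character index out of range")
    return chr(int.from_bytes(chunk, "little"))


text = "a\u00e9\U0001F436z"  # a, é, 🐶, z
data = text.encode("utf-32-le")

print(len(data))                                  # 16 bytes for 4 characters
print(utf32_char_at(data, 2) == "\U0001F436")     # True
print(utf32_char_at(data, 3))                     # z
```

The same four characters take 8 bytes in UTF-8, which shows the space cost of the fixed width.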

It’s not as popular as UTF-8 and UTF-16, but it has its applications.
