- Scripts
- Planes
- Code units
- Graphemes
- Glyphs
- Sequences
- Normalization
- Emojis
- The first 128 characters
- Unicode encodings
Unicode is an industry standard for consistent encoding of written text.
There are lots of character sets which are used by computers, but Unicode is the first of its kind to aim to support every single written language on earth (and beyond!).
Its aim is to provide a unique number to identify every character for every language, on any platform.
Unicode maps every character to a specific code, called a code point. A code point takes the form U+<hex-code>, ranging from U+0000 to U+10FFFF.
An example code point looks like this: U+004F. The code point identifies the character; how it is stored as bytes depends on the character encoding used.
Unicode defines different character encodings, the most used ones being UTF-8, UTF-16 and UTF-32.
UTF-8 is definitely the most popular encoding in the Unicode family, especially on the Web. This document is written in UTF-8, for example.
Currently there are more than 135,000 different characters implemented, with space for more than 1.1 million.
Scripts
All the Unicode supported characters are grouped into sections called scripts.
There is a script for every different character set:
- Latin (contains the ASCII letters plus the other letters used by western languages)
- Korean
- Old Hungarian
- Hebrew
- Greek
- Armenian
- …and so on!
The full list is defined in the ISO 15924 standard.
See more on scripts: https://en.wikipedia.org/wiki/Script_(Unicode)
Planes
In addition to scripts, there is another way that Unicode organizes its characters: planes.
Instead of grouping them by type, it checks the code point value:
Plane | Range |
---|---|
0 | U+0000 - U+FFFF |
1 | U+10000 - U+1FFFF |
2 | U+20000 - U+2FFFF |
… | … |
14 | U+E0000 - U+EFFFF |
15 | U+F0000 - U+FFFFF |
16 | U+100000 - U+10FFFF |
There are 17 planes.
The first is special: it’s called the Basic Multilingual Plane, or BMP, and contains most of the modern characters and symbols, including those from the Latin, Cyrillic and Greek scripts.
The other 16 planes are called astral planes. It’s worth noting that planes 4 to 13 are currently empty.
The code points contained in astral planes are called astral code points.
Astral code points are all code points from U+10000 upward.
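Since each plane covers exactly 0x10000 code points, the plane number can be read off the code point by dropping its lowest 16 bits. The snippet below is a minimal sketch of that idea; `planeOf` and `isAstral` are hypothetical helper names, not part of any standard API.

```javascript
// Hypothetical helper: returns the plane (0 to 16) that a code point belongs to.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("Not a valid Unicode code point: " + codePoint);
  }
  // Each plane holds 0x10000 (65,536) code points, so shift away the low 16 bits.
  return codePoint >> 16;
}

// Hypothetical helper: a code point is astral when it lies outside plane 0 (the BMP).
function isAstral(codePoint) {
  return planeOf(codePoint) > 0;
}

console.log(planeOf(0x004f));   // 0
console.log(planeOf(0x1f436));  // 1
console.log(planeOf(0x10ffff)); // 16
console.log(isAstral(0xffff));  // false
console.log(isAstral(0x10000)); // true
```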
Code units
Code points are internally stored as code units. A code unit is the bit representation of a character, and its length varies depending on the character encoding.
UTF-32 uses a 32-bit code unit.
UTF-8 uses an 8-bit code unit, and UTF-16 uses a 16-bit code unit. If a code point needs a larger size, it will be represented by 2 (or more, in UTF-8) code units.
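To make the difference concrete, here is a small sketch that counts how many code units the same text needs in each encoding. It assumes an environment where `TextEncoder` is available as a global (modern browsers and recent Node.js versions); `codeUnitCounts` is a hypothetical helper name.

```javascript
// Hypothetical helper: counts the code units a string needs in each encoding.
function codeUnitCounts(text) {
  return {
    utf8: new TextEncoder().encode(text).length, // 8-bit code units (bytes)
    utf16: text.length,                          // JavaScript strings expose 16-bit code units
    utf32: Array.from(text).length,              // one 32-bit code unit per code point
  };
}

console.log(codeUnitCounts("A"));         // { utf8: 1, utf16: 1, utf32: 1 }
console.log(codeUnitCounts("\u00e9"));    // é: { utf8: 2, utf16: 1, utf32: 1 }
console.log(codeUnitCounts("\u20ac"));    // €: { utf8: 3, utf16: 1, utf32: 1 }
console.log(codeUnitCounts("\u{1F436}")); // 🐶: { utf8: 4, utf16: 2, utf32: 1 }
```

The UTF-32 count relies on the fact that iterating a JavaScript string yields whole code points rather than 16-bit units.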
Graphemes
A grapheme is a symbol that represents a unit of a writing system. It’s basically your idea of a character and how it should look.
Glyphs
A glyph is a graphic representation of a grapheme: how it is visually displayed on screen, the actual appearance on the display.
Sequences
Unicode lets you combine different characters to form a grapheme.
For example, this is the case for accented characters: the letter é can be expressed by combining the letter e (U+0065) and the Unicode character named “COMBINING ACUTE ACCENT” (U+0301):
"U+0065U+0301" ➡️ "é"
U+0301 in this case is what is described as a combining mark, a character that applies to the previous one to form a different grapheme.
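As a quick illustration, the sequence can be built in JavaScript with `\u` escapes. This is only a sketch; the variable names are made up for the example.

```javascript
const base = "\u0065";   // LATIN SMALL LETTER E
const accent = "\u0301"; // COMBINING ACUTE ACCENT
const combined = base + accent;

console.log(combined);        // rendered as a single grapheme: é
console.log(combined.length); // 2

// List the code points that make up the sequence in U+XXXX notation.
const labels = Array.from(
  combined,
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(labels); // [ 'U+0065', 'U+0301' ]
```

The string is displayed as one grapheme, but it is still made of two code points.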
Normalization
A character can sometimes be represented using different combinations of code points.
For example, the letter é can be expressed both as U+00E9 and as the combination of e (U+0065) and the Unicode character named “COMBINING ACUTE ACCENT” (U+0301):
U+00E9 ➡️ "é"
U+0065U+0301 ➡️ "é"
The normalization process analyzes a string for this kind of ambiguity and generates a string with the canonical representation of every character.
Without normalization, strings that look perfectly equal to the eye will be considered different because their internal representations differ.
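In JavaScript, one way to see this is with the built-in `String.prototype.normalize()` method. The sketch below compares the two representations of é before and after normalizing; `sameText` is a hypothetical helper, and NFC is just one of the available normalization forms.

```javascript
const precomposed = "\u00e9";      // é as a single code point
const decomposed = "\u0065\u0301"; // e followed by COMBINING ACUTE ACCENT

// The two strings look identical on screen but compare as different.
console.log(precomposed === decomposed); // false

// Hypothetical helper: compares two strings after normalizing both to NFC.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(sameText(precomposed, decomposed));   // true
console.log(decomposed.normalize("NFC").length);  // 1 (composed form)
console.log(precomposed.normalize("NFD").length); // 2 (decomposed form)
```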
Emojis
Most emojis are Unicode astral plane characters, and they provide a way to have images on your screen without actually having real images, just font glyphs.
As an example, the 🐶 symbol has the code point U+1F436.
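A short sketch of how this code point can be inspected from JavaScript, using the standard `codePointAt()` and `String.fromCodePoint()` methods (the variable name is illustrative):

```javascript
const dog = "\u{1F436}"; // 🐶

// Read the code point back from the string.
console.log(dog.codePointAt(0).toString(16)); // 1f436

// Build the same character from its numeric code point.
console.log(String.fromCodePoint(0x1f436) === dog); // true

// Iterating the string yields one code point, even though it is astral.
console.log(Array.from(dog).length); // 1
```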
The first 128 characters
The first 128 characters of Unicode are the same as the ASCII character set.
The first 32 characters, U+0000-U+001F (0-31), are called Control Codes.
They are an inheritance from the past and most of them are now obsolete. They were used for teletype machines, something that existed before the fax.
Characters from U+0020 (32) to U+007E (126) contain numbers, letters and some symbols:
Unicode | ASCII code | Glyph |
---|---|---|
U+0020 | 32 | (space) |
U+0021 | 33 | ! |
U+0022 | 34 | " |
U+0023 | 35 | # |
U+0024 | 36 | $ |
U+0025 | 37 | % |
U+0026 | 38 | & |
U+0027 | 39 | ' |
U+0028 | 40 | ( |
U+0029 | 41 | ) |
U+002A | 42 | * |
U+002B | 43 | + |
U+002C | 44 | , |
U+002D | 45 | - |
U+002E | 46 | . |
U+002F | 47 | / |
U+0030 | 48 | 0 |
U+0031 | 49 | 1 |
U+0032 | 50 | 2 |
U+0033 | 51 | 3 |
U+0034 | 52 | 4 |
U+0035 | 53 | 5 |
U+0036 | 54 | 6 |
U+0037 | 55 | 7 |
U+0038 | 56 | 8 |
U+0039 | 57 | 9 |
U+003A | 58 | : |
U+003B | 59 | ; |
U+003C | 60 | < |
U+003D | 61 | = |
U+003E | 62 | > |
U+003F | 63 | ? |
U+0040 | 64 | @ |
U+0041 | 65 | A |
U+0042 | 66 | B |
U+0043 | 67 | C |
U+0044 | 68 | D |
U+0045 | 69 | E |
U+0046 | 70 | F |
U+0047 | 71 | G |
U+0048 | 72 | H |
U+0049 | 73 | I |
U+004A | 74 | J |
U+004B | 75 | K |
U+004C | 76 | L |
U+004D | 77 | M |
U+004E | 78 | N |
U+004F | 79 | O |
U+0050 | 80 | P |
U+0051 | 81 | Q |
U+0052 | 82 | R |
U+0053 | 83 | S |
U+0054 | 84 | T |
U+0055 | 85 | U |
U+0056 | 86 | V |
U+0057 | 87 | W |
U+0058 | 88 | X |
U+0059 | 89 | Y |
U+005A | 90 | Z |
U+005B | 91 | [ |
U+005C | 92 | \ |
U+005D | 93 | ] |
U+005E | 94 | ^ |
U+005F | 95 | _ |
U+0060 | 96 | ` |
U+0061 | 97 | a |
U+0062 | 98 | b |
U+0063 | 99 | c |
U+0064 | 100 | d |
U+0065 | 101 | e |
U+0066 | 102 | f |
U+0067 | 103 | g |
U+0068 | 104 | h |
U+0069 | 105 | i |
U+006A | 106 | j |
U+006B | 107 | k |
U+006C | 108 | l |
U+006D | 109 | m |
U+006E | 110 | n |
U+006F | 111 | o |
U+0070 | 112 | p |
U+0071 | 113 | q |
U+0072 | 114 | r |
U+0073 | 115 | s |
U+0074 | 116 | t |
U+0075 | 117 | u |
U+0076 | 118 | v |
U+0077 | 119 | w |
U+0078 | 120 | x |
U+0079 | 121 | y |
U+007A | 122 | z |
U+007B | 123 | { |
U+007C | 124 | \| |
U+007D | 125 | } |
U+007E | 126 | ~ |
- Numbers go from U+0030 to U+0039
- Uppercase letters go from U+0041 to U+005A
- Lowercase letters go from U+0061 to U+007A
U+007F (127) is the delete character.
Everything going forward is outside the realm of ASCII, and is part of Unicode exclusively.
You can find the whole list on Wikipedia: https://en.wikipedia.org/wiki/List_of_Unicode_characters
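The ranges listed in this section translate directly into numeric comparisons. The following is a minimal sketch, assuming the input is a single character; `classifyAscii` and its return labels are hypothetical and chosen only for this example.

```javascript
// Hypothetical helper: classifies one character using the ASCII ranges described above.
function classifyAscii(ch) {
  const code = ch.codePointAt(0);
  if (code === undefined) throw new RangeError("Expected a non-empty string");
  if (code > 0x7f) return "non-ASCII";                          // outside the first 128 characters
  if (code <= 0x1f) return "control";                           // U+0000 to U+001F
  if (code === 0x7f) return "delete";                           // U+007F
  if (code >= 0x30 && code <= 0x39) return "digit";             // U+0030 to U+0039
  if (code >= 0x41 && code <= 0x5a) return "uppercase letter";  // U+0041 to U+005A
  if (code >= 0x61 && code <= 0x7a) return "lowercase letter";  // U+0061 to U+007A
  return "space or symbol";                                     // everything else up to U+007E
}

console.log(classifyAscii("7"));      // digit
console.log(classifyAscii("Q"));      // uppercase letter
console.log(classifyAscii("q"));      // lowercase letter
console.log(classifyAscii("#"));      // space or symbol
console.log(classifyAscii("\n"));     // control
console.log(classifyAscii("\u00e9")); // non-ASCII
```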
Unicode encodings
UTF-8
UTF-8 is a variable-width character encoding, and it can encode every character covered by Unicode, using 1 to 4 bytes.
It was originally designed by Ken Thompson and Rob Pike in 1992. Those names are familiar to those with any interest in the Go programming language, as they were two of the original creators of that as well.
It’s recommended by the W3C as the default encoding in HTML files, and stats indicate that it was used on 91.3% of all web pages as of April 2018.
At the time of its introduction, ASCII was the most popular character encoding in the western world. In ASCII all letters, digits and symbols were assigned a number. Because ASCII uses only 7 bits of each byte, it could only represent a maximum of 128 characters, and that was enough for English text.
UTF-8 was designed to be backward compatible with ASCII. This was very important for its adoption, as ASCII was much older (1963) and widespread, and moving to UTF-8 came almost transparently.
The first 128 characters of UTF-8 map exactly to ASCII. Why 128? Because ASCII uses a 7-bit encoding, which allows up to 128 combinations. Why 7 bits? We now take 8 bits for granted, but back in the day when ASCII was conceived, 7-bit systems were popular as well.
Being 100% compatible with ASCII makes UTF-8 also very efficient, because the most frequently used characters in the western languages are encoded with 1 byte only.
Here is the map of the bytes usage:
Number of bytes | Start | End |
---|---|---|
1 | U+0000 | U+007F |
2 | U+0080 | U+07FF |
3 | U+0800 | U+FFFF |
4 | U+10000 | U+10FFFF |
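The byte map can be written as a small lookup function and cross-checked against a real encoder. This is a sketch, assuming `TextEncoder` is available as a global; `utf8ByteLength` is a hypothetical helper name.

```javascript
// Hypothetical helper: number of UTF-8 bytes needed for a code point, per the table above.
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4;
}

// Compare the table-based answer with what TextEncoder actually produces.
const encoder = new TextEncoder();
for (const ch of ["A", "\u00e9", "\u20ac", "\u{1F436}"]) {
  const codePoint = ch.codePointAt(0);
  console.log(
    codePoint.toString(16),
    utf8ByteLength(codePoint),
    encoder.encode(ch).length
  );
}
// 41 1 1
// e9 2 2
// 20ac 3 3
// 1f436 4 4
```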
Remember that in ASCII the characters were encoded as numbers? If the letter A in ASCII was represented with the number 65, in Unicode its code point is U+0041, and UTF-8 stores it as the single byte 0x41.
Why not U+0065, you ask? Because Unicode code points are written in hexadecimal: 65 in decimal is 41 in hexadecimal, just as 10 in decimal is U+000A (basically, you have a set of 16 digits instead of 10).
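The decimal and hexadecimal forms are easy to convert between in JavaScript. A brief sketch, with illustrative variable names and `TextEncoder` assumed to be available as a global:

```javascript
const decimal = "A".codePointAt(0); // 65, the ASCII number for A
const hex = decimal.toString(16);   // "41", the same value in base 16
const label = "U+" + hex.toUpperCase().padStart(4, "0");

console.log(decimal, hex, label);   // 65 41 U+0041
console.log(parseInt("0041", 16));  // 65
console.log((10).toString(16));     // a

// In UTF-8 the letter A is stored as one byte with that same value.
const bytes = Array.from(new TextEncoder().encode("A"));
console.log(bytes);                 // [ 65 ]
```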
Take a look at this video, which brilliantly explains this UTF-8 and ASCII compatibility.
UTF-16
UTF-16 is another very popular Unicode encoding. For example, it’s how Java internally represents any character. It’s also one of the 2 encodings JavaScript uses internally, along with UCS-2. It’s used by many other systems as well, like Windows.
UTF-16 is a variable-length encoding system, like UTF-8, but uses 2 bytes (16 bits) as the minimum for any character representation. As such, it’s not backward compatible with the ASCII standard.
Code points in the Basic Multilingual Plane (BMP) are stored using 2 bytes. Code points in astral planes are stored using 4 bytes.
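The 4-byte form is a pair of 16-bit code units known as a surrogate pair. The sketch below shows the arithmetic UTF-16 uses to split an astral code point into that pair; `toUtf16CodeUnits` is a hypothetical helper, and it does not validate that the input is a legal code point.

```javascript
// Hypothetical helper: returns the UTF-16 code units for one code point.
function toUtf16CodeUnits(codePoint) {
  // BMP code points fit in a single 16-bit code unit.
  if (codePoint <= 0xffff) return [codePoint];

  // Astral code points: subtract 0x10000, then split the remaining 20 bits in two halves.
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const units = toUtf16CodeUnits(0x1f436);
console.log(units.map((u) => u.toString(16)));              // [ 'd83d', 'dc36' ]
console.log(String.fromCharCode(...units) === "\u{1F436}"); // true
console.log(toUtf16CodeUnits(0x004f));                      // [ 79 ]
```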
UTF-32
UTF-8 uses a minimum of 1 byte, UTF-16 uses a minimum of 2 bytes.
UTF-32 always uses 4 bytes, without optimizing for space usage, and as such it wastes a lot of bandwidth.
This constraint makes it faster to operate on because you have less to check, as you can assume 4 bytes for all characters.
It’s not as popular as UTF-8 and UTF-16, but it has its applications.
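To illustrate the fixed-width trade-off, a string can be stored as one 32-bit number per code point. This is only a sketch of the idea using a `Uint32Array`, not a byte-exact UTF-32 serializer (it ignores byte order, for instance); `toUtf32` is a hypothetical helper name.

```javascript
// Hypothetical helper: stores each code point of a string in one 32-bit slot.
function toUtf32(text) {
  return Uint32Array.from(Array.from(text, (ch) => ch.codePointAt(0)));
}

const text = "a\u{1F436}b"; // a, 🐶, b
const utf32 = toUtf32(text);

console.log(utf32.length);     // 3 code units, one per character
console.log(utf32.byteLength); // 12 bytes, 4 for every character

// Fixed width means the third character is simply at index 2.
console.log(String.fromCodePoint(utf32[2])); // b

// In the UTF-16 string, index 2 lands in the middle of the emoji's surrogate pair.
console.log(text[2] === "\udc36"); // true
```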