A Gentle Introduction to Unicode and Encoding (Part 4)

In Part 3 we discussed some less-than-stellar attempts at encoding Unicode code points for use with a computer.

To recap:

  • two approaches to encoding Unicode are worth mentioning: UTF-16 and UTF-32.
  • both were wasteful and inefficient (i.e. they sucked)

UTF-8 was created as an attempt to represent Unicode code points in a more economical fashion. It doesn't directly map code points to binary values like UTF-16 and UTF-32 did. Instead, it handles bytes as if they were simply "bit containers" (this will become clearer in a minute). The rules are simple. If a code point is between 0 and 127, the UTF-8 encoder creates a 1-byte container (8 bits) to hold the code point value. For code points beyond 127 it creates containers that are 2, 3 or more bytes long. Yup, UTF-8 can store code points using variable-sized byte chunks.
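
If you have Python 3 handy, you can watch those variable-sized chunks for yourself. This is just an illustrative sketch (the sample characters are my own picks, one per container size), using nothing beyond str.encode from the standard library:

# Each string is encoded to its raw UTF-8 bytes; note the growing byte count.
for ch in ['a', 'é', '€', '😀']:
    encoded = ch.encode('utf-8')
    print(ch, hex(ord(ch)), len(encoded), 'byte(s):', encoded.hex())

# a 0x61 1 byte(s): 61
# é 0xe9 2 byte(s): c3a9
# € 0x20ac 3 byte(s): e282ac
# 😀 0x1f600 4 byte(s): f09f9880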

But, I hear you ask, how can decoders tell which code points take 2 bytes and which take 3, and so forth? Won't they get confused like we saw previously?

Well, UTF-8 uses a clever trick to solve this problem. When encoding a code point, it first checks its value. If it's less than 128 (0 - 127), the encoder creates a 1-byte container that begins with a 0 flag.

UTF-8 Container for code points between 0 and 127

0xxx xxxx
  • The leading 0 here is the flag that indicates to the UTF-8 decoder that the container is 1 byte in size.

  • The next xxx xxxx is the space within the container where the code point is actually 'stored'. For example, to encode 'a' in UTF-8:

0110 0001    :   Unicode code point for 'a' (97)

0xxx xxxx    :   UTF-8 container for code points between 0 and 127
*110 0001    :   Code point 97
---------
0110 0001    :   'a' Encoded with UTF-8
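
If you'd like to check this yourself, here is a minimal Python 3 sketch (nothing beyond the standard library is assumed):

# 'a' is code point 97, which fits in the 1-byte container shown above.
encoded = 'a'.encode('utf-8')
print(len(encoded))                 # 1, a single byte
print(format(encoded[0], '08b'))    # 01100001, i.e. 0110 0001
print(encoded[0] == ord('a'))       # True: the byte holds the code point itself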

Now, when a UTF-8 decoder encounters a byte, it first reads the leading bit, and if it finds a 0, it immediately knows this is a 1-byte container. It then reads the next 7 bits (xxx xxxx) to find out exactly which code point the container holds. Finally, it simply looks the value up in the Unicode map to output the corresponding character.

There's one interesting property of the UTF-8 encoding of Unicode characters in the range 0 - 127: the resulting UTF-8 bytes have the same value as the original Unicode code points, which are in turn the same as in ASCII. That is, 'a' encoded in UTF-8 has exactly the same value as its raw Unicode and ASCII code points. This property incidentally makes UTF-8 compatible with ASCII in the ASCII range.

0110 0001 : ASCII 'a'
0110 0001 : Unicode 'a'
0110 0001 : UTF-8 'a'
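
One quick way to convince yourself of that compatibility, again assuming Python 3:

# In the ASCII range these encodings all produce the exact same byte...
print('a'.encode('ascii'), 'a'.encode('latin-1'), 'a'.encode('utf-8'))     # b'a' b'a' b'a'
# ...and the same byte decodes to the same character under all of them.
print(b'a'.decode('ascii'), b'a'.decode('latin-1'), b'a'.decode('utf-8'))  # a a a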

Now for something more interesting, let's look at the UTF-8 encoding of code points above 127.

UTF-8 encodings of code points beyond 127

UTF-8 Container for code points between 128 and 2047 (0x80 - 0x7ff) :

110x xxxx  10xx xxxx

UTF-8 Container for code points between 2048 and 65,535 (0x800 - 0xffff) :

1110 xxxx  10xx xxxx  10xx xxxx

UTF-8 Container for code points between 65,536 and 2,097,151 (0x10000 - 0x1fffff) :

1111 0xxx  10xx xxxx  10xx xxxx  10xx xxxx

etc.

  • each container begins with a flag in the leading byte (leftmost byte): two or more 1s followed by a 0.
  • the 1s in the leading flag (110, 1110, etc.) indicate the size of the container in bytes, i.e. 11 for 2 bytes and 111 for 3 bytes. The 0 that follows simply separates the flag from the data.
  • each inner byte begins with a 10 flag.
  • as with the 1-byte containers seen earlier, the x's indicate where in the container the actual code point value is stored.
  • the largest code point value possible in each container is (2^number_of_x) - 1. For example, to encode 'é' in UTF-8:
          1110 1001    :   Unicode code point for 'é' (233)
110x xxxx 10xx xxxx    :   UTF-8 container for code points between 128 and 2047
***0 0011 **10 1001    :   realign the code point's bits to fit within the container
-------------------
1100 0011 1010 1001    :   'é' encoded in UTF-8 becomes 0xC3A9
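
Here is the same bit shuffling written out as a small Python 3 sketch. The variable names are mine, purely for illustration; this mirrors the diagram above rather than showing how a production encoder is written:

cp = ord('é')                       # 233, i.e. 0xE9
first  = 0b11000000 | (cp >> 6)     # 110x xxxx : top 5 bits of the code point
second = 0b10000000 | (cp & 0x3F)   # 10xx xxxx : low 6 bits of the code point
print(hex(first), hex(second))      # 0xc3 0xa9
print('é'.encode('utf-8').hex())    # c3a9, so the built-in encoder agrees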

When the UTF-8 decoder encounters such byte strings, it does the same thing as before. It reads the leading flag in the first byte. A 110 flag indicates a code point stored within a 2-byte container, so the decoder reads 2 bytes. From the first byte it extracts all the x's after the 110 flag, and from the second byte it does the same after the 10 flag.

1100 0011 1010 1001    :  UTF-8 Encoded code point
110x xxxx 10xx xxxx    :  UTF-8 Container
-------------------
***0 0011 **10 1001    :  Unicode code point 233

It then looks up the Unicode map to find out which character to display.
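
Reversing the process by hand looks like this in Python 3 (a sketch of the 2-byte case only, not a full decoder), with the built-in decoder shown for comparison:

data = bytes([0xC3, 0xA9])
cp = ((data[0] & 0b00011111) << 6) | (data[1] & 0b00111111)   # strip the flags, glue the x's together
print(cp, chr(cp))             # 233 é
print(data.decode('utf-8'))    # é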

As you can see, 'é' encoded in UTF-8 requires 2 bytes, whereas Latin-1 stores it in 1. Still, it's a good trade-off: the extra byte is only needed for characters above 127, and in exchange you get the convenience of a universal character set.

Hopefully, by now you grok Unicode and UTF-8. I have one last topic to go over.

Why am I sometimes seeing weird characters when I open a web page?

As we've seen, character encoding involves some encoding and some decoding. The decoding part is usually where problems arise. If you encode characters using one system and then try to decode them using another, chances are you'll run into a mismatch. We know that most encoding schemes are compatible for the first 128 characters (the ASCII range), so these characters usually display properly regardless of the chosen encoding. But let's take 'é' (Unicode 233, or 0xE9) as a case: encoded with UTF-8 and later read as Latin-1.

We've seen that after being encoded in UTF-8 'é' is represented as 0xC3A9.

1100 0011  1010 1001    :   'é' in UTF-8 is 0xC3A9

If you then try to decode that value with a Latin-1 decoder, you might have some unexpected results. Latin-1 only needs 1 byte to represent all of its characters and doesn't expect to go beyond one byte for a single character. But 0xC3A9 is 2 bytes long. To a Latin-1 decoder this looks like 2 code points, each 1 byte in length: 0xC3 and 0xA9.

1100 0011    :   0xC3 or 'Ã' in Latin-1
1010 1001    :   0xA9 or '©' in Latin-1

And that's exactly how it will read it. It'll first decode 0xC3 and display the corresponding Latin-1 value 'Ã', and then it'll decode 0xA9 and display '©'. So instead of 'é' you'll see 'Ã©'.
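
You can reproduce this mismatch in a couple of lines of Python 3 (purely as a demonstration):

utf8_bytes = 'é'.encode('utf-8')       # b'\xc3\xa9'
print(utf8_bytes.decode('latin-1'))    # Ã©  (one character per byte)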

Now for the scenario the other way around: you have 'é' encoded in Latin-1 as 0xE9 and you try to have a UTF-8 decoder read it:

1110 1001  :  0xE9

According to UTF-8, a byte starting with 1110 indicates a 3-byte container. So the UTF-8 decoder will read the leading byte and then expect the next 2 bytes to start with the inner-byte flag 10. Instead, there's a very strong probability that it will encounter some other pattern, which will cause a mismatch. Depending on the decoder's design, it might signal an error, attempt a desperate salvage of the problematic value with a replacement character, or just ignore it altogether and display nothing.
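
Here is that failure mode in Python 3, where the error handling you ask for decides which of those three behaviours you get ('strict', 'replace' and 'ignore' are standard Python error handlers):

latin1_bytes = 'é'.encode('latin-1')             # b'\xe9'
print(latin1_bytes.decode('utf-8', 'replace'))   # �  (replacement character)
print(latin1_bytes.decode('utf-8', 'ignore'))    #    (nothing at all)
latin1_bytes.decode('utf-8', 'strict')           # raises UnicodeDecodeError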

We've now concluded our four-part tutorial. I hope it has helped you understand Unicode and the encodings used to represent its characters in computing.

Thank you for reading.