A Gentle Introduction to Unicode and Encoding (Part 3)

In Part 2 we introduced Unicode and its code points. We continue now by exploring attempts at encoding Unicode code points for use in a computer.

Binary encoding of Unicode

I have a vague memory of mentioning in Part 2 that Unicode code points are just numbers that map to characters. That still holds true, but as tempting as it is to write a Unicode-encoded mail message with paper and pen, it's slightly more practical and useful to do so with a computer. However, before your computer can read and write messages in Unicode code points, it also needs its very own Unicode map to encode and decode these messages. Remember that computers understand everything in binary, so we need a way to convert Unicode code points to binary. That conversion is what encoding is all about. Once your computer has its own binary Unicode map, it can use it in much the same way you'd use your paper, pencil and Unicode map.
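(If you want to see the "code points are just numbers" idea for yourself, here's a minimal Python 3 sketch; the characters are arbitrary picks, nothing more.)

    # A minimal sketch: code points really are just numbers.
    # ord() gives a character's code point, chr() goes the other way,
    # and bin() shows what that number looks like to a computer.
    for ch in ("a", "é", "€"):
        cp = ord(ch)                              # the code point, as a plain int
        print(f"{ch!r} -> U+{cp:04X} = {cp} = {bin(cp)}")

    print(chr(0x20AC))                            # and back again: 0x20AC -> '€'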

Hold on tight here, because this is usually another corner where people get left in the dust with Unicode.

If I were to ask some smart folks, such as yourself, to devise a scheme to represent Unicode code points in binary format (ones and zeros), most would just roll their eyes and go: "Duh! You have a Unicode map full of these code points, and you've been drilling into our heads that they're just numbers that map to characters. Newsflash, dude! Binary is also just a number representation that computers can understand. So to write a message with code points, just write each code point in the document as its corresponding binary value. To read a message and display characters, you do the reverse: you take the code points in the message, find the matching values on the binary Unicode map and display the corresponding characters. That's how ASCII works, that's how Latin-1 works, so why don't you just do the same with Unicode? Double duh!"

Hmm, that actually sounds pretty straightforward and intuitive. Unicode code points are numbers, so simply put these numbers in binary format, exactly the way it's done in ASCII, Latin-1 and most other legacy encodings. OK, that's fine, let's give it a go.

So we manage to send our first Unicode-encoded message with this little scheme, and this is what our correspondent receives:

11011011  10010010  00010110  10001101  11011000

It does indeed represent some code points. But wait a minute, which ones? See, back when we used ASCII and other legacy encodings, we knew that each character was exactly 1 byte long (8 bits). One notable difference between Unicode and those character sets is that Unicode is huge. As I said in a previous article, it has room for over 1.1 million code points and is still expanding. That is, while some of these code points are small enough to fit in a single byte, others have a decimal value exceeding one million, a number which in binary requires 21 bits, or 3 full bytes (8 bits per byte). How would the decoder know which code point is only 1 byte long, which is 2, and which is 3?
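To see how wildly the sizes vary, here's a small Python sketch; the characters are just arbitrary examples picked from different parts of the Unicode range.

    # How many bits (and bytes) does each code point actually need?
    for ch in ("a", "é", "€", "😀"):
        cp = ord(ch)
        print(f"U+{cp:04X} needs {cp.bit_length()} bits, "
              f"i.e. {(cp.bit_length() + 7) // 8} byte(s)")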

So, going back to the string of bytes that the computer is trying to decode:

  • Was it 1 byte followed by 2 groups of 2 bytes?
  • Or was it 2 bytes, followed by 1 byte, then followed by 2 other bytes?
  • Could it be 3 bytes followed by 2 bytes?
  • Nah, it looks more like 3 bytes, followed by 1 byte and then another. Let's first try it like that and maybe we'll get lucky.
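Here's a rough sketch of that dilemma in Python: the same five bytes, grouped a few different ways, yield perfectly plausible but completely different sequences of numbers (the groupings are just the guesses from the list above).

    # The decoder's dilemma: same five bytes, different groupings,
    # different "code points" every time.
    data = bytes([0b11011011, 0b10010010, 0b00010110, 0b10001101, 0b11011000])

    groupings = [
        (1, 2, 2),   # 1 byte, then two groups of 2 bytes
        (2, 1, 2),   # 2 bytes, 1 byte, 2 bytes
        (3, 2),      # 3 bytes, then 2 bytes
        (3, 1, 1),   # 3 bytes, 1 byte, 1 byte
    ]

    for sizes in groupings:
        pos, code_points = 0, []
        for size in sizes:
            code_points.append(int.from_bytes(data[pos:pos + size], "big"))
            pos += size
        print(sizes, "->", [f"U+{cp:04X}" for cp in code_points])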

It's getting clear that we'll run into problems soon. There's no way to predict the size of a code point just by looking at the bits. They could be 1, 2 or 3 bytes each; there's no way to tell. The obvious, sensible solution in this situation is to agree on a "one-size-fits-all" approach, where all code points have the same size. That way the decoder knows exactly where each code point begins and where it ends. The space also needs to be large enough for the largest values to fit in, which in the case of Unicode means a minimum of 3 bytes. To illustrate, suppose the message above was in fact a 2-byte code point, followed by a single-byte code point, then another 2-byte code point. For the decoder to read them properly, we would have to pad each code point to the same size (3 bytes) before sending the message. Here's the padded version of our message:

00000000  11011011  10010010  (2-byte code point)
00000000  00000000  00010110  (1-byte code point)
00000000  10001101  11011000  (2-byte code point)
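With a fixed width agreed upon, decoding becomes trivial: just slice the stream every 3 bytes. Here's a minimal sketch of that idea in Python (a toy 3-byte scheme for illustration, not a real encoding):

    # Toy fixed-width decoder: every code point is exactly 3 bytes,
    # so the decoder simply slices the stream into 3-byte chunks.
    padded = bytes([0x00, 0xDB, 0x92,
                    0x00, 0x00, 0x16,
                    0x00, 0x8D, 0xD8])

    code_points = [int.from_bytes(padded[i:i + 3], "big")
                   for i in range(0, len(padded), 3)]
    print([f"U+{cp:04X}" for cp in code_points])   # no ambiguity left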

Some of the earliest Unicode encoding schemes actually worked with this approach. One scheme called UCS-2, which later evolved into UTF-16, used 2 bytes (16 bits) per character, with UTF-16 occasionally using 4 bytes in special circumstances¹. Another scheme, UCS-4, essentially what we now call UTF-32, settled on simply using 4 bytes (32 bits) for all characters.
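If you're curious, Python's built-in codecs let you watch both schemes at work. A quick sketch (using the big-endian variants so the byte order is predictable, and Python 3.8+ for the hex separator):

    # UTF-16 vs UTF-32, via Python's built-in codecs.
    # Note how the emoji needs 4 bytes even in UTF-16 (the "special circumstances").
    for ch in ("a", "é", "😀"):
        print(f"{ch!r}: utf-16-be = {ch.encode('utf-16-be').hex(' ')}, "
              f"utf-32-be = {ch.encode('utf-32-be').hex(' ')}")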

It all seemed fine, but padding code points with empty bytes was considered wasteful, especially at a time when computer resources needed to be aggressively managed at the software level. It also introduced some efficiency issues with certain processor architectures. On top of that, it just looked ugly to folks who spend 99% of their time working with characters that only require 1 byte in legacy encodings (ASCII, Latin-1, etc.):

                                 0110 0001   :   'a' in ascii
                                 0110 0001   :   'a' in Latin-1
                                 0110 0001   :   'a' in Unicode code point
                      0000 0000  0110 0001   :   'a' in Unicode encoded with UCS2
0000 0000  0000 0000  0000 0000  0110 0001   :   'a' in Unicode encoded with UCS4

                                 1110 1001   :   'é' in Latin-1
                                 1110 1001   :   'é' in Unicode code point
                      0000 0000  1110 1001   :   'é' in Unicode encoded with UCS2
0000 0000  0000 0000  0000 0000  1110 1001   :   'é' in Unicode encoded with UCS4
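To put a rough number on it, here's a quick comparison sketch (the sample sentence is just an arbitrary example): for plain ASCII-range text, UCS-2 doubles the size and UCS-4 quadruples it.

    # How much bigger does a plain English sentence get under each scheme?
    text = "Hello, world. Plain old English text."
    for codec in ("latin-1", "utf-16-be", "utf-32-be"):
        print(f"{codec:10s}: {len(text.encode(codec))} bytes")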

What a waste indeed. Plus, dealing with byte order across different machines gave birth to some nasty stuff such as the Byte Order Mark (BOM), which complicated the whole thing even more, and many people just went "Nah, let's go back to Latin-1" or whatever else they were using. So UCS-2 and UCS-4 pretty much bombed.
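For the curious, here's roughly what the byte-order problem looks like in Python: the same two bytes mean different characters depending on which end you read first, which is exactly why the BOM exists.

    import codecs

    # Why the BOM exists: the same two bytes, read with opposite byte orders,
    # decode to completely different characters.
    two_bytes = b"\x00\x61"
    print(two_bytes.decode("utf-16-be"))   # 'a'  (U+0061)
    print(two_bytes.decode("utf-16-le"))   # '愀' (U+6100, a CJK ideograph)

    # The BOM is just a marker placed at the start of the data
    # to say which byte order was used.
    print(codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)   # b'\xfe\xff'  b'\xff\xfe'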

Recap:

- Two early approaches to encoding Unicode are worth mentioning: UCS-2/UTF-16 and UCS-4/UTF-32.
- Both were wasteful and inefficient (i.e. they sucked).

In Part 4 of this series I'll cover UTF-8 and how it saved Unicode.

Thank you for reading.


  1. UCS-2, using only 2 bytes, can't map more than 65,536 characters (2^16). To get beyond that limitation, UTF-16 uses a trick called surrogates. Unicode reserves a range of 2,048 code points within those first 65,536 that will never be assigned to characters. By splitting that range in two halves and pairing a code point from the first half (2 bytes) with one from the second half (2 more bytes), you get 1,024 × 1,024 = 1,048,576 additional code points, each taking 4 bytes.
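To make the surrogate trick concrete, here's a short Python sketch that splits a code point above 65,535 into its two surrogate halves; the arithmetic follows the standard UTF-16 formula, and the example code point is just one arbitrary emoji.

    # Splitting a code point above 0xFFFF into a UTF-16 surrogate pair.
    cp = 0x1F600                      # U+1F600 (😀), well beyond the 16-bit limit
    offset = cp - 0x10000             # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)    # top 10 bits    -> first surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> second surrogate
    print(f"U+{cp:04X} -> U+{high:04X} U+{low:04X}")

    # Cross-check against Python's own UTF-16 encoder (big-endian, Python 3.8+).
    print(chr(cp).encode("utf-16-be").hex(" "))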