A Gentle Introduction to Unicode and Encoding (Part 2)

In Part 1 I went over the binary and hexadecimal numeral systems, gave a brief explanation of the ASCII character set, and covered the reasons behind extended encodings.

Recap of Part 1:

  • ASCII was designed with 128 character positions (0 - 127)
  • the last position being at 0111 1111 (127).
  • ASCII requires only 1 byte, as its characters go from 0000 0000 to 0111 1111.
  • ASCII didn't include non-English characters.
  • non-English speakers needed them.
  • to extend ASCII, the unused leading bit in the byte was turned on (1000 0000).
  • this provided 128 more positions for new characters (1000 0000 to 1111 1111).
  • however, not everyone agreed on which character should go where.
  • so ensued various incompatible implementations from position 128 onward.
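
To see the incompatibility from the last two points in practice, here is a small Python sketch. It decodes the same single byte (233, or 1110 1001) with three legacy character sets that ship with Python. The choice of these three is mine, purely for illustration.

```python
# One byte above 127, interpreted by three legacy character sets.
raw = bytes([0b11101001])  # 233

as_latin1 = raw.decode("latin-1")    # Western European (ISO-8859-1)
as_cp437 = raw.decode("cp437")       # original IBM PC code page
as_greek = raw.decode("iso8859_7")   # Greek (ISO-8859-7)

for name, char in [("latin-1", as_latin1), ("cp437", as_cp437), ("iso8859_7", as_greek)]:
    print(name, "->", char)
```

Same byte, three different characters, depending on which table the reader happens to use.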

It was becoming clear that having many incompatible character sets would cause problems in the long run. An attempt to solve this was to create a universal character set, a map of all possible symbols and characters. It would be called Unicode.

Unicode Code Points

At the time of writing, Unicode has room for over 1.1 million code points (0 to 1,114,111), of which only a fraction have actually been assigned to characters. Each character is assigned a specific and unique number, just like in ASCII and other legacy encodings. Those numbers are called code points.

Let's forget about computers for a moment and simply concentrate on Unicode and its code points. If I wrote a sentence using only Unicode code points, you could transcribe it into letters and symbols just by matching these code points to their assigned characters.

Before we proceed, there is a subtle nuance that most articles on character sets and Unicode either don't emphasize enough or bury under so much detail that it confuses the reader. As a result, the foundations needed to understand the general idea remain brittle. In this article I'll press on it to the point of annoyance; you can thank me later. So pay attention, here it goes:

Unicode code points are just numbers that map to characters.

Repeat this in your head 5 times and then some. Did I mention anything about computers? I think not. Again:

Unicode code points are just numbers that map to characters.

In my opinion, one of the things that confuses people the most about Unicode is the premature connection made between Unicode and encodings. I believe the reason is that code points are most often written in hexadecimal. They don't need to be; that's just a convention.
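
If it helps, here is a quick Python sketch showing that the notation doesn't matter. The same code point is written in decimal, hexadecimal, and binary; the number behind it, and the character it maps to, stay the same. The character 'é' is just my pick for the example.

```python
# Three ways of writing the same number.
decimal = 233
hexadecimal = 0xE9
binary = 0b11101001

same_number = decimal == hexadecimal == binary

# chr() looks up the character assigned to a code point.
char = chr(decimal)

# The "U+00E9" style seen in Unicode charts is the same number formatted in hex.
label = f"U+{decimal:04X}"

print(same_number, char, label)
```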

Unicode code points are just numbers that map to characters.

I didn't say hexadecimal numbers, I said numbers. Also, notice how I didn't mention encoding in that sentence? I haven't forgotten anything. As far as Unicode is concerned, there's no computer and no encoding. The map really is that straightforward.

Code point -> character

Simple as that. No need to complicate this explanation by mentioning how characters will be represented in a computer, which brings a whole other set of complexities. Think of Unicode as Morse code, but with numbers instead of dots and dashes. If you want to send a message in Unicode, you grab the Unicode map, a pen and some paper, and write down those numbers instead of characters. To decode your message, I grab my Unicode map and do the translation the other way around. That's it. Unicode code points know nothing about computers, or hex, or encodings.
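
The pen-and-paper exercise can be sketched in a few lines of Python, where the built-in ord() and chr() play the role of the Unicode map. The helper names to_code_points and from_code_points are made up for this example.

```python
def to_code_points(message):
    """Write a message down as a list of code points (plain integers)."""
    return [ord(char) for char in message]


def from_code_points(code_points):
    """Translate a list of code points back into characters."""
    return "".join(chr(number) for number in code_points)


numbers = to_code_points("Hi!")
print(numbers)                    # [72, 105, 33]
print(from_code_points(numbers))  # Hi!
```

Notice that no bytes and no encodings appear anywhere in this sketch: it only deals with integers and the characters they map to.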

Now you hopefully have a clue that Unicode code points are just numbers that map to characters. Let's continue.

To make the transition from ASCII to Unicode seamless, it was decided that ASCII characters would have the same code point value in Unicode. That is, 'a', which is 97 in ASCII, is also 97 in Unicode, '1' is still 49, and so forth. The Unicode folks even went as far as making their code points compatible with the then-popular ISO-8859-1 (Latin-1) extended code points (128 - 255). That is, 'é', which was 233 in Latin-1, is also 233 in Unicode.
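
A quick way to check this compatibility, sketched in Python: ord() returns a character's Unicode code point, and decoding a single byte as ASCII or Latin-1 should give back the character with that same number. The loops over every value are my own sanity check, not something taken from the Unicode documentation.

```python
# Code points quoted above.
assert ord("a") == 97
assert ord("1") == 49
assert ord("é") == 233

# Every ASCII byte decodes to the character with the same code point.
ascii_matches = all(
    ord(bytes([n]).decode("ascii")) == n for n in range(128)
)

# The same holds for every Latin-1 byte, 0 through 255.
latin1_matches = all(
    ord(bytes([n]).decode("latin-1")) == n for n in range(256)
)

print(ascii_matches, latin1_matches)  # True True
```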

There's more to Unicode, but nothing that needs to be covered here to understand the big picture. By the end of this series you should be able to explore its darker corners with much less apprehension.

Let's recap Unicode:

  • code points -> characters
  • code points 0 - 127 are the same as ASCII's
  • code points 128 - 255 are the same as Latin-1

In Part 3 we discuss some Unicode encoding attempts. Thank you for reading.