A Gentle Introduction to Unicode and Encoding (Part 1)

This series of articles aims to explain Unicode, UTF-8, legacy encodings, and how they all relate to each other. I make few assumptions about the reader's prior understanding of computing concepts; that is, I'll try to start from what I perceive to be good basics for introducing the topic [1]. I will therefore sometimes linger on details that I believe are key to building a solid foundation for grasping the subject.

In this first lesson we brush over the decimal, binary and hexadecimal numeral systems, and we talk about ASCII and its shortcomings. This will be a ramp-up for some and a reminder for others, but I feel it makes a solid base upon which to develop further explanations.

The decimal, binary and hexadecimal numeral systems

It is speculated that humans adopted ten symbols to represent numbers because they have ten fingers. Although we still use that system to develop the bulk of our understanding of maths, science and nature, we could have done it with other numeral systems using two, three, twenty, hundreds or billions of symbols, each symbol representing a different "quantity" or "value". Working with ten symbols representing ten incrementing quantities, starting from a zero value, is called working in Base-10 (base-ten) or decimal.

To increment values in decimal, you start from 0, go all the way up to 9, then carry a value over to the left and start over:

0,    1,   2,   3,     4, ... ,    8,     9,    10,  11, ... ,     19, ... 
zero, one, two, three, four, ... , eight, nine, ten, eleven, ... , nineteen, ...

The common modern computer, as currently designed, only understands electricity. Like most electronic devices, it can really only detect the presence or absence of an electrical charge. If we were to attribute a quantitative value to each of these states, we'd effectively have a "null" or "zero" value to represent the absence of electricity and a "one" value to symbolize its presence. And if we wanted a computer to count using only those two states, we would need to perform these operations with a numeral system that best matches such limitations. The binary numeral system, also known as Base-2, uses only two symbols, 0 and 1. A perfect fit.

To increment quantities in binary, you start from 0 like in decimal, but you go up to 1 before carrying a value over to the left and start over:

0,    1,   10,  11,    100,  101,  110, ... , 111111,      1000000, 1000001, 1000010, ...
zero, one, two, three, four, five, six, ... , sixty-three, ...

The quantity "twenty three" expressed in decimal would be: 23 and in binary 10111.

You can see that as useful as binary is for electronics, it can quickly become tedious to work with for humans.
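If you'd like to check such conversions yourself, here is a minimal sketch in Python 3 (my pick for these examples; any language with base-conversion helpers would do), typed straight into an interactive interpreter:

    bin(23)          # '0b10111' -- decimal to binary, returned as a string prefixed with 0b
    int('10111', 2)  # 23        -- binary string back to decimal, with the base given explicitly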

To make things easier, we need a way to convert decimal numbers to binary and vice versa. But that, again, isn't straightforward. Fortunately, it's pretty simple to go back and forth between binary and hexadecimal (base-16), and going between decimal (base-10) and hexadecimal is manageable too. Hexadecimal was thus found to be a good compromise between base-10 and base-2, and people who commonly need to perform binary maths have learned to work in hex directly. It's not decimal, but it's easier on the eyes and the brain than binary.

Hexadecimal numbers are written using sixteen symbols: 0-9 for the first ten, and past those, A-F for the quantities ten to fifteen. In computing, hex numbers are generally written with a leading 0 followed by an x, e.g. 0x21, which tells us this isn't a decimal 21 (twenty-one) but rather hex 21 (thirty-three), a completely different value.

 decimal |  binary | hexadecimal 
---------|---------|-------------
    0    |      0  |     0x0     
    1    |      1  |     0x1     
    2    |     10  |     0x2     
    3    |     11  |     0x3     
    4    |    100  |     0x4     
  ...    |    ...  |     ...     
   10    |   1010  |     0xa     
   11    |   1011  |     0xb     
   12    |   1100  |     0xc     
  ...    |   ...   |     ...     
   15    |   1111  |     0xf     
   16    |  10000  |    0x10     
   17    |  10001  |    0x11     
   18    |  10010  |    0x12     
   19    |  10011  |    0x13     
   20    |  10100  |    0x14     
   21    |  10101  |    0x15     
  ...    |    ...  |     ...     
   29    |  11101  |    0x1d     
   30    |  11110  |    0x1e     
   31    |  11111  |    0x1f     
   32    | 100000  |    0x20     
   33    | 100001  |    0x21     
  ...    |    ...  |     ...
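As a quick, non-authoritative way to double-check the 0x notation and the table above, the same Python 3 interpreter happily mixes the three bases:

    0x21           # 33 -- hex literals are understood natively
    hex(33)        # '0x21'
    int('1f', 16)  # 31 -- parse a hex string by giving the base explicitly
    bin(0x20)      # '0b100000'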

Converting between decimal and hex is not going to be very relevant to our topic. Being able to convert between binary and hex, on the other hand, might be more helpful, and it is fortunately also rather easy. Keeping in mind that the hex numeral system spans 0 - F, which is the equivalent in binary of 0 - 1111, to convert a binary number to its hex equivalent, simply gather the binary digits in groups of 4 starting from the right. Pad the leftmost group with 0s to the left if you want, then convert each group to its hex equivalent:

10011        =       0001 0011 = 0x13
10111111     =       1011 1111 = 0xbf
1010011010   =  0010 1001 1010 = 0x29a

To convert the other way around, you do the reverse: replace each hex digit with its 4-bit binary equivalent.

0xa8    = 1010 1000
0xc1    = 1100 0001
0x92d   = 1001 0010 1101
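The group-by-four trick translates almost literally into code. Below is an illustrative Python 3 helper; the name nibble_groups is mine, not a standard function:

    def nibble_groups(binary_string):
        # Pad the string on the left so its length is a multiple of 4,
        # then split it into groups of 4 bits.
        padded = binary_string.zfill((len(binary_string) + 3) // 4 * 4)
        return [padded[i:i+4] for i in range(0, len(padded), 4)]

    nibble_groups('1010011010')   # ['0010', '1001', '1010']
    hex(int('1010011010', 2))     # '0x29a' -- same result via the built-in conversions
    bin(int('92d', 16))           # '0b100100101101' -- and the other way around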

Bits and bytes

Computers process data in 8-bit packets called bytes (a.k.a. octets):

                        1010 1111 = 0xaf     (1 byte)
            0001 0011   1010 1111 = 0x13af   (2 bytes)
0101 1010   1011 1010   1001 0010 = 0x5aba92 (3 bytes)
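Python 3's bytes type models these packets directly; here is a small sketch using the last value above:

    data = bytes.fromhex('5aba92')  # the 3-byte sequence 0x5a 0xba 0x92
    len(data)                       # 3 -- three bytes, i.e. twenty-four bits
    data[0]                         # 90 -- the first byte, 0x5a, read as a number
    data.hex()                      # '5aba92'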

Character sets: a simple explanation

Computers don't actually understand characters. In reality, characters, whether in a document, on the command line, or in a browser window, are represented as sequences of bytes (i.e. batches of 8-bit packets). To display them on screen, a decoder processes the bit stream and maps each number it identifies to yet another type of data containing instructions for a graphical device, whose purpose is to draw shapes on screen (in this case, the characters). This is the big picture, and an overly simplistic one, but for our purpose it'll do. In a nutshell, characters are stored as numbers that are mapped to instructions for graphical devices, which draw shapes on screen for our benefit.
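You can peek at this character-to-number mapping from Python 3 with the built-ins ord() and chr(); a minimal sketch, nothing more:

    ord('A')   # 65  -- the number stored for the character 'A'
    chr(65)    # 'A' -- and from the number back to the character
    chr(33)    # '!'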

ASCII

ASCII was an effort to create such a character map: 128 (one hundred and twenty-eight) characters to be exact, numbered 0 to 127. The map includes upper and lower case English letters, digits, punctuation, some symbols, whitespace and formatting characters (spaces, tabs, new lines, etc.), as well as some unprintable control characters. Here are some examples of ASCII characters with their decimal, binary and hexadecimal representations:

    Char :   Dec  :   Binary      :   Hex
   ----------------------------------------
    'A'  :   65   :   0100 0001   :   0x41
    'a'  :   97   :   0110 0001   :   0x61
    'b'  :   98   :   0110 0010   :   0x62
    '1'  :   49   :   0011 0001   :   0x31
    '!'  :   33   :   0010 0001   :   0x21
    '~'  :   126  :   0111 1110   :   0x7e

Note that if ~ is at position 126, it means that it's the 127th character (the first character on the map is at position 0). Therefore the 128th and last ASCII character (an unprintable one) is at 127, or 0111 1111 in binary, or 0x7f in hex. ASCII doesn't include accented characters, nor other non-English letters and symbols.
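A short Python 3 sketch of those limits (the error shown is the exception type CPython raises; the exact message may vary):

    'Abc!'.encode('ascii')  # b'Abc!' -- every character fits in 7 bits
    max(b'Abc!')            # 99, comfortably below 127
    'café'.encode('ascii')  # raises UnicodeEncodeError -- 'é' has no place in the 128-character map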

Extended Character Sets: Latin-1 and others

As mentioned previously, the last ASCII character is at binary position 0111 1111. You probably noticed the unused leading bit, yes? Well, so did a few folks who really wanted characters such as ç, ñ, ß and other glyphs foreign to the English language. They proceeded to extend the ASCII set for their own purposes by switching that leading bit on, which gave them an additional range of 128 new possible characters (1000 0000 to 1111 1111). Unfortunately, not everyone agreed on which character should go where, and so each came up with their own implementation. Thus began a plethora of different character sets that had everything in common from position 0 to 127 (ASCII) and then diverged from 128 to 255 (some would eventually overlap on some subsets). A couple of them, such as ISO-8859-1, also known as Latin-1, are still very much in use today.
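To see that divergence beyond position 127 in practice, here is a hedged Python 3 sketch comparing Latin-1 with cp1251, a Cyrillic legacy code page picked purely as an example:

    'ç'.encode('latin-1')      # b'\xe7' -- ç sits at position 231 (0xe7) in Latin-1
    b'\xe7'.decode('latin-1')  # 'ç'
    b'\xe7'.decode('cp1251')   # 'з' -- the very same byte names a different character elsewhere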

To recap:

  • ASCII was designed with 128 character positions (0 - 127)
  • the last position being at 0111 1111 (127).
  • ASCII requires only 1 byte as characters go from 0000 0000 to 0111 1111.
  • ASCII didn't include non-english characters.
  • non-english speakers needed them.
  • to extend ASCII the unused leading bit in the byte was turned on (1000 0000)
  • it provided 128 more positions for new characters (1000 0000 to 1111 1111).
  • though, not everyone agreed on which character should go where.
  • so ensued various incompatible implementations from position 128 onward.

Part 2 of the series introduces Unicode. Thank you for reading.


  1. If you feel that the basics, as I approach them here, are still above your understanding, please let me know by writing to michael [at] this domain. I don't guarantee that I'll be able to make things clearer, but I'll certainly try.