This series of articles aims to explain Unicode, UTF-8 and other legacy encodings and how they relate to each other. I make few assumptions about the reader's prior understanding of computing concepts. That is, I'll try to start from what I perceive to be good basics to introduce the topic^{1}. I will therefore sometimes linger on details that I believe are key to building a solid foundation for grasping the subject.
In this first lesson we brush up on the decimal, binary and hexadecimal numeral systems, and we talk about ASCII and its shortcomings. This will be a ramp-up for some and a refresher for others, but I feel it makes a solid base upon which to develop further explanations.
The decimal, binary and hexadecimal numeral systems
It is speculated that humans adopted ten symbols to represent numbers because they have ten fingers. Although we still use that system to develop the bulk of our understanding of maths, science and nature, we could have done it with another numeral system using two, three, twenty, hundreds or billions of symbols, each representing a different "quantity" or "value". Working with ten symbols representing ten incrementing quantities starting from a zero value is called working in base 10 (base ten) or decimal.
To increment values in decimal, you start from 0 and go all the way up to 9; then you carry a value over to the left and start over:
0, 1, 2, 3, 4, ... , 8, 9, 10, 11, ... , 19, ...
zero, one, two, three, four, ... , eight, nine, ten, eleven, ... , nineteen, ...
The common modern computer as currently designed only understands electricity. Like most electronic devices it can really only detect the presence or absence of an electrical charge. If we were to attribute a quantitative value to each of these states, we'd effectively only have a "null" or "zero" value to represent the absence of electricity and a "one" value to symbolize its presence. And if we wanted a computer to count using only those two states, we would also need to perform these operations with a numeral system that best matches such limitations. The binary numeral system, also known as base 2, only uses two symbols, 0 and 1. A perfect fit.
To increment quantities in binary, you start from 0 like in decimal, but you only go up to 1 before carrying a value over to the left and starting over:
0, 1, 10, 11, 100, 101, 110, ... , 111111, 1000000, 1000001, 1000010, ...
zero, one, two, three, four, five, six, ... , sixty-three, ...
The quantity "twenty-three" expressed in decimal would be 23, and in binary 10111.
You can see that as useful as binary is for electronics, it can quickly become tedious to work with for humans.
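Python's built-in conversion functions can illustrate this; a quick sketch:

```python
# int() with an explicit base parses a binary string into a number,
# and bin() renders a number back as a binary string.
n = int("10111", 2)   # parse "10111" as a base-2 number
print(n)              # 23

print(bin(23))        # '0b10111' (the '0b' prefix marks a binary literal)
```

The `0b` prefix is just Python's notation for binary literals, analogous to the `0x` prefix for hexadecimal that we'll see shortly.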
To make things easier, we need a way to convert decimal operations to binary and vice-versa. But that again isn't straightforward. Fortunately, it's pretty simple to go back and forth between binary and hexadecimal (base 16), and so is going between decimal (base 10) and hexadecimal. Thus hexadecimal was found to be a good compromise between base 10 and base 2, and people who commonly need to perform binary maths have learned to work in hex directly. It's not decimal, but it's easier on the eyes and the brain than binary.
Hexadecimal numbers are represented using sixteen symbols: 0-9 for the first ten, and past those, A-F for quantities ten to fifteen. In computing, hex numbers are generally written with a leading 0 followed by an x, e.g. 0x21 tells us this isn't decimal 21 (twenty-one) but rather hex 21 (thirty-three), a completely different value.
decimal  binary  hexadecimal
-------  ------  -----------
0  0  0x0
1  1  0x1
2  10  0x2
3  11  0x3
4  100  0x4
...  ...  ...
10  1010  0xa
11  1011  0xb
12  1100  0xc
...  ...  ...
15  1111  0xf
16  10000  0x10
17  10001  0x11
18  10010  0x12
19  10011  0x13
20  10100  0x14
21  10101  0x15
...  ...  ...
29  11101  0x1d
30  11110  0x1e
31  11111  0x1f
32  100000  0x20
33  100001  0x21
...  ...  ...
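The table above can be spot-checked with Python's built-in `hex()` and `int()`; a small sketch:

```python
# hex() renders a number in base 16 with the conventional '0x' prefix;
# int() with base 16 parses such a string back into a number.
print(hex(33))          # '0x21' -- the hex 21 from earlier, i.e. decimal 33
print(int("0x1f", 16))  # 31, matching the table row for binary 11111
print(hex(30))          # '0x1e'
```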
Converting between decimal and hex is not going to be very relevant to our topic. Being able to convert between binary and hex, on the other hand, might be more helpful and is fortunately also rather easy. Keeping in mind that the hex numeral system spans 0-F, which is the equivalent in binary of 0-1111, to convert a binary number to its hex equivalent, simply gather the binary digits in groups of 4 starting from the right. Pad the leftmost group with 0s to the left if you want, then simply convert each group to its hex equivalent:
10011 = 0001 0011 = 0x13
10111111 = 1011 1111 = 0xbf
1010011010 = 0010 1001 1010 = 0x29a
To convert the other way around, you do the reverse: you replace each hex digit with its binary equivalent.
0xa8 = 1010 1000
0xc1 = 1100 0001
0x92d = 1001 0010 1101
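Both directions can be written out as short Python functions that follow the grouping procedure literally. These helper names (`bin_to_hex`, `hex_to_bin`) are my own, purely for illustration:

```python
def bin_to_hex(bits: str) -> str:
    """Group binary digits 4 at a time from the right; each group is one hex digit."""
    # Pad on the left so the length is a multiple of 4.
    bits = bits.zfill((len(bits) + 3) // 4 * 4)
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "0x" + "".join("0123456789abcdef"[int(g, 2)] for g in groups)

def hex_to_bin(hex_str: str) -> str:
    """The reverse: expand each hex digit into its 4-bit binary group."""
    digits = hex_str.removeprefix("0x")
    return " ".join(format(int(d, 16), "04b") for d in digits)

print(bin_to_hex("10011"))       # 0x13
print(bin_to_hex("1010011010"))  # 0x29a
print(hex_to_bin("0xc1"))        # 1100 0001
```

Of course `hex(int(bits, 2))` would do the same job in one line; the point here is to mirror the by-hand grouping described above.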
Bits and bytes
Computers process data in 8-bit packets called bytes (aka octets):
1010 1111 = 0xaf (1 byte)
0001 0011 1010 1111 = 0x13af (2 bytes)
0101 1010 1011 1010 1001 0010 = 0x5aba92 (3 bytes)
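Python's `int.to_bytes()` makes this byte packaging visible; a quick sketch:

```python
# to_bytes() splits a number into its constituent 8-bit bytes
# ("big" means most significant byte first).
data = (0x13af).to_bytes(2, "big")
print(len(data))               # 2 -- two bytes, as above
print([hex(b) for b in data])  # ['0x13', '0xaf']

data3 = (0x5aba92).to_bytes(3, "big")
print([hex(b) for b in data3])  # ['0x5a', '0xba', '0x92']
```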
Character sets: a simple explanation
Computers don't actually understand characters. In reality, characters, whether from a document, on the command line, or in a browser window, are represented as strings of bytes (i.e. batches of 8-bit packets). To display them on screen, a decoder processes the bit stream and maps each number it identifies to yet another type of data, containing additional instructions relevant to a graphical device, whose purpose is to draw shapes on screen (in this case the characters). This is the big picture and an overly simplistic one, but I think for our purpose it'll do. In a nutshell, characters are stored as numbers that are mapped to instructions for graphical devices that draw shapes on screen for our benefit.
ASCII
ASCII was an effort to create such a character map: 128 (one hundred and twenty-eight) characters to be exact, numbered 0 to 127. The map includes upper- and lower-case English letters, numbers, punctuation, some symbols, and white-space and formatting characters (spaces, tabs, new lines, etc.), as well as some unprintable characters. Here are some examples of ASCII characters and their decimal, binary and hexadecimal representations:
Char : Dec : Binary : Hex
---- : --- : ------ : ---
'A' : 65 : 0100 0001 : 0x41
'a' : 97 : 0110 0001 : 0x61
'b' : 98 : 0110 0010 : 0x62
'1' : 49 : 0011 0001 : 0x31
'!' : 33 : 0010 0001 : 0x21
'~' : 126 : 0111 1110 : 0x7e
Note that if ~ is at position 126, it means that it's the 127th character (the first character on the map is at position 0). Therefore the 128th and last ASCII character (an unprintable one) is at 127, or 0111 1111 in binary, or 0x7f in hex. ASCII doesn't include accented characters, nor other non-English letters and symbols.
Extended Character Sets: Latin1 and others
As mentioned previously, the last ASCII character is at binary position 0111 1111. You probably noticed the unused leading bit, yes? Well, so did a few folks who really wanted characters such as ç, ñ, ß and other glyphs foreign to the English language. They proceeded to extend the ASCII set for their own purposes by switching that leading bit on, which gave them an additional range of 128 new possible characters (1000 0000 to 1111 1111). Unfortunately, not everyone agreed as to which character should go where, and so each came up with their own implementation. Thus began a plethora of different character sets that had everything in common from position 0 to 127 (ASCII) and then diverged from 128 to 255 (some would eventually overlap on some subsets). A couple of them, such as ISO-8859-1, also known as Latin-1, are still very much in use today.
To recap:
- ASCII was designed with 128 character positions (0-127),
- the last position being at 0111 1111 (127).
- ASCII requires only 1 byte as characters go from 0000 0000 to 0111 1111.
- ASCII didn't include non-English characters,
- and non-English speakers needed them.
- To extend ASCII, the unused leading bit in the byte was turned on (1000 0000),
- which provided 128 more positions for new characters (1000 0000 to 1111 1111).
- However, not everyone agreed on which character should go where,
- so various incompatible implementations ensued from position 128 onward.
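Python's codecs make the extension visible. In Latin-1, ç lives at 0xe7 (decimal 231), squarely in the extended 128-255 range with the leading bit switched on, while ASCII simply has no slot for it:

```python
# ASCII characters encode to the same single byte in both encodings.
print("n".encode("ascii"))     # b'n'
print("ç".encode("latin-1"))   # b'\xe7' -- 0xe7 = 231, leading bit is on

# But ASCII itself cannot represent ç at all:
try:
    "ç".encode("ascii")
except UnicodeEncodeError:
    print("ç is not representable in ASCII")
```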
Part 2 of the series introduces Unicode. Thank you for reading.

If you feel that the basics, as I approach them, are still above your understanding, please let me know by writing to michael [at] this domain. I don't guarantee that I'll be able to make things clearer, but I'll certainly try. ↩