This tutorial covers Unicode and its UTF-8 encoding. It starts from scratch by explaining bits and binary structures, continues with the ASCII, extended ASCII and Unicode character sets, and ends with some conclusions on how to use UTF-8 en-/decoding with Flash & PHP.
What are bytes and bits?
Bit means “binary digit” and is the smallest unit of computerized data. A bit is a base-2 digit, i.e. it has either the value 0 or 1.
A byte is a small unit of memory, a group of bits, originally variable in size but now almost always eight bits. This makes 2^8 or 256 possible values for a byte.
byte = | bit 1 | bit 2 | bit 3 | bit 4 | bit 5 | bit 6 | bit 7 | bit 8 |
       |  1/0  |  1/0  |  1/0  |  1/0  |  1/0  |  1/0  |  1/0  |  1/0  |
Some example bytes could be 00000001 or 11111111 or 01010011.
Now how can we calculate the decimal value of such a binary-encoded byte? What we need is a conversion from base 2 to base 10.
Every 1 or 0 of these binary values is associated with a power of 2. For 8 bits it looks like the following:
byte = | bit 1     | bit 2    | bit 3    | bit 4    | bit 5   | bit 6   | bit 7   | bit 8   |
       | 128 (2^7) | 64 (2^6) | 32 (2^5) | 16 (2^4) | 8 (2^3) | 4 (2^2) | 2 (2^1) | 1 (2^0) |
       | 1/0       | 1/0      | 1/0      | 1/0      | 1/0     | 1/0     | 1/0     | 1/0     |
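These place values can also be generated programmatically; a minimal sketch in Python (the language is my choice for illustration, the tutorial itself targets Flash & PHP later on):

```python
# Place value (power of 2) for each of the 8 bit positions,
# from the most significant bit (2^7) down to the least (2^0).
weights = [2 ** p for p in range(7, -1, -1)]
print(weights)  # → [128, 64, 32, 16, 8, 4, 2, 1]
```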
The calculation of the decimal equivalent of the binary value 00000001:
byte = | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
       |  0  |  0 |  0 |  0 | 0 | 0 | 0 | 1 |  = 1
The calculation of the decimal equivalent of the binary value 11111111:
byte = | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
       |  1  |  1 |  1 |  1 | 1 | 1 | 1 | 1 |  = 128+64+32+16+8+4+2+1 = 255
The calculation of the decimal equivalent of the binary value 01010011:
byte = | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
       |  0  |  1 |  0  |  1 | 0 | 0 | 1 | 1 |  = 64+16+2+1 = 83
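The three conversions above can be reproduced in code. Here is a small Python sketch (names are my own, for illustration) that sums the place values of the set bits, exactly as done by hand above:

```python
def binary_to_decimal(bits: str) -> int:
    """Convert a string of 0s and 1s to its decimal value by
    summing the place value (power of 2) of each set bit."""
    total = 0
    for position, bit in enumerate(bits):
        if bit == "1":
            # The leftmost bit has the highest place value.
            total += 2 ** (len(bits) - 1 - position)
    return total

print(binary_to_decimal("00000001"))  # → 1
print(binary_to_decimal("11111111"))  # → 255
print(binary_to_decimal("01010011"))  # → 83
```

In practice Python's built-in `int("01010011", 2)` performs the same base-2 to base-10 conversion.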
What is ASCII?
ASCII stands for American Standard Code for Information Interchange and is a standard for assigning numerical values to the set of letters in the Roman alphabet and typographic characters. The ASCII character set can be represented by 7 bits. This makes 2^7 or 128 different values, i.e. characters.
As ASCII uses only 7 of the 8 bits available in a byte, the first (most significant) bit is always 0: 0xxxxxxx.
Below is a table of decimal values, their binary representations and the character assigned to each value by the ASCII standard. The first 32 characters are control characters.
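The printable part of that table can be generated in a few lines of Python; since Unicode assigns the same values as ASCII to the first 128 characters, Python's built-in `chr` gives the correct character for each code:

```python
# Print the printable ASCII range (32–126) with decimal value,
# 7-bit binary representation, and the character itself.
for code in range(32, 127):
    print(f"{code:3d}  {code:07b}  {chr(code)}")
```

For example, code 65 prints as `65  1000001  A`.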
Source : http://www.zehnet.de/2005/02/12/unicode-utf-8-tutorial/