Unicode, UTF-8 Tutorial

UTF8 Tutorial

This tutorial covers Unicode and its UTF-8 encoding. It starts from zero by explaining bits and bytes, continues with the ASCII, extended ASCII and Unicode character sets, and ends with some conclusions on how to use UTF-8 encoding and decoding with Flash & PHP.

What are bytes and bits?

Bit means “binary digit” and is the smallest unit of computerized data. A bit is a base-2 digit, i.e. it has a value of either 0 or 1.
A byte is an amount of memory, a certain collection of bits, originally variable in size but now almost always eight bits. This makes 2^8 or 256 possible values for a byte.

byte =  1    2    3    4    5    6    7    8
        bit  bit  bit  bit  bit  bit  bit  bit
        1|0  1|0  1|0  1|0  1|0  1|0  1|0  1|0
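
As a quick check of the 256 figure, here is a minimal Python sketch that enumerates every combination of eight bits. The names BITS_PER_BYTE and all_bytes are chosen for this illustration only and do not come from the original tutorial.

```python
from itertools import product

BITS_PER_BYTE = 8  # assumption: the now-usual eight-bit byte

# Build every possible string of eight 0/1 digits.
all_bytes = ["".join(bits) for bits in product("01", repeat=BITS_PER_BYTE)]

print(len(all_bytes))   # number of distinct byte values
print(all_bytes[0])     # smallest bit pattern
print(all_bytes[-1])    # largest bit pattern
```

The list has 2 ** 8 entries, running from 00000000 to 11111111.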

Some example bytes could be 00000001 or 11111111 or 01010011.
Now how can we calculate the decimal value of such a binary-encoded byte? What we need is a conversion from base 2 to base 10.
Every 1 or 0 of these binary values is associated with a power of 2. For 8 bits it looks like the following:

byte =  1         2         3         4         5         6         7         8
        128 (2^7) 64 (2^6)  32 (2^5)  16 (2^4)  8 (2^3)   4 (2^2)   2 (2^1)   1 (2^0)
        1|0       1|0       1|0       1|0       1|0       1|0       1|0       1|0

The calculation of the decimal equivalent of the binary value 00000001:

byte = 128 64 32 16 8 4 2 1
0 0 0 0 0 0 0 1 = 1

The calculation of the decimal equivalent of the binary value 11111111:

byte = 128 64 32 16 8 4 2 1
1 1 1 1 1 1 1 1 = 128+64+32+16+8+4+2+1 = 255

The calculation of the decimal equivalent of the binary value 01010011:

byte = 128 64 32 16 8 4 2 1
0 1 0 1 0 0 1 1 = 64+16+2+1 = 83
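
The three calculations above can be reproduced with a short Python sketch. The function name binary_to_decimal is made up for this example; it adds up the place value of every bit that is set to 1, which is the same manual method used in the tables.

```python
def binary_to_decimal(bits: str) -> int:
    """Sum the place values (..., 4, 2, 1) of every bit that is set to 1."""
    total = 0
    # Walk from the rightmost bit (place value 2**0) to the leftmost one.
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position
    return total


for example in ("00000001", "11111111", "01010011"):
    print(example, "=", binary_to_decimal(example))
```

Python's built-in int(bits, 2) performs the same conversion; the loop is spelled out here only to mirror the manual calculation.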

What is ASCII?

ASCII stands for American Standard Code for Information Interchange and is a standard for assigning numerical values to the set of letters in the Roman alphabet and typographic characters. The ASCII character set can be represented by 7 bits. This makes 2^7 or 128 different values, and therefore 128 characters.
As ASCII uses only 7 of the 8 bits available in a byte, the first bit is always 0: 0xxxxxxx.
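
To see the 0xxxxxxx pattern in practice, the following Python sketch prints the code and the 8-bit form of a few characters and tests whether the first bit is 0. The helper is_ascii_byte and the sample characters are assumptions made for this illustration, not part of the original tutorial.

```python
def is_ascii_byte(value: int) -> bool:
    """Return True if the highest of the eight bits is 0 (pattern 0xxxxxxx)."""
    return (value & 0b10000000) == 0


for char in ("A", "a", "0", " "):
    code = ord(char)                        # numeric value of the character
    print(repr(char), code, format(code, "08b"))

print(is_ascii_byte(127))   # highest 7-bit value
print(is_ascii_byte(128))   # first value that needs the eighth bit
```

Every value from 0 to 127 passes the check, while 128 and above need the eighth bit.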
The original tutorial continues with a table of decimal values, their binary representations and the characters assigned to them by the ASCII standard. The first 32 characters are control characters. To read more:

Source : http://www.zehnet.de/2005/02/12/unicode-utf-8-tutorial/
