Encoding on the Internet

7: Unicode

What is Unicode?

Although many encoding standards have been developed and implemented for individual scripts, developers realized that a single encoding scheme covering all the world's scripts was needed to facilitate global data exchange.

Unicode (www.unicode.org) is a global encoding standard that aims to include every character of every script in a single system. Development began in the late 1980s and still continues on multiple fronts, but Unicode already covers the majority of modern scripts in use.

Advantages

Unicode allows:

  • One web page to display multiple scripts from anywhere in the world. Previous encodings were restricted to only a few scripts.
  • Text files from any computer in any language to be exchanged.
  • Font support to be standardized across platforms. Today, devices that support Unicode include cell phones, tablets and computers.
  • Each character in each script to have one code point.

Organization of Unicode

Unicode is structured as follows:

  • Characters #0-255 (x00-xFF in hexadecimal notation) are Latin 1 (ISO-8859-1)
    • Characters #0-127 (x00-x7F) – ASCII
    • Characters #128-255 (x80-xFF) – Rest of Latin 1
  • Blocks – All characters are organized in blocks by script or symbol type. Originally this allowed scripts to preserve alphabetical or other appropriate sort orders, but many scripts have had new characters added in later versions. See the Unicode Charts (www.unicode.org/charts/) for details.
  • Planes – Another organizational strategy is to divide Unicode into planes of 2^16 or 65,536 (x10000) code points. Characters U+0000 – U+FFFF are said to be in Plane 0 or the BMP (Basic Multilingual Plane) and need only 4 hexadecimal digits in their code points. Beyond that point lie Planes 1-16, which cover U+10000 to U+10FFFF. This allows over one million potential code points, but not all of them have been assigned. Characters that have been assigned in these planes typically include emoji, ancient scripts, specialized symbols and less common CJK ideographs.
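
The plane arithmetic can be checked with a short Python sketch. The helper name plane_of is hypothetical and introduced only for illustration; it assumes a code point is a plain integer and that every plane holds x10000 (65,536) code points.

```python
PLANE_SIZE = 0x10000  # 2**16 = 65,536 code points per plane


def plane_of(code_point):
    """Return the plane number (0-16) that a code point falls in."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point // PLANE_SIZE


print(plane_of(0x0041))    # Latin capital A: 0 (the BMP)
print(plane_of(0x041B))    # Cyrillic capital L: 0 (the BMP)
print(plane_of(0x1F600))   # an emoji code point: 1
print(plane_of(0x10FFFF))  # the last code point: 16
print(17 * PLANE_SIZE)     # code points in Planes 0-16: 1114112
```

Integer division by the plane size is enough here because every plane is the same size.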

Unicode Code Point Notation

Unicode character numbers or code points are given in hexadecimal (base 16) format preceded by "U+", but in some cases the decimal version of the code point is used instead. For instance, Cyrillic capital L (Л) is formally designated U+041B, which converts to a decimal value of #1051.
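
As a small illustration, the conversion between a character, its decimal value and its U+ notation can be sketched in Python using the built-in ord and chr functions. The variable names are chosen for this example only.

```python
char = "\u041b"          # Cyrillic capital L, written as an escape
code_point = ord(char)   # integer value of the character

print(code_point)               # decimal form: 1051
print(f"U+{code_point:04X}")    # U+ notation, at least 4 hex digits: U+041B
print(chr(1051) == char)        # back from the decimal value: True
```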

Hexadecimal (Base 16) Numbers

Computing, and therefore character encoding systems for computers, is based on powers of two. Sixteen (2^4) happens to be a convenient organizational unit in encoding specifications.

Hexadecimal notation is a numbering scheme which groups quantities into groups of 16 (base 16) instead of 10. The notation consists of the digits 0-9, plus A-F for the values 10-15. Hexadecimal numbers are sometimes, but not always, preceded by the character x. One example is the equal sign character (=) which is U+003D or x3D. This is equivalent to 3 x 16 plus 13 (i.e. 48+13) or decimal #61.
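
The same arithmetic for the equal sign can be reproduced in Python as a quick check. Note that Python writes the hexadecimal prefix as 0x rather than x.

```python
value = int("3D", 16)   # parse the hexadecimal digits 3D

print(value)            # 61
print(3 * 16 + 13)      # the same sum written out by hand: 61
print(hex(61))          # back to hexadecimal: 0x3d
print(chr(value))       # the character at that code point: =
```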

Unicode Transforms (UTF-8 and UTF-16)

Although a Unicode code point can need four to six hexadecimal digits, in practice many files, particularly those in Western European languages, do not need all the digits for every character. Therefore "transforms" such as UTF-8 and UTF-16 were developed to encode Unicode characters in smaller units as needed: UTF-8 uses one to four 8-bit units per character, and UTF-16 uses one or two 16-bit units. UTF-8 also allows a certain amount of backwards compatibility with older systems which can only handle 8-bit encoding blocks.

For Web pages, Unicode data is typically stored in UTF-8 format. Other platforms, including database applications and operating systems, may use other Unicode encoding forms such as UTF-16.
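
To make the size difference between the transforms concrete, here is a minimal Python sketch that counts the bytes each encoding needs for a few sample characters. The helper name encoded_sizes and the choice of samples are assumptions made for this example.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch occupies in UTF-8 and in UTF-16."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, without a byte order mark
    return len(utf8), len(utf16)


samples = [
    "=",            # U+003D, ASCII
    "\u00e9",       # U+00E9, Latin 1 (e with acute accent)
    "\u041b",       # U+041B, Cyrillic capital L
    "\u20ac",       # U+20AC, euro sign
    "\U0001f600",   # U+1F600, an emoji outside the BMP
]

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")

# ASCII text is byte-for-byte identical in UTF-8.
print("x=1".encode("utf-8") == "x=1".encode("ascii"))  # True
```

The ASCII character takes one byte in UTF-8 but two in UTF-16, while the emoji takes four bytes in both.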

Test Pages and Progress

Web Pages

The most recent operating systems and browsers support Unicode. You can view the pages below to see which scripts your browsers support.

If a script is not supported on your browser, check the By Language pages to learn more about font downloads and other support issues.

Other Software

Although many software packages support Unicode, not all of them do.

Links

Unicode Gurus

Most of these sites are blog-like, but they can contain valuable technical references.

Advanced Unicode

Unicode in Operating Systems
