Encoding on the Internet
What is Unicode?
Although multiple encoding standards have been developed and implemented for multiple scripts, developers realized that a single encoding scheme covering all scripts in the world was needed in order to facilitate data exchange around the world.
Unicode (www.unicode.org) is a global encoding standard that aims to include all characters from all scripts in a single encoding system. Development began in the late 1980s and still continues on multiple fronts, but Unicode currently covers the majority of modern scripts in use.
Unicode allows:
- One web page to display multiple scripts from anywhere in the world. Previous encodings were restricted to only a few scripts.
- Text files from any computer in any language to be exchanged.
- Font support to be standardized across platforms. Today, devices that support Unicode include cell phones, tablets and computers.
- Each character in each script to have one code point.
Organization of Unicode
Unicode is structured as follows:
- Characters #0-255 (x00-xFF in hexadecimal notation) are Latin 1 (ISO-8859-1)
- Characters #0-127 (x00-x7F) – ASCII
- Characters #128-255 (x80-xFF)- Rest of Latin 1
- Blocks – All characters are organized in blocks by script or symbol type. Originally this allowed scripts to preserve alphabetical or other appropriate sort orders, but many scripts have had new characters added in later versions. See the Unicode Charts (www.unicode.org/charts/) for details.
- Planes – Another organizational strategy is to divide Unicode into planes of 2^16 or 65,536 code points (x0000-xFFFF) each. Characters U+0000 – U+FFFF are said to be in Plane 0 or the BMP (Basic Multilingual Plane) and only need 4 hexadecimal digits in their code points. Beyond that point lie Planes 1-16, which cover U+10000 to U+10FFFF. This allows over one million potential code points, but not all of them have been assigned. Code points that have been assigned in these planes typically include emoji, ancient scripts, specialized symbols and less common CJK ideographs.
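As a rough illustration of this layout, the Python sketch below classifies a few code points. The function names (plane_of, is_ascii, is_latin1) are hypothetical helpers written for this example, not part of any Unicode library. It assumes the full Unicode range of U+0000 to U+10FFFF, with each plane holding 2^16 code points.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that a code point falls in."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the Unicode range")
    # Each plane holds 2**16 code points, so the plane is the value above the low 16 bits.
    return code_point >> 16


def is_ascii(code_point: int) -> bool:
    """True for characters #0-127 (x00-x7F)."""
    return 0 <= code_point <= 0x7F


def is_latin1(code_point: int) -> bool:
    """True for characters #0-255 (x00-xFF)."""
    return 0 <= code_point <= 0xFF


# Sample characters: A, e with acute accent, Cyrillic El, and an emoji.
for ch in ["A", "\u00e9", "\u041b", "\U0001F600"]:
    cp = ord(ch)
    print(f"U+{cp:04X} plane={plane_of(cp)} ascii={is_ascii(cp)} latin1={is_latin1(cp)}")
```

Only the emoji in this sample falls outside Plane 0, which is why it needs five hexadecimal digits in its code point.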
Unicode Code Point Notation
Unicode character numbers or code points are given in hexadecimal (base 16) format preceded by "U+", but in some cases the decimal version of the code point is used instead. For instance Cyrillic capital L (Л) is formally designated U+041B which converts to a decimal value of #1051.
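A minimal Python sketch of moving between a character, its "U+" notation and its decimal value is shown here. The helper names to_u_notation and from_u_notation are made up for this example; ord, chr and int are Python built-ins.

```python
def to_u_notation(ch: str) -> str:
    """Format a single character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


def from_u_notation(label: str) -> str:
    """Turn a string such as 'U+041B' back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError("expected a label of the form U+XXXX")
    return chr(int(label[2:], 16))


el = "\u041b"  # Cyrillic capital El
print(to_u_notation(el))   # hexadecimal form
print(ord(el))             # decimal form
print(from_u_notation("U+041B") == el)
```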
Hexadecimal (Base 16) Numbers
Computing, and therefore character encoding systems for computers, is based on powers of two. Sixteen (2^4) happens to be a convenient organizational unit in encoding specifications.
Hexadecimal notation is a numbering scheme which groups quantities into groups of 16 (base 16) instead of 10. The notation consists of the digits 0-9, plus A-F for the values 10-15. Hexadecimal numbers are sometimes, but not always, preceded by the character x. One example is the equal sign character (=), which is U+003D or x3D. This is equivalent to 3 x 16 plus 13 (i.e. 48+13) or decimal #61.
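The arithmetic can be checked with a short Python sketch. The hex_to_decimal function is a hypothetical helper written out by hand to show the positional calculation; in practice Python's built-in int(text, 16) does the same job.

```python
HEX_DIGITS = "0123456789ABCDEF"


def hex_to_decimal(text: str) -> int:
    """Convert a hexadecimal string to an integer, one digit at a time."""
    value = 0
    for digit in text.upper():
        # Shift the running total up one base-16 place, then add the new digit.
        value = value * 16 + HEX_DIGITS.index(digit)
    return value


print(hex_to_decimal("3D"))   # 3 x 16 + 13
print(int("3D", 16))          # the built-in conversion gives the same value
print(chr(0x3D))              # the character at that code point
```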
Unicode Transforms (UTF-8 and UTF-16)
Although each character in a Unicode text file could theoretically be stored using its full code point of four to six hexadecimal digits, in practice many files, particularly those in Western European languages, do not need all of those digits for every character. Therefore "transforms" such as UTF-8 and UTF-16 were developed to allow Unicode characters to be stored and transmitted in smaller chunks as needed. They also allow a certain amount of backwards compatibility with older systems which can only handle 8-bit encoding blocks.
For Web pages, Unicode data is typically stored in UTF-8 format. Other platforms, including database applications and operating systems, may use other Unicode encodings.
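To give a sense of how the transforms differ, the Python sketch below encodes a few sample characters both ways and reports how many bytes each takes. The encoded_bytes helper and the choice of sample characters are assumptions made for this illustration; str.encode and the "utf-8" and "utf-16-be" codec names are standard Python.

```python
def encoded_bytes(ch: str):
    """Return the UTF-8 and big-endian UTF-16 bytes for one character."""
    return ch.encode("utf-8"), ch.encode("utf-16-be")


# Equal sign, e with acute accent, Cyrillic El, euro sign, and an emoji.
SAMPLES = ["=", "\u00e9", "\u041b", "\u20ac", "\U0001F600"]

for ch in SAMPLES:
    utf8, utf16 = encoded_bytes(ch)
    print(
        f"U+{ord(ch):04X}  "
        f"UTF-8: {utf8.hex()} ({len(utf8)} bytes)  "
        f"UTF-16: {utf16.hex()} ({len(utf16)} bytes)"
    )

# ASCII characters keep their single-byte values in UTF-8.
print("=".encode("utf-8") == "=".encode("ascii"))
```

In this sample, the ASCII character takes one byte in UTF-8 and the emoji takes four, while UTF-16 uses two bytes for every character in Plane 0 and four for the emoji.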
Test Pages and Progress
The most recent operating systems and browsers support Unicode. You can view the pages below to see which scripts your browsers support.
If a script is not supported in your browser, check the By Language pages to learn more about font downloads and other support issues.
Although many software packages support Unicode, not all of them do.
- Unicode Organization – Includes charts, updates, listserv and more.
- Links to PDF Unicode Charts – Organized by scripts.
- W3C Internationalization – The Articles link includes basic tutorials
- Unicode Block Viewer – View Unicode codes by script block. Proper fonts must be installed.
Most of these sites are blog-like, but they can contain valuable technical references.
- Alan Wood’s Unicode Resources
- Evertype (Michael Everson)
- Evertype Unicode Proposals
- Macchiato (Mark Davis)
- Tex Texin, International Guy
- Joel on Software: What Every Programmer Should Know about Unicode – Tutorial on how different implementations of Unicode differ.
Unicode in Operating Systems
- Windows and Unicode (Microsoft)
- OS X and Unicode
- Mac String Programming Guide
- Mac Typography
- Linux and Unicode
- Linux Keyboard Entry – May only work in some versions
- Unix – Information on Unicode in the Linux and Unix environment.