Encoding on the Internet
6: Double Byte Encoding for East Asian Languages
Large Encodings for Non-Alphabets
The scripts discussed on the previous page, such as Greek, Hebrew, Arabic and Cyrillic, are all alphabetic and roughly the same size as the Roman alphabet. For syllabary or ideographic scripts, however, the repertoire of characters can be far larger than 256 characters. These scripts require an encoding scheme which can accommodate many more characters.
As a result, 16-bit or double-byte character sets (DBCS), which allow tens of thousands of characters, were developed for these scripts. In practice, characters are organized into blocks of 192 characters.
Chinese, Japanese and Korean (CJK)
Because many East Asian scripts incorporate Chinese characters, they are collectively known as CJK scripts, short for Chinese-Japanese-Korean. The scripts are not identical, but all of them are the same order of magnitude in size.
Notable CJK encodings from before Unicode include:
- Shift-JIS (Japanese)
- GB2312 (Simplified Chinese, People's Republic of China)
- Big5 (Traditional Chinese, Taiwan)
However, Unicode is now used on many East Asian language sites.
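As a rough illustration of how these legacy encodings differ, here is a minimal sketch (using the codec names that ship with Python's standard library) that encodes the same Chinese character under each of them and under UTF-8; the resulting byte values differ in every case:

```python
# Encode the same character under several legacy CJK encodings
# and under UTF-8, to show that each produces different bytes.
text = "中"  # a common Chinese character, present in all three repertoires

for encoding in ("gb2312", "big5", "shift_jis", "utf-8"):
    encoded = text.encode(encoding)
    print(f"{encoding:10} -> {encoded.hex(' ')}")
```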
Encoding Template
To accommodate both English and the other scripts, many 16-bit encodings are structured as follows:
- Characters #0-127 – ASCII
- Characters #128-… – Other scripts
- Block – Most DBCS encodings are organized into blocks numbered in hexadecimal. The first two hexadecimal digits (the first byte) denote the block, and the last two (the second byte) denote a character within that block.
For example, the Japanese character ァ (small katakana A) in Shift-JIS is at position 8340, that is, block 83, character 40 (83.40). See a Shift-JIS code table for the full layout.
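A minimal sketch of this block/position split, assuming Python's built-in shift_jis codec:

```python
# Encode a katakana character with Python's built-in Shift-JIS codec,
# then split the two-byte code into its block (lead byte) and its
# position within that block (trail byte).
char = "ァ"
lead, trail = char.encode("shift_jis")  # two bytes for a non-ASCII character

print(f"block {lead:02X}, position {trail:02X}")  # e.g. block 83, position 40

# ASCII characters stay in the single-byte range and belong to no block:
print("A".encode("shift_jis").hex())  # a single byte
```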
Interestingly, many East Asian encodings also incorporate other scripts, such as the Cyrillic and Greek alphabets. Some browsers, especially on the Mac platform, may use a Japanese font as the default for Cyrillic or Greek pages.
Mixing ASCII with DBCS Characters
In theory, every character, even an ASCII character, could be specified with a four-digit (two-byte) code in a double-byte character set. In practice, Western text is often stored with only one byte per character; one reason for this is to save memory.
Therefore, algorithms are needed to distinguish single-byte characters embedded in a text from double-byte characters. The same techniques are also needed in Unicode.
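One common approach is to scan the byte stream and use the encoding's lead-byte ranges to decide whether the next character occupies one byte or two. Below is a minimal sketch for Shift-JIS-style data; the lead-byte ranges 0x81-0x9F and 0xE0-0xFC are specific to Shift-JIS, and other DBCS encodings use different ranges:

```python
def split_shift_jis(data: bytes) -> list[bytes]:
    """Split a Shift-JIS byte string into per-character byte sequences.

    A byte below 0x80 is a single-byte ASCII character; bytes in the
    lead-byte ranges 0x81-0x9F and 0xE0-0xFC start a two-byte character.
    (Bytes 0xA1-0xDF, half-width katakana, are also single-byte.)
    """
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            chars.append(data[i:i + 2])  # double-byte character
            i += 2
        else:
            chars.append(data[i:i + 1])  # single-byte character
            i += 1
    return chars


# Mixed ASCII and Japanese text: ASCII stays one byte per character,
# while kana and kanji take two bytes each.
mixed = "ABCアイウ".encode("shift_jis")
print([c.hex() for c in split_shift_jis(mixed)])
```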