Encoding on the Internet
5: 8-Bit Non-Roman Encoding
The Problem
Although 256 characters can support most Western European languages, they are not enough to handle non-Roman scripts, or even languages written in the Roman alphabet that fall outside the Latin 1 character set. Therefore, other 8-bit encodings were developed for languages outside Western Europe.
Template
To accommodate both English and the other script, many 8-bit encodings are structured as follows:
- Characters #0-127 – ASCII
- Characters #128-255 – Other script.
Script | Encoding | #0-127 | #128-255 |
---|---|---|---|
Arabic | ISO-8859-6* (rarely used) | ASCII | Arabic |
Greek | ISO-8859-7* | ASCII | Greek |
Hebrew | ISO-8859-8* | ASCII | Hebrew |
*External links to Wikipedia
On the Internet, if you switch the encoding in your browser's View menu, you will in most cases still see English, because the new encoding also supports it.
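The template can be checked with a minimal Python sketch. The codec names (`iso8859_6`, `iso8859_7`, `iso8859_8`) are Python's standard-library names for the encodings in the table, and the helper `decode_everywhere` and the sample byte `0xE1` are illustrative assumptions, not part of the original material.

```python
# Python codec names for the three ISO-8859-x sets in the table above.
ENCODINGS = ["iso8859_6", "iso8859_7", "iso8859_8"]


def decode_everywhere(raw):
    """Decode the same bytes under each encoding; return {codec: text}."""
    return {name: raw.decode(name, errors="replace") for name in ENCODINGS}


# Bytes in the 0-127 range: plain ASCII in every encoding.
ascii_results = decode_everywhere(b"Hello")

# One byte in the 128-255 range: a different letter in each script.
high_results = decode_everywhere(bytes([0xE1]))

for name in ENCODINGS:
    print(name, ascii_results[name], high_results[name])
```

The ASCII bytes should read "Hello" under all three encodings, while the single high byte should come out as an Arabic, a Greek, and a Hebrew letter respectively.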
Behavior of Encoded Fonts
Because non-Roman encodings include ASCII, if you switch to a properly encoded font in a word processor and begin to type, you will see English characters. It is not until you switch your keyboard that the non-Roman letters appear.
Parallel Standards
"Windows" Encodings vs. ISO-8859-x
For many scripts, there is a competing Windows encoding standard and a non-Windows standard, typically one registered at the ISO as an ISO-8859-x set. For instance, Hebrew Web pages can be encoded as either ISO-8859-8 ("Visual Hebrew") or Windows-1255.
Script | ISO/Other | Windows Encoding |
---|---|---|
Arabic | ISO-8859-6 | Windows-1256 |
Greek | ISO-8859-7 ("ELOT") | Windows-1253 |
Hebrew | ISO-8859-8 ("Visual Hebrew") | Windows-1255 |
Russian/Cyrillic | KOI-8 | Windows-1251 |
Thai | TIS-620 | Windows-874 |
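Why parallel standards matter can be sketched in Python: the same text produces different bytes under each standard, and bytes read with the wrong one come out garbled. This sketch assumes the KOI8-R variant of KOI-8 (Python's `koi8_r` codec) and uses an arbitrary sample word.

```python
text = "Привет"  # sample Russian word, chosen only for illustration

koi8_bytes = text.encode("koi8_r")
cp1251_bytes = text.encode("cp1251")

# Decoding with the matching codec recovers the text.
round_trip = koi8_bytes.decode("koi8_r")

# Decoding KOI8-R bytes as Windows-1251 yields Cyrillic letters, but the wrong ones.
garbled = koi8_bytes.decode("cp1251", errors="replace")

print(koi8_bytes.hex(), cp1251_bytes.hex())
print(round_trip, garbled)
```

A page therefore has to declare which of the two standards it uses, since the bytes alone do not say.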
Roman Script but Not Latin 1
In addition to the cases above, there are also languages which are written in the Latin alphabet but include characters NOT in the ISO-8859-1 (Latin 1) encoding. These include "Central European" languages like Hungarian (with ő), Polish (ą, ł) and Czech (š, ů), as well as other languages like Turkish (ş, ğ, ı), Welsh (ŵ, ŷ) and, ironically, Latin (with ā, ē).
Like Arabic and Greek, they were placed in different 8-bit encoding systems which included the ASCII characters, but also the accented letters needed for a language.
Central European and Latin 2
The characters from the neighboring countries of Hungary, Poland, the former Czechoslovakia, the former Yugoslavia and Germany were placed together in a variety of "Central European" encodings including ISO-8859-2 (aka Latin 2).
Script | ISO/Other | Windows Encoding |
---|---|---|
Central Europe | ISO-8859-2 ("Latin 2") | Windows-1250 |
Covered Latin 2 languages include Bosnian, Czech, Croatian, Hungarian, Polish, Serbian, Slovak, Slovenian and Sorbian.
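As a small illustration of the gap between Latin 1 and Latin 2, the following Python sketch encodes the Hungarian ő. The variable names are hypothetical, and `iso8859_2` and `latin_1` are Python's codec names for ISO-8859-2 and ISO-8859-1.

```python
letter = "ő"  # Hungarian o with double acute

latin2_byte = letter.encode("iso8859_2")

# Latin 1 has no slot for this letter, so encoding fails outright.
try:
    letter.encode("latin_1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

# Reading the Latin 2 byte as Latin 1 silently shows a different letter.
misread = latin2_byte.decode("latin_1")

print(latin2_byte, fits_in_latin1, misread)
```

The misread result is the familiar symptom of a Hungarian page displayed as Latin 1: õ appears where ő was intended.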
In that era, computers needed access to parallel fonts which included these characters. Hence Mac users in the 90s would have Times New Roman, Times New Roman CE and even Times New Roman CY with Cyrillic characters. Today, most versions of Times New Roman are based on Unicode and include all the characters needed.
Other Latin Encodings
Beyond Central Europe, encodings were developed for the Baltic languages (Lithuanian, Latvian, Estonian) and Turkish. Some theoretical encodings were developed but never fully implemented once Unicode became viable.
Script | ISO/Other | Windows Encoding |
---|---|---|
Baltic | ISO-8859-4 ("Latin 4") | Windows-1257 |
Turkish | ISO-8859-9 | Windows-1254 |
Celtic (never implemented) | ISO-8859-14 | N/A |
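To see which of these Latin-script encodings can hold a given letter, a rough Python sketch can simply try each codec. The helper `encodings_supporting`, the candidate list, and the sample letters are assumptions made for illustration; the codec names are the ones Python's standard library ships, which happens to include one for ISO-8859-14.

```python
# Python codec names for Latin 1, Latin 2, Latin 4, Turkish and Celtic.
CANDIDATES = ["latin_1", "iso8859_2", "iso8859_4", "iso8859_9", "iso8859_14"]


def encodings_supporting(text):
    """Return the candidate codecs that can encode every character of text."""
    supported = []
    for name in CANDIDATES:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character has no slot in this encoding
        supported.append(name)
    return supported


# Turkish g with breve, Lithuanian u with ogonek, Welsh w with circumflex.
for sample in ["ğ", "ų", "ŵ"]:
    print(sample, encodings_supporting(sample))
```

No single 8-bit encoding in the list covers all three letters, which is the gap Unicode was later used to close.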