Encoding on the Internet

5: 8-Bit Non-Roman Encoding

The Problem

Although 256 characters can support most Western European languages, that is not enough to handle non-Roman scripts, or even languages written in the Roman alphabet whose characters fall outside the Latin 1 character set. Therefore, other 8-bit encodings were developed for languages outside Western Europe.

Template

To accommodate both English and the other script, many 8-bit encodings are structured as follows:

  • Characters #0-127 – ASCII
  • Characters #128-255 – Other script.
Structure of Non-English Encodings
Script Encoding #0-127 #128-255
Arabic ISO-8859-6* (rarely used) ASCII Arabic
Greek ISO-8859-7* ASCII Greek
Hebrew ISO-8859-8* ASCII Hebrew

*External links to Wikipedia

On the Internet, if you switch the encoding in your browser's View menu, in most cases you will still see English text, because the new encoding also supports it.
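The shared ASCII range can be checked with a short Python sketch. It assumes Python's built-in codec names `iso8859_6`, `iso8859_7` and `iso8859_8` for the three encodings in the table; the helper `decode_all` is a hypothetical name used only for illustration.

```python
# Decode the same raw bytes under three of the 8-bit encodings from the table.
# The codec names are Python's standard aliases for ISO-8859-6/7/8.
ENCODINGS = ["iso8859_6", "iso8859_7", "iso8859_8"]


def decode_all(raw):
    """Return the text each encoding produces for the same raw bytes."""
    return {name: raw.decode(name) for name in ENCODINGS}


# Bytes 0-127: every encoding agrees, because that range is ASCII.
ascii_views = decode_all(b"Hello")

# One byte from 128-255: each encoding assigns it a letter of its own script.
high_views = decode_all(bytes([0xE1]))

for name in ENCODINGS:
    print(name, ascii_views[name], high_views[name])
```

The English text survives every switch, while the single high byte comes out as a different Arabic, Greek or Hebrew letter depending on the encoding chosen.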

Behavior of Encoded Fonts

Because non-Roman encodings include ASCII, if you switch to a properly encoded font in a word processor and begin to type, you will see English characters. It is not until you switch your keyboard that the non-Roman letters appear.

Parallel Standards

"Windows" Encodings vs. ISO-8859-x

For many scripts, there is a competing Windows encoding standard and a non-Windows standard, typically one registered at the ISO as an ISO-8859-x set. For instance, Hebrew Web pages can be encoded as either ISO-8859-8 ("Visual Hebrew") or Windows-1255.

 

Variant Encodings by Script
Script ISO/Other Windows Encoding
Arabic ISO-8859-6 Windows-1256
Greek ISO-8859-7 ("ELOT") Windows-1253
Hebrew ISO-8859-8 ("Visual Hebrew") Windows-1255
Russian/Cyrillic KOI-8 Windows-1251
Thai TIS-620 Windows-874
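To see why parallel standards matter, the following Python sketch encodes one Russian word under both Cyrillic encodings from the table. It assumes that Python's `koi8_r` codec (the Russian variant of KOI-8) and `cp1251` codec correspond to the table entries; the sample word is arbitrary.

```python
text = "Привет"  # arbitrary Russian sample word

# Encode the same text under the two competing Cyrillic standards.
koi8_bytes = text.encode("koi8_r")  # assumed match for "KOI-8" in the table
win_bytes = text.encode("cp1251")   # Windows-1251

# Both use one byte per letter, but the byte values differ.
print(koi8_bytes.hex(" "))
print(win_bytes.hex(" "))

# Reading KOI-8 bytes as Windows-1251 gives Cyrillic letters, but the wrong ones.
mojibake = koi8_bytes.decode("cp1251")
print(mojibake)
```

Decoding with the wrong member of the pair raises no error; it silently produces unreadable text, which is why a page has to declare which of the two encodings it uses.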

Roman Script but Not Latin 1

In addition to the cases above, there are also languages which are written in the Latin alphabet but include characters NOT in the ISO-8859-1 (Latin 1) encoding. These include "Central European" languages like Hungarian (with ő), Polish (ą, ł) and Czech (š, ů), as well as other languages like Turkish (ş, ğ, ı) and Welsh (ŵ, ŷ) and, ironically, Latin (with ā, ē).

Like Arabic and Greek, they were placed in different 8-bit encoding systems which included the ASCII characters, but also the accented letters needed for a language.

Central European and Latin 2

The characters from the neighboring countries of Hungary, Poland, the former Czechoslovakia, the former Yugoslavia and Germany were placed together in a variety of "Central European" encodings, including ISO-8859-2 (aka Latin 2).

Variant Encodings for Central European
Script ISO/Other Windows Encoding
Central Europe ISO-8859-2 ("Latin 2") Windows-1250

Covered Latin 2 languages include Bosnian, Czech, Croatian, Hungarian, Polish, Serbian, Slovak, Slovenian and Sorbian.

In that era, computers needed access to parallel fonts which included these characters. Hence Mac users in the 90s would have Times New Roman, Times New Roman CE and even Times New Roman CY with Cyrillic characters. Today, most versions of Times New Roman are based on Unicode and include all the characters needed.

Other Latin Encodings

Beyond Central Europe, encodings were developed for the Baltic languages (Lithuanian, Latvian, Estonian) and Turkish. Some theoretical encodings were developed but never fully implemented once Unicode became viable.

Variant Latin Encodings
Script ISO/Other Windows Encoding
Baltic ISO-8859-4 ("Latin 4") Windows-1257
Turkish ISO-8859-9 Windows-1254
Celtic (never implemented) ISO-8859-14 N/A
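As a rough check of which encoding covers which language, the Python sketch below tries to encode the sample characters mentioned earlier under Latin 1 and under the ISO encodings from the tables. The codec names are Python's built-in aliases and are assumed to match the table entries; `can_encode` and `supporting_encodings` are hypothetical helper names.

```python
# Sample characters per language, as cited in the text.
SAMPLES = {
    "Hungarian": "ő",
    "Polish": "ął",
    "Czech": "šů",
    "Turkish": "şğı",
    "Welsh": "ŵŷ",
    "Latin": "āē",
}

# Python codec names assumed to match Latin 1, Latin 2, Latin 4,
# ISO-8859-9 and ISO-8859-14.
CANDIDATES = ["latin_1", "iso8859_2", "iso8859_4", "iso8859_9", "iso8859_14"]


def can_encode(text, encoding):
    """True if every character of text has a byte in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def supporting_encodings(text):
    """List the candidate encodings able to represent the whole sample."""
    return [enc for enc in CANDIDATES if can_encode(text, enc)]


coverage = {lang: supporting_encodings(chars) for lang, chars in SAMPLES.items()}
for lang, encodings in coverage.items():
    print(lang, encodings)
```

None of the samples can be encoded as Latin 1, which is the gap these regional encodings were created to fill.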
