Encoding on the Internet
0: Types of Scripts
Before encoding issues can be understood, it’s important to understand the design
of different types of scripts.
- Script Vs. Language Support
- Alphabets (Left-To-Right)
- Right-to-Left Alphabets
- Syllabic Alphabet
- Links and References
Script vs. Language Support
Technically, you can support any language you want on the Internet – just transliterate it into the English alphabet with no accent marks or non-U.S. punctuation and the language is "supported". Although this is what happens in e-mail and chat rooms all across the world, many people would vastly prefer to use the Internet in their native scripts.
However, the needs of data transmission and the designs of different scripts make placing non-English material online a complex problem. Computers and software were initially designed with just English in mind, and still have to be re-engineered to handle other scripts.
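The accent-free fallback described above can be sketched in a few lines. This is only an illustration, assuming Python's standard `unicodedata` module; the helper names `strip_accents` and `is_ascii_only` are made up for this example, and the approach only works for scripts that are already Latin-based.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose each letter (NFD) and discard the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def is_ascii_only(text: str) -> bool:
    """True if every character fits in 7-bit ASCII (the 'English-only' range)."""
    return all(ord(ch) < 128 for ch in text)

sample = "déjà vu, señor"
plain = strip_accents(sample)

print(plain)                  # deja vu, senor
print(is_ascii_only(sample))  # False
print(is_ascii_only(plain))   # True
```

The output is readable, but the accents carried meaning that is now gone, which is one reason this kind of "support" is unsatisfying.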
Scripts are Not Languages
It is important to note that although some languages are associated with certain defined scripts, the script and the language are not identical. Some languages, such as Serbian (written in both Latin and Cyrillic scripts) or Kurdish, can be written in multiple scripts depending on location or current convention.
There are also some closely related languages (close enough for speakers to understand each other) in which each language is written in a separate script. One of these pairs is Hindi and Urdu. Hindi (N. India, a primarily Hindu area) is written in Devanagari, the writing system associated with Hinduism, while Urdu (Pakistan, a primarily Islamic country) is written in the Arabic script.
Alphabets (Left-To-Right)
An alphabet is a system where each letter represents a single consonant or vowel. Because computers were designed around English, a left-to-right alphabet is relatively easy to encode. Some notable alphabets include:
- Roman Alphabet – The alphabet used to write English is actually the Roman or Latin alphabet. It was developed by the Romans based on Greek and Etruscan forms, spread throughout Europe during the Roman Empire, and was later carried to different parts of the world by various European powers.
- Cyrillic Alphabet – The alphabet used for Russian, which combines elements from the Greek and Roman alphabets. Cyrillic is used to write many languages of the former Soviet Union, including Ukrainian, Belarusian, Uzbek, and more.
- Greek Alphabet – First developed by the Ancient Greeks based on Phoenician letter forms.
- Other Alphabets – Armenian, Georgian, Runic, Coptic, Somali, Phonetic Symbols, Ogam, Braille and others.
Right-to-Left Alphabets
Several Middle Eastern alphabets, including Arabic and Hebrew, are written right-to-left, so directionality should be specified in the HTML.
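As a rough illustration of how directionality might be detected and then declared, the sketch below uses Python's standard `unicodedata.bidirectional` to find the first strongly directional character and emits the HTML `dir` attribute. The function names `base_direction` and `paragraph_html` are hypothetical, and the first-strong-character rule is a simplification of what real layout engines do.

```python
import html
import unicodedata

def base_direction(text: str) -> str:
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew-type and Arabic-type letters
            return "rtl"
        if bidi_class == "L":           # Latin, Greek, Cyrillic, and so on
            return "ltr"
    return "ltr"                        # digits or punctuation only: assume ltr

def paragraph_html(text: str) -> str:
    """Wrap text in a <p> element whose dir attribute states the direction."""
    return f'<p dir="{base_direction(text)}">{html.escape(text)}</p>'

print(paragraph_html("Hello"))   # <p dir="ltr">Hello</p>
print(paragraph_html("שלום"))    # Hebrew, marked dir="rtl"
print(paragraph_html("سلام"))    # Arabic, marked dir="rtl"
```

Declaring the direction explicitly, rather than leaving it to the browser, keeps punctuation and mixed-script text from being laid out on the wrong side.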
In addition, most Arabic and some Hebrew letter forms vary depending on whether a consonant is at the beginning, middle or the end of a word (this was done to make manuscript writing faster). In computing terms, the same letter may need to be displayed in several alternate formats depending on its position in the word.
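The positional variants can be seen directly in Unicode, which keeps compatibility code points for each shape of a letter. The sketch below, assuming Python's standard `unicodedata` module, lists the four shapes of the Arabic letter BEH and shows that each one normalizes back to the same base letter. The function name `positional_forms` is made up for this example. In current practice text is normally stored as the base letter, and the font and rendering software choose the shape.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the base letter as normally stored

# Compatibility code points for the four positional shapes of BEH.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

def positional_forms(forms: dict) -> list:
    """Return (position, Unicode name, base letter) for each positional shape."""
    rows = []
    for position, glyph in forms.items():
        base = unicodedata.normalize("NFKC", glyph)  # folds shape back to base
        rows.append((position, unicodedata.name(glyph), base))
    return rows

for position, name, base in positional_forms(BEH_FORMS):
    print(f"{position:9} {name:36} base is BEH: {base == BEH}")
```

All four rows report the same base letter, which is the "one letter, several displayed forms" situation described above.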
Note: Nearly all alphabets are thought to descend from an ancient Semitic script developed by Bronze Age turquoise miners working in Egypt. Although the letter forms were based on hieroglyphics, the sounds are based on Semitic words, not Egyptian.
Syllabaries
A syllabary is a script in which each symbol represents a single syllable (typically a consonant-vowel sequence). These scripts were developed for languages in which most syllables end in a vowel, so the writing of these languages is more compact. Examples of syllabaries include:
- Japanese Katakana & Hiragana – Two parallel syllabaries used in Japanese to specify the phonetic pronunciation of different words in various contexts. Katakana is the more angular system, similar in appearance to Chinese, while Hiragana is the more rounded system associated with "women’s writing." The Japanese writing system also incorporates Chinese characters.
- Cherokee – Developed by the Cherokee scholar Sequoyah as a "native writing" system. The forms are based on the Roman alphabet, but the values are not related to the English values. Another syllabary, the Ojibwa or Canadian Aboriginal syllabary, was also invented, and variants have been used for Ojibwa, Cree and Blackfoot. In the Ojibwa syllabary, the orientation of the letter (up/down/left/right) indicates which vowel comes after it.
- Cuneiform – The script used on clay tablets for Sumerian, Babylonian, Akkadian and Hittite. An alphabetic form based on the Sinai Semitic alphabet was used to write the Semitic language Ugaritic (so the form of the characters does not always indicate the script type).
- Linear B – The script used to write the oldest Greek documents found in Mycenae. Later Greeks adopted the Phoenician (Semitic) alphabet, which evolved into the Greek alphabet.
- Other Syllabaries – Other societies have independently developed syllabary scripts, but many have been replaced by the Roman alphabet or some other script.
Although writing is shorter, more symbols are needed because there are more possible combinations of consonant+vowel. The encoding of syllabaries therefore requires larger fonts and different allocations of memory.
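The trade-off can be made concrete with a small sketch. The consonant and vowel counts are hypothetical round numbers, not figures for any real language, and the byte counts assume UTF-8, an encoding this section has not introduced. The function name `syllabary_size` is made up for this example.

```python
def syllabary_size(consonants: int, vowels: int) -> int:
    """Symbols needed if every consonant+vowel pair, plus each bare vowel,
    gets its own symbol (a simplified model of a syllabary)."""
    return consonants * vowels + vowels

# Hypothetical sound inventory: 10 consonants and 5 vowels.
alphabet_symbols = 10 + 5
syllabary_symbols = syllabary_size(10, 5)
print(alphabet_symbols, syllabary_symbols)   # 15 55

# The same Japanese word written in Latin letters and in Hiragana.
latin = "kana"
hiragana = "かな"
print(len(latin), len(hiragana))             # 4 2  (symbols written)
print(len(latin.encode("utf-8")),
      len(hiragana.encode("utf-8")))         # 4 6  (bytes stored, UTF-8)
```

The Hiragana spelling uses half as many symbols, yet under this encoding it takes more bytes, because each symbol comes from a larger repertoire than basic Latin.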
Syllabic Alphabet
Scripts that write consonants as the main letters, but then use special symbols or "vowel marks" to indicate which vowel follows the consonant, are called "syllabic alphabets." These are true alphabets in the sense that consonants and vowels are written independently, but because the vowels and consonants combine to form complex characters, they can be difficult to display.
Most South Asian and Southeast Asian scripts are syllabic alphabets; these include Hindi (Devanagari), Tamil, Gujarati, Bengali, Thai, Balinese, Hmong, Tibetan and many others. Korean Hangul can also be classified as a syllabic alphabet.
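The way consonants and vowel marks combine can be inspected with Python's standard `unicodedata` module. The sketch below is only an illustration: it builds one Devanagari syllable from a consonant plus a vowel sign, and splits one Hangul syllable block into its component letters using canonical decomposition (NFD). The variable names are arbitrary.

```python
import unicodedata

# Devanagari: consonant KA followed by the dependent vowel sign I.
ka = "\u0915"            # DEVANAGARI LETTER KA
vowel_sign_i = "\u093F"  # DEVANAGARI VOWEL SIGN I
syllable = ka + vowel_sign_i

print(len(syllable))                       # 2 code points, one displayed cluster
print(unicodedata.category(ka))            # Lo (a letter)
print(unicodedata.category(vowel_sign_i))  # Mc (a spacing combining mark)

# Hangul: one precomposed syllable block decomposes into three letters.
han = "\uD55C"                             # HANGUL SYLLABLE HAN
letters = unicodedata.normalize("NFD", han)
print(len(han), len(letters))              # 1 3
for letter in letters:
    print(f"U+{ord(letter):04X}", unicodedata.name(letter))
```

The vowel sign I is stored after the consonant but is drawn to its left, a small example of why these scripts need more than one-character-at-a-time rendering.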
Ideographic Scripts
In an ideographic script, a character is used to represent one concept, regardless of pronunciation. The Chinese script is considered ideographic, although there are methods to write phonetic pronunciations. The Chinese script is probably the largest in terms of characters, because each concept requires a symbol. Compounding is commonly used to create additional words.
English speakers are also commonly exposed to two additional ideographic systems:
- Emoticons/Emoji – The smiling faces (😀), hearts (❤) and coffee cups (☕) used in social media and email are images which are understood independently of any spoken language.
- Mathematical Symbols – A number symbol such as 8 is pronounced as eight in English, but ocho in Spanish or bat in Basque.
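Both kinds of symbol are encoded as ordinary characters, which the following sketch illustrates using Python's standard `unicodedata` module. The English names it prints are just labels assigned by the Unicode standard; the symbols themselves carry no pronunciation. The helper name `describe` is made up for this example.

```python
import unicodedata

def describe(symbol: str) -> str:
    """Return the code point and Unicode character name for one symbol."""
    return f"U+{ord(symbol):04X} {unicodedata.name(symbol)}"

for symbol in ("😀", "❤", "☕", "8"):
    print(describe(symbol))
# U+1F600 GRINNING FACE
# U+2764 HEAVY BLACK HEART
# U+2615 HOT BEVERAGE
# U+0038 DIGIT EIGHT

# The same number written with digits from three different scripts.
for digit in ("8", "\u096E", "\u0668"):  # Latin, Devanagari, Arabic-Indic
    print(unicodedata.name(digit), int(digit))
```

All three digit characters convert to the integer 8, mirroring the point that the concept is shared even though the spoken word and the written shape differ.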
Links and References
- Unicode Tutorials – includes tutorials on different scripts with reference to Unicode
- Ancient Scripts Com
- Rogers, Henry (2004). Writing Systems: A Linguistic Approach – a good introduction to the basics.