Thanks to John Logan for input into some terms.

# Terms

8-bit encoding – Any encoding system which is restricted to a maximum of 256 characters. "8-bit" refers to the fact that each character is stored in 8 bits of memory, which allows 2^8 (256) distinct values. Common 8-bit encoding systems include ISO-8859-1 (Latin 1), MacRoman, Windows 1252, KOI 8 and others. See more on 8-Bit Encoding.
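As a rough illustration of why the choice of 8-bit encoding matters, the sketch below decodes the same single byte under four of the encodings named above. The codec names are those used by Python's standard library, and the byte value 0xE9 is just an arbitrary example.

```python
# One byte value, read through several 8-bit encodings.
raw = bytes([0xE9])

decoded = {
    name: raw.decode(name)
    for name in ("latin-1", "cp1252", "mac_roman", "koi8_r")
}

for name, char in decoded.items():
    print(f"{name:10} 0xE9 -> {char}")

# An 8-bit encoding has at most 2**8 slots.
print(2 ** 8)  # 256
```

The byte is identical in every case; only the table used to interpret it changes, which is why text saved in one 8-bit encoding can look garbled when opened in another.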

A Terms

AAT font (Apple) – Apple Advanced Typography font. A vector font standard used in Apple fonts which uses the ATSUI imaging engine. AAT and Microsoft/Adobe OTF fonts are normally compatible, but there are significant differences for South Asian fonts. Some AAT fonts like Devanagari MT may work only in limited applications such as TextEdit. Read the Apple AAT Documentation.

abjad – an alphabet with consonant symbols only, with vowels optionally represented by vowel points. Arabic, Phoenician and Hebrew scripts are abjads.

abugida – another term for a syllabic alphabet.

acute accent – the accent mark on vowels which slants forward as in ó.
Also known as "accent aigu" in French.

alphabet – any writing system in which each consonant and vowel is represented by a single character. Some alphabets are abjads (consonants only) or syllabic alphabets (consonants with vowel marks).

ANSI – another term for Microsoft’s Windows-1252 encoding for Western European characters.

Armscii – Encoding system developed for Armenian. Armscii is being gradually replaced by Unicode. Read more on Armenian and Armscii.

ASCII encoding – one of the earliest encoding systems developed for U.S. English. Originally developed in the 1960s, it is supported by almost all modern computers and computer software, even those sold outside the United States. The first 128 characters of Unicode match ASCII making conversion of English materials more seamless. See more on ASCII.
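A quick way to see the ASCII/Unicode overlap is to encode plain English text both ways and compare the bytes. This is a minimal sketch; the sample string is arbitrary.

```python
text = "Hello"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For characters in the ASCII range, the two encodings give identical bytes.
same_bytes = ascii_bytes == utf8_bytes
all_below_128 = all(b < 128 for b in utf8_bytes)

print(same_bytes, all_below_128)
```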

ASMO 449 and ASMO 708 – A set of Arabic encoding standards developed by the Arab Standardization and Metrology Organization (ASMO). ASMO 708 was also adopted as ISO-8859-6.

ATSUI (Apple) – Apple Type Services for Unicode Imaging. The mechanism used in Apple AAT fonts to determine placement of adjacent characters. This is normally compatible with Microsoft/Adobe OTF fonts, but there are significant differences for South Asian fonts. Read the Apple ATSUI specification.

AZERTY keyboard – Refers to French keyboard layouts where the Q key of the English keyboard is the A key on the French keyboard.

AZERTY (Arabic) Keyboard – In some North African countries where French was a dominant language, the Arabic letters in a keyboard utility are mapped following the French AZERTY keyboard. Otherwise letters are typically mapped to match the QWERTY layout.

B Terms

Baltic (Latin 4) – Also known as ISO-8859-4. This is an 8-bit encoding system for Baltic languages including Lithuanian, Latvian, Estonian, Sami and Greenlandic.

B.D.O. – Abbreviation for "bidirectional override". Pages in RTL (right-to-left) scripts like Hebrew and Arabic may need to use the BDO tag to temporarily change the direction of text for English passages. English pages can use the BDO tag to include short passages of an RTL script.

BiDi Support – Abbreviation for "bidirectional support." Languages like Arabic and Hebrew need both RTL (right to left) support for native text and LTR (Left-to-right) support for foreign words written in the Latin alphabet.

Big 5 (Chinese) – a Traditional Chinese encoding developed by a group of five corporations in Taiwan. Like other CJK encoding systems, Big 5 includes the Roman alphabet.

Big Endian UTF-16 BE – A version of UTF-16 in which the four digits of the code are organized into block then code point in the block. For instance code 221B (cube root or ∛) would be chunked as 22.1B. This contrasts with Little Endian where the code point comes before the block.
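The byte order can be checked directly with Python's built-in `utf-16-be` and `utf-16-le` codecs. This is only a sketch using the cube root example from the entry.

```python
cube_root = "\u221B"  # the cube root sign, code 221B

big_endian = cube_root.encode("utf-16-be")
little_endian = cube_root.encode("utf-16-le")

print(big_endian.hex(" "))     # 22 1b
print(little_endian.hex(" "))  # 1b 22
```

Both byte strings decode back to the same character as long as the matching codec is used.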

binary (Base 2) Number – In computing, all data is represented as a series of 0’s and 1’s, where 1 = ON and 0 = OFF. Different software packages convert these to human readable format. Programmers consolidate binary numbers to hexadecimal numbers (base 16) or octal numbers (base 8) for easier processing.

bit mapped font – a font standard from the 1970s and 1980s in which characters were mapped as a series of dots or bits. Unlike later vector fonts, bit mapped fonts could not be expanded without losing quality (they would look like big chunky squares). Bit mapped fonts required separate files for each target size (e.g. 9 point, 12 point, 18 point etc). Bit mapped fonts are being phased out, but may be found in older sites/systems.

Brahmic script (India) – Also known as South Asian scripts. A series of related scripts in India and Sri Lanka descended from an original Brahmic script. These are distinct in that they are syllabic alphabets or abugidas. The best known Brahmic script in the West is probably Devanagari, the script used for Sanskrit, Hindi, Sindhi, Nepali and other languages of North India and Nepal.

Block (Unicode) – For a four-digit hexadecimal Unicode code point, the first two digits are the block. For instance in code 221B (cube root or ∛) the 22 is the block and 1B is the code point within the block. If a five or six-digit Unicode number is given, then the first digits are the plane, followed by the block and code point. See image below for how Unicode hexadecimal numbers are organized into planes and blocks.

Shows Linear B Woman Sign (triangle with head) at point Hex #010081 (01.00.81) 01=plane 00=block 81=codepoint in block
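The plane/block/position split described above can be sketched with a few bit operations. The function name `split_code_point` is hypothetical, and "block" is used here in this glossary's sense of a run of 256 code points, which is an assumption about the intended meaning rather than the named blocks of the Unicode standard.

```python
def split_code_point(cp: int) -> tuple[int, int, int]:
    """Split a code point into (plane, block, position within block)."""
    plane = cp >> 16            # everything above the low four hex digits
    block = (cp >> 8) & 0xFF    # third and fourth hex digits from the right
    position = cp & 0xFF        # last two hex digits
    return plane, block, position


for cp in (0x221B, 0x10081):
    plane, block, position = split_code_point(cp)
    print(f"{cp:06X} -> plane {plane:02X}, block {block:02X}, position {position:02X}")
```

For 010081 this gives plane 01, block 00 and position 81, matching the figure caption.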

BMP (Basic Multilingual Plane) – Synonymous with Plane 0. This refers to code points #0000 through #FFFF (0 through 65,535) in Unicode. Plane 0/BMP includes characters from all modern world scripts.

BOM (Byte Order Mark) – A hidden Unicode control character used to signal if a Unicode string is Little Endian, Big Endian or UTF-8. Not all UTF-8 files include a BOM character, but some do.
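A minimal sketch of BOM detection follows, using the BOM constants from Python's `codecs` module. The function name `sniff_bom` is hypothetical, and UTF-32 is deliberately left out to keep the example short.

```python
import codecs


def sniff_bom(data: bytes) -> str:
    """Report which byte order mark, if any, starts the data."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16 BE"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16 LE"
    return "no BOM found"


with_bom = "hi".encode("utf-8-sig")   # UTF-8 with a BOM prepended
without_bom = "hi".encode("utf-8")    # plain UTF-8, no BOM

print(sniff_bom(with_bom))
print(sniff_bom(without_bom))
```

A missing BOM does not mean a file is not UTF-8; as the entry notes, many UTF-8 files simply omit it.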

breve (short mark) – Another name for the cup mark above short vowels as in ă. The breve is used in Romanian and phonetic transcription of many languages. It is contrasted with the macron or long mark.
Note: The breve is not encoded in Latin 1, and is best handled with UTF-8 Unicode encoding. See the Diacritics page to read more about accented vowels.

browser – an application to view Web pages. Common browsers include Internet Explorer, Firefox, Safari, Netscape, Mozilla or Opera.

C Terms

caron – another term for hachek or haček

character – A single letter or symbol. In internationalization, "character" can refer to either a glyph (appearance of the character) or to its code point (numeric position in an encoding system).

cedille (cedilla) – A small tail found underneath French "C cedille" as ç. This should not be confused with the ogonek (Polish cedilla) which faces the other direction.
Note: Only ç and Ç are encoded in Latin 1.

Central European (Latin 2) – in internationalization, the region of Europe where languages are written in the Roman alphabet, but are encoded as Latin 2, not Latin 1. Central European languages include Czech, Croatian, Hungarian, Polish, Serbian and Slovak.

CE Font – In Classic Mac environments, the CE fonts (e.g. Times CE) included Central European characters such as hachek, ogonek and double acute. These fonts still exist in System OS X, but newer Mac fonts may also include these characters.

circumflex accent – the accent mark on vowels which is shaped like an inverted "v" or "hat" as in ô.

Cyrillic – the technical name for the Russian alphabet, said to have been invented by St. Cyril. The Cyrillic alphabet is also used to write other languages including Turkic languages, and those languages may include letters not found in Russian.

CJK – abbreviation for Chinese-Japanese-Korean or for East Asian scripts in general. The term CJKV (where V is Vietnamese) may also be used.

code point – a numeric position within an encoding system such as ASCII, Unicode or Latin 1. Code points are matched with glyphs or characters. For instance the letter L is code point 76 in ASCII/Unicode (or code point #004C in hexadecimal numbers).
Note: In some Unicode tables, the code point is the last two digits of a Unicode hexadecimal code.
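In Python, the built-in `ord` and `chr` functions convert between a character and its Unicode code point, which makes the letter L example easy to check. A small sketch:

```python
letter = "L"

code_point = ord(letter)               # decimal code point
hex_form = format(code_point, "04X")   # four-digit hexadecimal form

print(code_point)  # 76
print(hex_form)    # 004C

# Going the other way: from a code point back to the character.
round_trip = chr(0x004C)
```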

collation rules (sorting) – rules from each language which determine "alphabetic order" or other order for its characters. For instance, the Welsh alphabet classifies NG as a separate letter following the letter G.

combining character – a Unicode character which is meant to be placed above or below the letter which precedes it. Combining characters can include accent marks or vowel marks. Note, many precomposed letters like é (code #00E9) are actually single characters, but rarer characters such as ɔ́ (#0254, #0301) are actually made of two characters – the base letter plus the combining character.
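The difference between a precomposed letter and a base letter plus combining mark can be seen with Python's `unicodedata.normalize`. This is a sketch only; the two sample letters are an e with acute accent and an open o with acute accent.

```python
import unicodedata

precomposed = "\u00E9"                                   # e with acute, one character
decomposed = unicodedata.normalize("NFD", precomposed)   # e + combining acute

open_o_acute = "\u0254\u0301"                            # open o + combining acute
recomposed = unicodedata.normalize("NFC", open_o_acute)  # no single-character form exists

print(len(precomposed), len(decomposed))   # 1 2
print(len(recomposed))                     # 2
print(unicodedata.name(decomposed[1]))     # COMBINING ACUTE ACCENT
```

Because the two forms of é look identical on screen, software that compares or searches text usually normalizes both sides to the same form first.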

conjunct consonant (South Asian) – In Devanagari and other South Asian scripts, a consonant takes a special "reduced" shape when it’s followed by another consonant. These reduced forms are called conjunct consonants.

control character – These are code points which don’t represent a glyph or character but signify text formatting commands like right to left text vs. left to right text or which kind of line break you are using. ASCII has control characters just in positions #0-31 (and most software programs recognize them), but Unicode includes additional control characters that older programs don’t recognize.
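Unicode assigns every code point a general category, and Python exposes it through `unicodedata.category`. The sketch below compares two ASCII control characters with one of the newer Unicode formatting controls (the right-to-left mark); the sample set is arbitrary.

```python
import unicodedata

samples = {
    "tab": "\t",
    "line feed": "\n",
    "right-to-left mark": "\u200F",
    "letter A": "A",
}

# "Cc" = control, "Cf" = format control, "Lu" = uppercase letter.
categories = {label: unicodedata.category(ch) for label, ch in samples.items()}

for label, category in categories.items():
    print(f"{label:20} {category}")
```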

control code (HTML) – another name for an HTML entity code.

CY Font (Cyrillic) – In Classic Mac environments, the CY fonts (e.g. Times CY) included Cyrillic characters. These fonts still exist in System OS X, but newer Mac fonts may also include these characters.
Note: The original encoding for these fonts was MacCyrillic which differs from other encodings.

D Terms

daseia (Greek) – Also spelled as dasia or δασεῖα. Although it is called a breath mark, it actually represents the /h/ sound. In appearance it resembles a left quote mark (‘).

DBCS (double byte character set) – a character set which allows up to 4 hexadecimal digits, or up to 2^16 (65,536) characters. These were initially developed for East Asian languages like Chinese and Japanese.

decimal (Base 10) number – the common number system with numerals 0-9. Decimal numbers contrast with hexadecimal numbers used in Unicode.

Devanagari (India) – A form of a Brahmic script used to write Sanskrit and several modern languages of North India including Hindi, Nepali, Marathi, Sindhi and several other languages.

diacritic – a cover term for any accent mark or symbol which appears above, below
or beside a letter. Diacritics also include vowel marks used in some scripts.

dialect (linguistics) – In linguistics, two forms are dialects of a language if speakers from different regions can understand each other despite minor variations. For instance, most speakers of U.S. English can understand British English even though there are variations in pronunciation and grammar.

dialect (political) – Usually a set of related spoken minority languages which are classified as "dialects" of a majority language even though forms are different enough to be classified as separate languages by linguists. In political dialects, speakers from different regions may claim to speak the same "language", but cannot understand speakers from different regions. For instance a Chinese speaker from Hong Kong (Cantonese) may not be able to understand a speaker from Shanghai (Wu) unless the Hong Kong speaker learns Wu.
Note: In India, a language without a unique script may also be classified as a "dialect" instead of a "language."

dialytika (Greek) – Also spelled διαλυτικά. Greek term for diaeresis or umlaut. Used in both monotonic and polytonic Greek to mark two adjacent vowels which are pronounced separately.

diaeresis (also spelled dieresis or diaresis) – another name for the umlaut accent.

diglossia – A situation in which one linguistic form is used in writing and educated speech and another form is used at home or when shopping. In many cases such as Arabic, Greek or Sinhala, the educated form is a much older form of the "modern" colloquial language used at home. In other cases, the educated form may be a separate language.

double acute – two acute accents above a vowel or consonant as in ő. Used in Hungarian and other languages.
Note: Double Acute is associated with Latin 2 encoding. See the Hungarian page for information on the double acute.

double byte encoding – Larger encoding sets used for East Asian CJK scripts. "Double-byte" refers to the fact that each character is stored in two bytes, which allows up to 2^16 (65,536) characters, or the 2^8 (256) values of a single byte squared.
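The arithmetic and the byte counts can be checked quickly in Python. This sketch uses Big 5 as the double byte example and Latin 1 as the single byte example; the sample characters are arbitrary.

```python
single = "A".encode("latin-1")      # one byte per character
double = "\u4E2D".encode("big5")    # a Chinese character ("middle") takes two bytes

print(len(single), len(double))  # 1 2

# Number of distinct values available in one byte versus two bytes.
one_byte_values = 2 ** 8
two_byte_values = 2 ** 16
print(one_byte_values, two_byte_values)  # 256 65536
```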

E Terms

ELOT928 – an 8-bit encoding system developed by the Greek government for modern monotonic Greek. ELOT928 is equivalent to ISO-8859-7.
Note: Ancient polytonic Greek is not covered by ELOT, but must be encoded in Unicode.

emoji – Picture icons used in Japanese text messages similar to emoticons, but also including weather symbols, holiday symbols, food and drink symbols and more. The use of symbols is so popular in Japan, that many are scheduled to be included in Unicode.

encoding – any computing scheme in which a character such as "a" is assigned a numeric value. Encoding systems include ASCII, Latin 1, Unicode and many
other ones designed for specific languages or regions.

entity code – Sometimes known as control code. In HTML, these are codes used to insert non-ASCII symbols and non-English characters. Examples include &copy; or its equivalent numeric entity code &#169; for the © copyright symbol. All entity codes begin with an & and end with a semi-colon. See HTML Entity Codes for more information and examples.
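To see what a browser does with entity codes, Python's standard `html.unescape` function can decode them. This is a small sketch using the copyright symbol, written once as a named entity and once as a numeric entity.

```python
import html

named = html.unescape("&copy;")     # named entity
numeric = html.unescape("&#169;")   # decimal numeric entity

print(named, numeric)
print(named == numeric)
```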

EUC-CN (GB2312) – More commonly known as GB2312.
Note: "EUC" is an abbreviation for Extended Unix Code.

EUC-JP – An encoding system for Japanese predating Unicode. This is used primarily on Unix systems and was meant to consolidate multiple JIS encodings.
Note: "EUC" is an abbreviation for Extended Unix Code.

EUC-KR – An encoding system for Korean predating Unicode.
Note: "EUC" is an abbreviation for Extended Unix Code.

EUC-TW – a Traditional Chinese encoding based on CNS 11643 and developed by the government of Taiwan. Like other CJK encoding systems, EUC-TW includes the Roman alphabet.
Note: "EUC" is an abbreviation for Extended Unix Code.

F Terms

Farsi – the name for Persian within the Persian language spoken in Iran. However, some speakers of Iranian heritage may prefer the term Persian over Farsi.

Furigana (Hurigana) – A style of Japanese writing in which phonetic Katakana and Hiragana are placed above Kanji (Chinese) characters in order to provide a pronunciation hint. The RUBY specification for vertical writing systems is designed primarily for Furigana writing.

G Terms

GB18030 – The most recent Simplified Chinese encoding standard issued by the government of the People’s Republic of China and mandated for all computers sold in China. The majority of Unicode code points are included in GB18030. Both I.B.M. and Microsoft have information about GB18030. The older encoding standard is GB2312.
Note: GB is Guojia Biaozhun or "national standard."

GB2312 – a Simplified Chinese encoding developed in 1980 by the government of the People’s Republic of China. Like other CJK encoding systems, GB2312 includes the Roman alphabet as well as the Greek alphabet and the Cyrillic alphabet. Note that the Windows version of "GB2312" includes GBK characters. The newest encoding standard is GB18030.
Note: GB is Guojia Biaozhun or "national standard."

GBK – Additional characters in GB2312. Note that the Windows version of "GB2312" includes GBK characters.

grave accent – the accent mark on vowels which slants backwards as in ò.

glyph – a term referring to the shape of a character, but not to its code point value. For instance the glyph A has at least three code points – it’s #65 in the Latin alphabet, #913 in the Greek alphabet and #1040 in the Cyrillic alphabet.
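The three look-alike capital A characters can be told apart by code point and by official Unicode name. A short sketch using Python's `unicodedata` module:

```python
import unicodedata

lookalikes = {
    "Latin": "\u0041",
    "Greek": "\u0391",
    "Cyrillic": "\u0410",
}

code_points = {script: ord(ch) for script, ch in lookalikes.items()}

for script, ch in lookalikes.items():
    print(f"{script:9} {ch} #{ord(ch):<5} {unicodedata.name(ch)}")
```

Although the three glyphs are usually drawn identically, a search for Latin A will not match the Greek or Cyrillic letter because the underlying code points differ.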

H Terms

hachek (haček) accent – an accent used in Central European languages such as Czech and Slovenian and to mark some tones in African languages. In appearance it resembles a miniature "v" over a letter (or an upside-down circumflex) as in č,š. Another term is caron or wedge.
Note: The hachek is associated with Latin 2 encoding.

halant – a mark found in many Indian scripts which indicates that a consonant is NOT followed by a vowel. It is often placed in the same types of location as other vowel marks.

Hangul – The name for the Korean writing system. This is a syllabic alphabet in which individual symbols are composed of
consonantal symbols combined with additional symbols for each vowel. Read more about Hangul.

Hanja – The Korean name for Chinese characters included in a Hangul text. Hanja are most similar to Traditional Chinese characters.
Note: Hanja are rare in modern Korean texts.

Hanzi (Hànzi) – The Mandarin Chinese name for the ideographic characters used in CJK writing systems. These are also called Kanji in Japanese and Hanja in Korean. The older form of Hanzi is Traditional Chinese and the newer form used in China is Simplified Chinese.

-Hans and -Hant – In Chinese language tags, these indicate that a Chinese text is written in the Simplified Chinese script or Traditional Chinese script respectively.

hexadecimal (base 16) number – A number system which includes the digits 0-9, plus A-F for the values 10-15. It is used by programmers as a way of consolidating the representation of underlying binary numbers. Unicode code points are officially given in terms of hexadecimal values, although decimal values can also be used in some situations.
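Converting between hexadecimal, decimal and binary is built into Python. The sketch below uses the code 221B that appears elsewhere in this glossary.

```python
decimal_value = int("221B", 16)               # hexadecimal text to a number
back_to_hex = format(decimal_value, "04X")    # number back to four hex digits
as_binary = format(decimal_value, "016b")     # the same number as 16 binary digits

print(decimal_value)  # 8731
print(back_to_hex)    # 221B
print(as_binary)      # 0010001000011011
```

Each hexadecimal digit stands for exactly four binary digits (2 = 0010, 1 = 0001, B = 1011), which is why hexadecimal is a convenient shorthand for binary.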

Hentaigana – an older, more ornate form of Hiragana used in formal Japanese documents such as diplomas and shop names.

Hiragana – one of the native phonetic syllabary scripts of Japanese. Hiragana (historically known as onnade or "women’s hand") is rounded with larger loops and is used for specifying certain grammatical endings or okurigana. The other common syllabary is katakana. An older form of Hiragana is called Hentaigana.

Hurigana – Alternate spelling of Furigana.

HKSCS (Hong Kong Supplementary Character Set) – a Traditional Chinese encoding which also includes characters used in the Cantonese forms of Hong Kong. Like other CJK encoding systems, HKSCS includes the Roman alphabet.

I Terms

i18n – the abbreviation for "internationalization" where ’18’ refers to the fact that "internationalization" has 18 letters between the initial I and the final N.
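The abbreviation rule is simple enough to express as a one-line function. The name `numeronym` below is hypothetical; the sketch keeps the first and last letters and counts what lies between them.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + count of middle letters + last letter."""
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization"))  # i18n
print(numeronym("localization"))          # l10n
```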

ideogram – a symbol which represents a concept instead of a sound. Ideograms in the Roman alphabet include numerals like 1,2,3 and symbols like &,$,£ and ♥. Numerals are ideograms because pronunciation varies from language to language, but the meaning remains the same (e.g. 1 = one in English, but uno in Spanish).

ideographic script – a script in which each character represents a semantic concept (that is, most characters are ideograms). Chinese is the most commonly used ideographic script.

INSCRIPT keyboard (India) – Common keyboard layout for multiple scripts of India where the Caps Lock key toggles between the English keys and Indian script keys. Letters with the same phonetic meaning are found in the same position. For instance the same key would trigger either Devanagari (k) or Gujarati (k) depending on the user’s set up.

ISCII (Indian Script Code for Information Interchange) – An 8-bit encoding standard of India meant to facilitate conversion between Brahmic scripts of India. For instance Devanagari (k) and Gujarati (k) would be assigned the same code point. In theory a user could change scripts just by changing fonts or language setting, but many report practical difficulties with this. Most users are transitioning to Unicode. Read more about ISCII (India Department of Information Technology).

ISO – The International Organization for Standardization (www.iso.org), often called the International Standards Organization, a body which registers a variety of international standards including encoding standards.

ISO-639 Language Tag – a two letter code indicating the language of a document or portion of a document. Codes have been registered with the ISO Organization. Read more about Language Tags.

ISO-639-2 Language Tag (LOC.gov) – Newer three letter codes indicating the language of a document or portion of a document. ISO-639-2 includes more languages than the original language tags, but not as many as later versions of the specification. These are being superseded by later versions including ISO-639-3 (below).

ISO-639-3 Language Tag (SIL.org) – The most recent set of three letter codes. Unlike previous versions, these codes include many minority languages needed for linguistic data. However, the recommendation for the Web is to use the two letter code if one exists, then only use three letter codes as needed. Read more about Language Tags.

ISO-2022 CJK Encodings – A standardized mechanism for encoding Chinese, Japanese and Korean, registered with the ISO Organization. These have been superseded by Unicode. Variants include ISO-2022-JP (Japanese), ISO-2022-KR (Korean) and ISO-2022-CN (Chinese).

ISO 8859 Encodings – A standardized mechanism for creating 8-bit encodings for different scripts, registered with the ISO Organization. All ISO-8859 encodings include ASCII in positions #0-127 and additional characters in positions #128-255. Registered variants are listed below:
Note: Almost all ISO-8859 characters can be replaced with Unicode.

  • ISO-8859-1 (Latin 1 for Western European)
  • ISO-8859-2 (Latin 2 for Central European)
  • ISO-8859-3 (Latin 3, supports Maltese and Esperanto)
  • ISO-8859-4 (Latin 4 for Baltic)
  • ISO-8859-5 (Cyrillic, somewhat rare)
  • ISO-8859-6 (Arabic, same as ASMO 708, somewhat rare)
  • ISO-8859-7 (same as ELOT928, Modern Monotonic Greek)
  • ISO-8859-8 (Visual Hebrew (Avoid))
  • ISO-8859-8-i (Logical Hebrew)
  • ISO-8859-9 (Turkish alphabet, also known as Latin 5 or incorrectly as Latin 9)
  • ISO-8859-10 (Rare, Baltic with letters for Greenlandic and Sami, superseded by Unicode)
  • ISO-8859-11 (Avoid. Meant to match TIS-620 (Thai) but was rejected.)
  • ISO-8859-12 (Meant for Vietnamese or Indic, but could not be encoded)
  • ISO-8859-13 (Meant to include additional letters for Baltic languages)
  • ISO-8859-14 (Avoid. Meant for Celtic languages, but little software support implemented. Superseded by Unicode)
  • ISO-8859-15 (Latin 1 with Euro (€) symbol)
  • ISO-8859-16 (Avoid. Romanian alphabet, but little software support implemented)
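The shared ASCII half and the differing upper half of the ISO-8859 family can be illustrated by decoding the same bytes with several variants. This is a sketch using Python's standard codec names (`iso8859_1`, `iso8859_5` and so on); the byte values chosen are arbitrary examples.

```python
variants = ["iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7", "iso8859_15"]

# Positions 0-127: every variant agrees with ASCII.
ascii_view = {name: b"Hello".decode(name) for name in variants}

# Positions 128-255: the same byte maps to different characters.
upper_e4 = {name: bytes([0xE4]).decode(name) for name in variants}
upper_a4 = {name: bytes([0xA4]).decode(name) for name in variants}

for name in variants:
    print(f"{name:11} 0xE4 -> {upper_e4[name]}   0xA4 -> {upper_a4[name]}")
```

Byte 0xA4 shows the one practical difference most users meet between ISO-8859-1 and ISO-8859-15: the generic currency sign in the former is the Euro sign in the latter.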

ISO-10646 – Versions of Unicode as registered with the ISO Organization. The term is generally restricted to standards documentation.

Ivrit – An alternate name for Modern Hebrew, based on the actual pronunciation of עברית "Hebrew".

J Terms

Jaguar (Macintosh) – A code name for Macintosh system OS X 10.2. The next versions are 10.3 (Panther) and 10.4 (Tiger).

JIS Encodings – A series of Japanese encodings developed as an official Japanese Industrial Standard by the Japanese government. JIS encodings include JIS X 0201, JIS X 0208, JIS X 0212 and others. JIS Encodings are incorporated into EUC-JP, Shift-JIS and ISO-2022-JP.

K Terms

Kanji – The Japanese name for Chinese characters incorporated into Japanese texts. A typical Japanese text uses Kanji for common words and phrases. A text written entirely in phonetic scripts (Katakana, Hiragana, Rōmaji) would be considered suitable for children or learners.
Note: The Chinese name for these characters is Hanzi.

Katakana – one of the native phonetic syllabary scripts of Japanese. Katakana is angular and resembles a streamlined Chinese character. Katakana is used for foreign words, some company names, new Japanese words and other words in which a pronunciation needs to be specified. The other common syllabary is Hiragana.

Katakana Phonetic Extensions – Smaller versions of some Katakana syllables used to write the Ainu language of Northern Japan. These are encoded in the #31Fx block of Unicode separate from the original Katakana block.

Kedmanee (Thai Keyboard) – One of the two commonly supported layouts for Thai keyboards. This keyboard was developed first and is more commonly used.

keyboard – a utility which allows you to change the mapping of your keys and characters. On a US computer, the default keyboard maps the D key to the D character, but if you switch to a Russian keyboard, the D key might generate a Д (Cyrillic D); switching to a Greek keyboard might cause the D key to generate a Δ (Greek delta). Using keyboards instead of just fonts ensures that your data is encoded properly.
Note: Keyboards can come in QWERTY/phonetic/homophonic/transliterated layouts or native layouts.

KOI-8 – an 8-bit encoding system for Cyrillic developed in the former Soviet Union. KOI-8 is primarily found on Unix machines; it differs significantly from Windows-1251 encoding.
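How much the two Cyrillic encodings differ can be seen by encoding one letter both ways. This sketch assumes Python's `koi8_r` codec (the Russian variant of KOI-8) as a stand-in for KOI-8 in general, and uses the Cyrillic letter Д as an arbitrary sample.

```python
cyrillic_d = "\u0414"  # Cyrillic capital letter De

koi8_bytes = cyrillic_d.encode("koi8_r")
cp1251_bytes = cyrillic_d.encode("cp1251")

print(koi8_bytes.hex(), cp1251_bytes.hex())

# Reading KOI-8 bytes as Windows-1251 yields a different Cyrillic letter.
misread = koi8_bytes.decode("cp1251")
print(misread)
```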

L Terms

l10n – the abbreviation for "localization" where ’10’ refers to the fact
that "localization" has ten letters between the initial L and the final N.

language (linguistics) – In linguistics, two related forms are considered separate languages if speakers cannot understand each other. For instance, English and German are related Germanic languages, yet an English speaker needs special training to understand German.
Note: Some forms like Scots/English are on the edge of being either distant dialects or very close languages.

language (political) – Separate "languages" which are actually linguistic dialects of one language. These are usually closely related dialects which may have their own literary tradition, spelling system or even separate governments. A classic example is the Scandinavian languages Danish, Swedish and Norwegian which are related enough for speakers of all three languages to understand each other with minimal difficulty, yet are classified as separate languages for political reasons. Another example is Urdu (Arabic script) and Hindi (Devanagari script) which are similar in basic grammar (although they have different technical vocabulary).

Latin/Roman alphabet – the technical name for the Western European or English alphabet.
The name refers to the fact that this version of the alphabet was developed by the Latin speaking Roman Empire.

Latin 1 (ISO-8859-1) – An 8-bit encoding for Western European languages standardized by the ISO (International Standards Organization). Western European languages include Spanish, German, Dutch, Italian, Scandinavian languages and non-European languages like Swahili and Tagalog whose alphabets are covered by the Latin 1 encoding. Non-Latin 1 languages include Central European languages, Baltic languages, Turkish, Welsh, Māori, Hawai’ian and other languages.

Characters missing in Latin 1 include the Euro symbol (€), characters like OE Ligature (œ) and capital Y umlaut (Ÿ) from French and special punctuation symbols like the en-dash (–), em-dash (—) and smart quotes (“ ”).
Note: Many American software products default to ISO-8859-1 encoding and may need to be changed to UTF-8.
Read more about Latin 1.
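One rough way to test whether a character is covered by Latin 1 is to try encoding it with Python's `latin-1` codec and catch the error. The helper name `fits_latin1` is hypothetical.

```python
def fits_latin1(ch: str) -> bool:
    """Return True if the character can be encoded in ISO-8859-1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


samples = {
    "e acute": "\u00E9",
    "euro sign": "\u20AC",
    "oe ligature": "\u0153",
    "capital Y umlaut": "\u0178",
    "en-dash": "\u2013",
}

coverage = {label: fits_latin1(ch) for label, ch in samples.items()}
for label, ok in coverage.items():
    print(f"{label:17} {'in Latin 1' if ok else 'missing from Latin 1'}")
```

The same text encodes without errors as UTF-8, which is one reason the note above recommends changing the default.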

Latin 2 (ISO-8859-2) – An 8-bit encoding for Central European languages standardized by the ISO (International Standards Organization). Latin 2 characters include the hachek, ogonek and double acute.
Note: A transition to Unicode is recommended for these languages.

Latin 3 (ISO-8859-3) – An 8-bit encoding for miscellaneous European languages standardized by the ISO (International Standards Organization). Languages supported by Latin 3 include Maltese and Esperanto.
Note: A transition to Unicode is recommended for these languages.

Latin 4 (ISO-8859-4) – An 8-bit encoding for Baltic languages standardized by the ISO (International Standards Organization). Languages supported by Latin 4 include Estonian, Lithuanian, Latvian, Sami and Greenlandic.
Note: A transition to Unicode is recommended for these languages.

Little Endian UTF-16 LE – A version of UTF-16 in which the four digits of the code are organized into code point first, then the block. For instance code 221B (cube root or ∛) would be chunked as 1B.22. This contrasts with Big Endian where the block comes before the code point.

localization (l10n) – Refers to the process of adapting the same content for different regions. Localization includes changing display of dates, currency symbols and punctuation as well as translation. For instance the U.S. date format of Jan 1, 2001 would be localized to 1 Jan 2001 in the United Kingdom.

Logical Hebrew (ISO-8859-8-i) – An 8-bit encoding system registered with the ISO which was revised to include control characters to allow input of characters in proper text order. For instance in Logical Hebrew and Unicode, the word Shibboleth (שבלת) can be typed in as "SH-B-L-T" and the word would be displayed in correct RTL (right to left) order. The older version, Visual Hebrew, requires developers to input the text backward (T-B-L-SH).
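The difference between logical and visual order can be sketched with a Python string. Here the four Hebrew letters are stored in the order they are typed and read (shin, bet, lamed, tav); the reversed string imitates what Visual Hebrew stores. Escapes are used so the example does not depend on how an editor displays RTL text.

```python
import unicodedata

# Logical order: the first letter typed (shin) is stored first.
logical = "\u05E9\u05D1\u05DC\u05EA"

# Visual order: the same letters stored backward, as Visual Hebrew required.
visual = logical[::-1]

# Each Hebrew letter carries a right-to-left direction property ("R"),
# which is what lets software display logical text in RTL order.
directions = [unicodedata.bidirectional(ch) for ch in logical]

print([unicodedata.name(ch) for ch in logical])
print(directions)
```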

long mark – the horizontal long mark over Latin and Hawaiian long vowels as in ō. Also known as a macron.

long vowel – in many languages, words can be distinguished by how long the vowel is pronounced. For instance Latin distinguishes mala ‘bad’ with a short vowel versus māla ‘apple’ with a long vowel. Different languages may use accent marks to mark which vowels are long.

L.T.R. – an abbreviation for "left-to-right" writing. This contrasts with RTL right-to-left writing. Most writing systems, including that for English are LTR.

M Terms

macro language – a series of related spoken languages which are political dialects of a majority language. Examples of macrolanguages include "Chinese" (which includes Mandarin, Cantonese, Wu and other varieties) and "Arabic" (which includes Modern Standard Arabic, Iraqi Arabic, Egyptian Arabic, Levantine Arabic, Moroccan Arabic and other varieties).

macron – another name for the horizontal long mark over Latin and Hawaiian long vowels as in ō. Also known as makron in Greek. It is contrasted with the breve or short mark.
Note: The macron is not encoded in Latin 1, and is best handled with UTF-8 Unicode encoding. See the Diacritics page to read more about accented vowels.

Mac Classic – Refers to earlier Macintosh operating systems up through System 9 and is based on MacRoman. In System 10 (OS X), the operating system was changed to a Unix base and Unicode encoding.

MacRoman – The original 8-bit 256 character encoding used on the Macintosh up through System 9. MacRoman covered English, Spanish, French, German and other Western European languages. This was replaced by Unicode as of System OS X. Earlier versions of Macintosh also included encodings for different scripts such as MacCyrillic, MacArabic and so forth.

Mac System OS X – Also known as System 10. Refers to newer versions of the Mac operating systems which are based on Unix and encoded in Unicode.

monotonic (Modern Greek) – Refers to a simpler form of the Greek alphabet used mainly for Modern Greek. The name "monotonic" refers to the fact that multiple accents used in Ancient Greek spellings were consolidated under one tonos (displayed as an acute accent). Monotonic Greek contrasts with Polytonic Greek spelling in which multiple accents are preserved.
Note: Monotonic Greek was only officially adopted in 1982, so some users still consider Polytonic Greek more correct.

N Terms

Naskh (Arabic) font – the style of Arabic letters used for modern Arabic text and modern Persian. This contrasts with Nastaliq forms used in Urdu and other languages.

Nastaliq (Arabic) font – a font in which the letters resemble the Nasta’līq form of Arabic calligraphy. Nastaliq writing is used for languages in Pakistan and Afghanistan, whereas Arabic and other languages use the Naskh form of the Arabic alphabet. Nastaliq fonts often include letters needed for Urdu, Pashto, Sindhi and neighboring languages which are not found in Arabic. Nastaliq was historically used for Persian, but many modern Persian documents are written in Naskh.

native keyboard layout – a keyboard for a non-Latin script in which the keys are mapped to match the keyboard used in the country where the script is native, rather than the sounds of the QWERTY keys. For instance a native Russian keyboard layout may map the A key to Cyrillic Ф instead of Cyrillic A. This contrasts with a QWERTY/phonetic/homophonic/transliterated keyboard where the A key would be mapped to Cyrillic A.

nikud/nikkud/niqud/niqqud – the term used for different sets of points to optionally represent vowel sounds or other phonetic details in a Hebrew text. Most Hebrew documents are written with consonants only and do not include these marks. Some documents may include markings beyond the nikud set such as cantillation marks.

numeric entity code (HTML) – a code used to insert non-ASCII symbols and non-English characters into a Web page, e.g. &#169; (decimal) or &#xA9; (hexadecimal) for the © copyright symbol. The numbers are the Unicode code points of the characters. All numeric entity codes begin with &# and end with a semicolon. Hexadecimal entity codes must include an x after the &#. See Unicode Numeric Codes for more information and examples.
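The relationship between a character, its code point and its two numeric entity forms can be sketched in Python. This is a minimal illustration only; the helper names to_decimal_entity and to_hex_entity are made up for this example, while html.unescape is part of the Python standard library.

```python
import html

def to_decimal_entity(ch: str) -> str:
    """Build the decimal entity: &# + code point in base 10 + semicolon."""
    return f"&#{ord(ch)};"

def to_hex_entity(ch: str) -> str:
    """Build the hexadecimal entity: &#x + code point in base 16 + semicolon."""
    return f"&#x{ord(ch):X};"

copyright_sign = "\u00A9"  # the © symbol, code point U+00A9 (decimal 169)

decimal_entity = to_decimal_entity(copyright_sign)  # "&#169;"
hex_entity = to_hex_entity(copyright_sign)          # "&#xA9;"

# Both forms decode back to the same character.
decoded_decimal = html.unescape(decimal_entity)
decoded_hex = html.unescape(hex_entity)
```

Either form can be pasted into an HTML page; a browser resolves both to the same character.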

O Terms

OpenType (OTF) Font – A vector font standard developed jointly by Adobe and Microsoft in the late 1990s and early 2000s to replace both Postscript and TrueType (TTF). OpenType fonts include additional information on ligatures, alternate forms and placement of adjacent letters. Although Macintosh computers can use most OTF fonts, OTF fonts for South Asian scripts are not fully supported by Apple. Apple uses the ATSUI scheme for its South Asian fonts. Read the Microsoft OpenType specifications.

ogonek (Polish Cedilla) – a small tail found underneath some Polish vowels as in ę, ą. This should not be confused with the French cedilla (cédille), which faces the other direction.
Note: Ogonek is associated with Latin 2 encoding. See the Polish page for information on the ogonek.

Okurigana – Japanese case endings and grammatical endings written in Hiragana placed after Kanji (Chinese) characters. This allows Japanese to combine native grammatical information with ideographs from the Chinese writing system.

oxia (Ancient Greek) – Also spelled oxeia. The Greek term for acute accent. Used only in polytonic Greek.

P Terms

Panther (Macintosh) – A code name for Macintosh system OS X 10.3. The next version is 10.4 (Tiger); the preceding version is 10.2 (Jaguar).

Pattachote (Thai Keyboard) – One of the two commonly supported layouts for Thai keyboards. This keyboard was developed after the Kedmanee layout and is not as commonly used, but is said to be more efficient.

perispomeni (Ancient Greek) – Also spelled as περισπωμένη or perispomene. Although the name literally translates as "circumflex", the appearance is either a tilde or an "inverted breve" arch depending on the font. Found in polytonic Greek.

phonetic keyboard – a synonym for QWERTY keyboard

phonetic script – a script which is based on sound instead of concepts. Each character in a phonetic script represents a distinct sound or syllable. Phonetic scripts include alphabets, syllabaries, abjads and abugidas.

Pinyin (Chinese) – the system of transliterating Chinese words in the Latin (English) alphabet. It was developed in the 1950s in Mainland China to help increase literacy.

plane (Unicode) – In a six-digit hexadecimal code, the first two digits are the plane. For instance in the code 010081 (Linear B woman ideogram), the 01 represents Plane 1. If the code is four digits or less, then the plane is 0. Thus, code 221B (cube root or ∛) is equivalent to 00221B. See image below for how Unicode hexadecimal numbers are organized into planes and blocks.

Shows Linear B Woman Sign (triangle with head) at point Hex #010081 (01.00.81) 01=plane 00=block 81=codepoint in block
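The plane/block/code point breakdown described above can be sketched with a few bit operations. This is an illustrative sketch only: the function name split_code_point is hypothetical, and "block" is used here in this glossary's sense (the middle two hexadecimal digits), not in the sense of the named blocks in the Unicode charts.

```python
def split_code_point(cp: int) -> tuple[int, int, int]:
    """Split a code point into (plane, block, position), two hex digits each."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    plane = cp >> 16            # first two of six hexadecimal digits
    block = (cp >> 8) & 0xFF    # middle two hexadecimal digits
    position = cp & 0xFF        # last two hexadecimal digits
    return plane, block, position

# Linear B woman ideogram, #010081: plane 01, block 00, position 81
linear_b_woman = split_code_point(0x10081)

# Cube root sign, #221B: a four-digit code, so the plane is 0
cube_root = split_code_point(0x221B)
```

Formatting each part with f"{value:02X}" gives back the two-digit groups shown in the image description (01.00.81).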

Plane 0 (Unicode) – Code points #0000 through #FFFF (65,535) in Unicode. This is the earliest plane of Unicode and includes characters from all modern world scripts. At one point Unicode was limited to this plane, but it was later expanded to additional planes in order to include ancient scripts and other symbols.
Note: Plane 0 is also known as the BMP (Basic Multilingual Plane)

Plane 1 (Unicode) – Code points #10000 through #1FFFF. An extra set of 16 planes, including this one, was created to handle characters that could not fit into the first #FFFF (65,535) code points.

polytonic (Ancient Greek) – Refers to the form of the Greek alphabet for Ancient Greek. The name "polytonic" refers to the fact that the multiple accents used in Ancient Greek spellings are preserved. Polytonic Greek contrasts with Monotonic Greek spelling in which only one accent is used.
Note: Greek specialists refer to accents by their Greek names. See the entries for oxia, varia, perispomeni, psili, vrachy and ypogegrammeni in this glossary.

Postscript font – A vector font standard developed by Adobe in the 1980s. This is being superseded by OpenType (OTF) fonts, which include ligature and placement features for non-Western scripts.
Note: Even after TrueType (TTF) fonts were developed, typographers tended to prefer Postscript because that standard had more ligature and placement control.

psili (Ancient Greek) – Also spelled as ψιλή or psile. Although it is called a "breath mark", it is used to indicate the absence of an /h/ in front of a vowel (in contrast with the daseia, which is the accent mark for /h/). In appearance it resembles a right quote mark (’). Found only in polytonic Greek.

 

Q Terms

QWERTY Keyboard – a keyboard for a non-Latin script in which the keys are mapped to match the approximate Latin (English) alphabet letter value on a standard QWERTY keyboard. For instance, in a Russian QWERTY keyboard, Latin "D" would be Cyrillic "Д", Latin "P" would be Cyrillic "П", and so forth. Other terms include phonetic keyboard or transliterated keyboard. This contrasts with native keyboard layouts where Latin keys are mapped following another country’s keyboard. For instance a native Russian keyboard layout may map the A key to Cyrillic Ф instead of Cyrillic A.
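The difference between a phonetic (QWERTY) layout and a native layout can be pictured as two different lookup tables for the same physical keys. The sketch below is illustrative only: the dictionary names and the type_keys helper are made up, and each table covers just four keys (the native values follow the standard Russian ЙЦУКЕН arrangement).

```python
# Partial, illustrative key maps: physical Latin key -> Cyrillic letter produced.
PHONETIC_RU = {"a": "а", "d": "д", "p": "п", "f": "ф"}  # matched by sound
NATIVE_RU = {"a": "ф", "d": "в", "p": "з", "f": "а"}    # matched by key position

def type_keys(keys: str, layout: dict[str, str]) -> str:
    """Return the text produced by pressing the given keys under a layout."""
    # Keys missing from the partial table are passed through unchanged.
    return "".join(layout.get(key, key) for key in keys)

phonetic_output = type_keys("dap", PHONETIC_RU)  # "дап"
native_output = type_keys("dap", NATIVE_RU)      # "вфз"
```

The same three key presses give different Cyrillic text, which is why a user needs to know which kind of layout is active.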

R Terms

reverse solidus – typographical term for "backslash" or \.

Roman/Latin Alphabet – the technical name for the Western European or English alphabet. The term comes from the fact that it was developed by the Romans and spread throughout Europe via the Empire.

Rōmaji – In Japanese writing, this refers to words written in the Roman alphabet or "English characters".

R.T.L. – an abbreviation for "right-to-left" writing used in Hebrew, Arabic and other languages. This contrasts with LTR left-to-right writing.

RUBY – A specification primarily designed to handle Japanese Furigana writing. Ruby allows gloss or pronunciation characters to be placed above or below a line of text. In Japanese, phonetic characters are placed above Chinese Kanji characters. Read more about Ruby.

S Terms

SBCS – a single-byte (2^8) character set containing up to 256 characters. Also known as an 8-bit encoding.

Shift-JIS – A pre-Unicode Japanese encoding system developed by Microsoft which includes the JIS Japanese encoding systems JIS X 0201 and JIS X 0208. Shift-JIS is also known as CP 932 within the Windows operating system.

Simplified Chinese – newer simplified form of the Chinese script developed by the People’s Republic of China. Also used in Singapore. As part of the simplification, several Traditional Chinese characters were collapsed into one character in Simplified Chinese. The advantage is that it may be more legible at smaller font sizes.

Smart Quotes – A tool in Microsoft Word and other text editors to convert straight quotation marks ("quote") to curving quotation marks (“quote”). In terms of encoding though, the straight quote is within the ASCII character set, while the Smart Quotes are two characters outside of ASCII and must be handled as Unicode characters online.
Note: Web pages where Smart Quotes are copied and pasted from Word may not display the characters correctly unless they are replaced by the entity codes &ldquo; (&#8220;) and &rdquo; (&#8221;) respectively.
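The encoding difference between straight and curving quotes can be checked directly. The snippet below is a small sketch using only standard Python string methods; the variable names are arbitrary.

```python
straight = '"quote"'           # ASCII quotation marks (U+0022)
curly = "\u201Cquote\u201D"    # left and right curving quotes (U+201C, U+201D)

straight_is_ascii = straight.isascii()  # True
curly_is_ascii = curly.isascii()        # False

# Replace every non-ASCII character with its decimal numeric entity code.
escaped = curly.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

The escaped string contains only ASCII characters, so it survives being saved or served in almost any encoding.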

SMP (Supplementary Multilingual Plane) – This refers to Plane 1 or code points U+10000 to U+1FFFF. This Plane includes ancient scripts, Emoji, specialized technical symbols, rarer South Asian scripts and other characters.

South Asian scripts – Also known as Brahmic scripts. This refers to the group of syllabic alphabetic or abugida scripts used in India, Sri Lanka, Bangladesh, Pakistan and elsewhere descended from a prototype Brahmic script. Although alphabetic, they are sometimes treated as syllabaries because each letter is a consonant and vowels are marked by diacritics. South Asian script support is complex because vowel marks can be placed above, below, after or before a consonant. In addition, consonant clusters can sometimes be rendered as unique conjunct consonants.

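surrogate (code point) – a mechanism used by the UTF-16 encoding to represent characters above Plane 0 (above xFFFF) as a pair of 16-bit units (4 bytes in total). Each unit of the pair is taken from a reserved range of Plane 0 (xD800 through xDFFF).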
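The arithmetic behind a surrogate pair can be sketched as follows. The function name to_surrogate_pair is made up for this example; the constants are the standard UTF-16 values. The final comparison checks the result against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Return the (high, low) UTF-16 surrogates for a code point above xFFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above Plane 0 use surrogates")
    offset = cp - 0x10000              # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

# Linear B woman ideogram, code point #010081
high, low = to_surrogate_pair(0x10081)

# The pair written as four bytes (big endian) matches the built-in encoder.
pair_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
matches_builtin = pair_bytes == chr(0x10081).encode("utf-16-be")
```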

syllabary – A system where syllable units (consonant+vowel) are represented by distinct symbols. True syllabaries include cuneiform, Japanese Hiragana/Katakana and Cherokee.

syllabic alphabets – A system where characters are created by combining a base form for a consonant with a mark or symbol for a vowel. Although alphabetic in principle, the combination of base consonant plus vowel means they may be treated differently from other alphabets. Other terms include abugida and alphasyllabary.

T Terms

TCVN (Vietnamese) – One of several encoding standards from the government of Viet Nam. Older versions of TCVN were created just for Vietnamese, but TCVN 6069 is based on Unicode.

TIS-620 (Thai) – A pre-Unicode 8-bit encoding for the Thai script developed by the government of Thailand where TIS stands for "Thai Industrial Standard." Another Thai encoding is Windows-874 developed by Microsoft. Newer documents can be encoded in Unicode.

Thorn – The letter þ used in Icelandic and Old English to represent the "hard th" sound. For many years, Mac users had difficulty with Old English support because this letter was missing in the MacRoman encoding.
Note: Mac System OS X fonts and keyboards now include thorn support.
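The gap in MacRoman can be demonstrated with Python's built-in codecs. This is a sketch only: the can_encode helper is a made-up name, while "mac_roman", "latin-1" and "utf-8" are codec names shipped with standard Python.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

thorn = "\u00FE"  # the letter þ

in_mac_roman = can_encode(thorn, "mac_roman")  # False: no slot for þ
in_latin_1 = can_encode(thorn, "latin-1")      # True: þ is byte xFE
in_utf_8 = can_encode(thorn, "utf-8")          # True: Unicode covers it
```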

Tiger (Macintosh) – A code name for Macintosh system OS X 10.4. The preceding versions were 10.3 (Panther) and 10.2 (Jaguar).

tilde – the "wavy" accent used in Spanish and Portuguese as in õ or ñ.
Note: Only ñ,ã,õ are found in Latin 1. Other vowels with tildes (e.g. ĩ, ũ) are found in other blocks of Unicode. See the Diacritics page to read more about accented vowels.

tonos (Modern Greek) – the single accent used in monotonic Greek spelling; officially it is identical to an acute accent, but it may have other appearances in different fonts. In polytonic Greek, tonos is a generic name for any accent mark.

Traditional Chinese – the older form of the Chinese script, used in Taiwan, Hong Kong, and other locations outside of Mainland China, including various "Chinatowns" in the West. Traditional Chinese characters are more complex and more numerous, but are not as phonetically based as Simplified Chinese (and may be more widely understood by non-Mandarin speaking populations).

transliteration – the process of converting sounds from one script into another script. For instance the Russian word Русский "Russian language" (Cyrillic) would be transliterated as Russkiy in the Latin alphabet.
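At its simplest, transliteration is a character-by-character lookup. The sketch below is illustrative only: the table CYRILLIC_TO_LATIN and the function transliterate are made-up names, and the table covers just the letters of one example word, not a full Russian romanization standard.

```python
# Partial, illustrative table: only the letters needed for the example word.
CYRILLIC_TO_LATIN = {
    "Р": "R", "у": "u", "с": "s", "к": "k", "и": "i", "й": "y",
}

def transliterate(word: str) -> str:
    """Replace each Cyrillic letter with its Latin value from the table."""
    # Characters missing from the table are left unchanged.
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in word)

latin_form = transliterate("Русский")  # "Russkiy"
```

Real transliteration schemes also need context rules (for example, letters that map to two Latin letters), which a plain lookup table like this one does not cover.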

transliterated keyboard – a synonym for QWERTY keyboard

TrueType Font (TTF) – A vector font standard co-developed by Apple and Microsoft in the late 1980s as an alternative to the Adobe Postscript technology. TTF fonts developed for Windows can be used in System OS X Macs, but had to undergo a conversion for Classic Macintosh systems. TrueType is adequate for the Latin and Cyrillic alphabets, but is being superseded by OpenType (OTF) and ATSUI (Apple) for other scripts.
Note: Typographers tended to prefer Postscript because that standard had more ligature and placement control

TSCII (Tamil Standard Code for Information Interchange) – Proposed 8-bit encoding scheme for Tamil begun in 1997. Most experts recommend Unicode encoding whenever possible. Read more about TSCII.

U Terms

UCS-2 – Obsolete. An older way of representing Unicode, used in older versions of Windows, in which each character is represented by four hexadecimal digits (16 bits). Unlike the newer UTF-16, there is no mechanism to access characters above Plane 0, so many developers recommend not using this.

umlaut – the double-dot accent used in languages like German and others as in ö. Also called diaeresis in some languages like Spanish.

Unicode – Unicode, also known as the "Universal Alphabet", is an ordered set of code points with room for over a million characters, covering the majority of writing systems in the world. It is most often encountered in its UTF-8 form. Unlike older systems, Unicode allows multiple writing systems to co-exist in one data file. Systems which recognize Unicode can consistently read and process data from many languages. Read more about Unicode.

Unicode font – a font in which characters are placed in the numeric slot corresponding to their Unicode code point. A large Unicode font can contain thousands of characters for multiple scripts.

Uniscribe – The mechanism used in Windows to determine placement of adjacent characters in OTF fonts.

UTF-8 – the most commonly used form of Unicode for Web pages and e-mail. Each character is translated into a sequence of one to four 8-bit chunks (bytes). An advantage of UTF-8 is that ASCII characters have the same representation in UTF-8 (for instance capital L = #4C in both ASCII and UTF-8).
Note: "UTF" is an abbreviation for Unicode Transformation Format.
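The variable length of UTF-8 and its overlap with ASCII can be seen by encoding a few characters. The snippet is a sketch using Python's built-in str.encode; the sample characters are taken from examples elsewhere in this glossary.

```python
samples = ["L", "\u00F3", "\u221B", "\U00010081"]  # L, ó, ∛, Linear B woman

# Number of 8-bit chunks (bytes) each character needs in UTF-8.
utf8_lengths = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples}

# ASCII characters keep the same single byte in UTF-8.
same_as_ascii = "L".encode("utf-8") == "L".encode("ascii")
capital_l_hex = "L".encode("utf-8").hex().upper()
```

Characters outside ASCII use two, three or four bytes depending on how high their code point is.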

UTF-16 – A version of Unicode in which each character in Plane 0 is represented by four hexadecimal digits (one 16-bit unit). For instance capital L is #004C in UTF-16. UTF-16 comes in Big Endian and Little Endian versions. This is often recommended for databases, but takes up more storage space than UTF-8 for mostly ASCII text. The first two digits represent the Unicode block. For characters above xFFFF (Plane 1 and higher), UTF-16 uses a surrogate code point mechanism to represent these characters.
Note: "UTF" is an abbreviation for Unicode Transformation Format.
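The Big Endian and Little Endian versions differ only in the order of the two bytes inside each 16-bit unit. A small sketch with Python's built-in "utf-16-be" and "utf-16-le" codecs:

```python
capital_l = "L"  # code point #004C

big_endian_hex = capital_l.encode("utf-16-be").hex().upper()     # "004C"
little_endian_hex = capital_l.encode("utf-16-le").hex().upper()  # "4C00"

# Reading bytes with the wrong byte order yields a different character.
misread = capital_l.encode("utf-16-le").decode("utf-16-be")  # U+4C00, not L
```

This is why UTF-16 files often begin with a byte order mark, so that the reader knows which version was used.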

UTF-32 – A version of Unicode in which each character is represented by a fixed 32-bit unit, written as eight hexadecimal digits of which at most six are significant. For instance capital L is #0000004C in UTF-32. The extra digits represent Unicode planes. See image below for how Unicode hexadecimal numbers are organized into planes and blocks.
Note: "UTF" is an abbreviation for Unicode Transformation Format.
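Because UTF-32 is fixed-width, the encoded bytes are simply the code point padded with zeros, which can be checked as follows. A 32-bit unit prints as eight hexadecimal digits, and the leading two are always 00. The sketch uses Python's built-in "utf-32-be" codec.

```python
capital_l_hex = "L".encode("utf-32-be").hex().upper()  # "0000004C"

linear_b_woman = "\U00010081"  # code point #010081 in Plane 1
woman_hex = linear_b_woman.encode("utf-32-be").hex().upper()  # "00010081"

# Every character takes exactly four bytes, whatever its plane.
fixed_width = len("L".encode("utf-32-be")) == len(linear_b_woman.encode("utf-32-be")) == 4

# The four bytes read as one big-endian integer give back the code point.
round_trip = int.from_bytes(linear_b_woman.encode("utf-32-be"), "big") == 0x10081
```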

Shows Linear B Woman Sign (triangle with head) at point Hex #010081 (01.00.81) 01=plane 00=block 81=codepoint in block

V Terms

varia (Greek) – Also spelled βαρεῖα or Bareia (where B = /v/). The Greek term for grave accent. Found only in polytonic Greek.

vector font – a font in which the characters are defined as mathematical shapes or outlines which can be expanded or contracted as needed. These are considered superior to bitmapped fonts because they can be displayed at any size and printed at high resolutions. The newest vector font standards are OpenType (OTF) and ATSUI (Apple). Older standards are TrueType (TTF) and Postscript.

Vietnamese encoding – Although Vietnamese is written in the Latin alphabet, its writing system includes tone marks not covered in Western European Latin 1 encoding. Older documents are encoded in VISCII, VPS, TCVN (Unicode) or Windows-1258, but newer documents are covered by Unicode.
Note: The Unicode specification includes a special Vietnamese character block.

Vista – A version of Windows released in 2007.

Visual Hebrew (ISO-8859-8) – An 8-bit encoding system registered with the ISO without control characters to allow input of characters in proper text order. Developers recommend Logical Hebrew or Unicode because Visual Hebrew requires developers to input the text backward.

vowel mark – for scripts in which letters represent consonants only, vowel mark diacritics can be added to indicate which vowel is used after the consonant. Some scripts like those in South Asia always include vowel marks, but in others like Arabic and Hebrew, they are optional.

vrachy – The Greek term for the breve or short mark. Found only in polytonic Greek.

W Terms

Wade-Giles (Chinese) – the older transliteration system for writing Chinese words in the Latin alphabet. For instance, Peking and Canton are Wade-Giles, but Beijing and Guangdong are the newer pinyin versions. Most specialists today use pinyin to transliterate Mandarin Chinese.

Western Europe (Latin 1) – in internationalization, this refers to languages written in the Roman alphabet and encoded as Latin 1. Western European languages include French, Spanish, German, Scandinavian languages and non-European languages like Swahili and Tagalog whose alphabets are covered by Latin 1 encoding.

Windows 32 – A cover term for recent versions of Windows which use 32-bit processing.

Windows encodings – Refers to a series of encoding standards developed by Microsoft for different scripts. Windows encodings include:

  • Win-1250 (Central European)
  • Win-1251 (Russian)
  • Win-1252 (Western European)
  • Win-1253 (Modern Greek)
  • Win-1254 (Turkish)
  • Win-1255 (Hebrew)
  • Win-1256 (Arabic)
  • Win-1257 (Baltic)
  • Windows-1258 (Vietnamese)
  • Windows-874 (Thai)

Windows-1250 – An 8-bit encoding for Central European languages developed by Microsoft.

Windows-1251 – An 8-bit encoding for Russian developed by Microsoft. It differs significantly from KOI-8 developed by the Soviet Union.
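How much the two Russian encodings differ can be seen by encoding the same word both ways. This is a sketch using Python's built-in "cp1251" (Windows-1251) and "koi8_r" (one member of the KOI-8 family) codecs; the sample word is arbitrary.

```python
word = "Русский"

cp1251_bytes = word.encode("cp1251")
koi8_bytes = word.encode("koi8_r")

# The same text is stored as different byte values in the two encodings.
bytes_differ = cp1251_bytes != koi8_bytes

# Reading KOI-8 bytes as Windows-1251 produces garbled Cyrillic text.
garbled = koi8_bytes.decode("cp1251")
is_garbled = garbled != word

# Each encoding reads its own bytes back correctly.
round_trips = (cp1251_bytes.decode("cp1251") == word
               and koi8_bytes.decode("koi8_r") == word)
```

Both encodings use one byte per letter, so the garbled result has the right length but the wrong letters.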

Windows-1252 – An 8-bit encoding for Western European languages developed by Microsoft. It is similar to the Latin 1 encoding, but includes extra characters not in Latin 1. Windows-1252 is also known as ANSI.
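The extra characters in Windows-1252 sit in the byte range x80 through x9F, which Latin 1 reserves for control codes. A small sketch with Python's built-in "cp1252" and "latin-1" codecs:

```python
euro_byte = b"\x80"
left_quote_byte = b"\x93"

# Windows-1252 assigns printable characters to these bytes.
euro_1252 = euro_byte.decode("cp1252")              # the € sign
left_quote_1252 = left_quote_byte.decode("cp1252")  # the “ curving quote

# Latin 1 reads the same bytes as invisible control characters.
left_quote_latin1 = left_quote_byte.decode("latin-1")  # U+0093

# From xA0 upward the two encodings agree byte for byte.
upper_range = bytes(range(0xA0, 0x100))
upper_range_matches = upper_range.decode("cp1252") == upper_range.decode("latin-1")
```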

Windows-1253 – An 8-bit encoding for Modern (monotonic) Greek developed by Microsoft. It is similar to the ELOT928/ISO-8859-7 encoding, but with a few differences.

Windows-1255 – An 8-bit encoding for Hebrew developed by Microsoft. Other Hebrew encodings include Logical Hebrew (ISO-8859-8-I) and Visual Hebrew (ISO-8859-8). Some developers recommend avoiding Windows-1255 because it replaces key control characters with extra vowel signs.

Windows-1256 – An 8-bit encoding for Arabic developed by Microsoft.

X Terms

No X Terms Yet

Y Terms

ypogegrammeni (Greek) – Also called iota subscript. A subscript diacritic, usually resembling an ogonek or backwards cedille, written beneath some Greek vowels or adjacent letters.

Z Terms

Zip File – a file in which a large file or a folder with multiple files has been compressed into a single .zip file. Zip files are used to deliver fonts and applications to other users.
