The language tag or attribute is used to declare the language of a document or portion of a document. This is meant to assist search engine spiders, page formatting and screen reader technology.
Why Language Tags?
In the online archive world, there are two primary reasons for associating documents with specific languages – to facilitate global technology and to facilitate metadata search in archives. Although both reasons are valid, they are by no means identical. In some situations, one goal may be more important than the other.
Facilitate Global Technology
How do you select the right spell checker to use (French vs. English), the right font (Arabic vs. Urdu), the right way to pronounce c’est la vie (French "Say la vee" vs. English "Sest la v-eye"), or the right set of "Quote Marks" (English) vs. «Quote Marks» (Spanish)?
You can tag documents with a language and program utilities to behave differently depending on the language identified. This allows the same product (e.g. Microsoft Word) to be used with plug-in spell checkers for different languages.
The caveat is that only written languages are usually targeted for these kinds of utilities. For instance, Microsoft has utilities for standard American English and standard British English, but not for spoken varieties such as Brooklyn English. Although a "Brooklyn" spellchecker and "Brooklyn" speech synthesizer could be programmed, many "Brooklyn" native speakers would probably find them condescending and not use them.
Facilitate Metadata Search in Archives
Aside from spellcheckers and speech synthesizers, researchers into specific dialects or historical forms need a way to tag their material into very narrow categories that would be irrelevant to most software vendors.
The caveat here is that a tag may be registered, but only supported by a very narrow range of specialized applications. An example of this would be the need for a Celtic database to distinguish Gaulish (xcg) vs. Celtiberian (xce) – two distinct ancient Celtic languages. On the other hand, it is unlikely that any speech synthesizer will pronounce words from these languages correctly.
When deciding how to tag documents, it may be important to consider whether you are tagging for general usage or for a narrow research purpose.
English (U.K./Great Britain)
In XHTML, the language is declared on the root html element as follows:
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
NOTE: If you are writing in a right-to-left language like Arabic or Hebrew, you should add the dir="rtl" attribute. See the Right Alignment options for more details.
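The sketch below shows one way the language and direction attributes might be combined, assuming a page written mainly in Arabic. The title and paragraph text are placeholders chosen for illustration, not taken from any real page.

```html
<!-- Root element declares both the language (Arabic) and the base direction -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="ar" xml:lang="ar" dir="rtl">
  <head>
    <title>مرحبا</title>
  </head>
  <body>
    <p>مرحبا بالعالم</p>
    <!-- An embedded left-to-right passage overrides language and direction locally -->
    <p lang="en" xml:lang="en" dir="ltr">This paragraph is in English.</p>
  </body>
</html>
```

Setting dir on the root element lets every descendant inherit the right-to-left direction, so only the exceptions need their own attribute.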
Switching Languages in HTML
If you switch languages within one page, you can add the lang attribute to other elements such as <p>, <h1> or <span>. For example:
Foreign Language Test Text
This sentence is in English.
This sentence will be read with a British accent.
Esta frase es en español. (Spanish)
Cette phrase est en français. (French)
Mae’r frawddeg hon yn Cymraeg. (Welsh)
<p>This sentence is in English.</p>
<p lang="en-GB">This sentence will be read with a British accent.</p>
<p><span lang="es">Esta frase es en español.</span> (Spanish)</p>
<p><span lang="fr">Cette phrase est en français.</span> (French)</p>
<p><span lang="cy">Mae’r frawddeg hon yn Cymraeg.</span> (Welsh)</p>
CSS and Language Tags
It is possible to use CSS to format text based on its tagged language. See the CSS and Language Tags page for more information.
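As a rough illustration of the idea, the sketch below uses the standard :lang() pseudo-class and the |= attribute selector. The specific style choices (italics, a colour, guillemets for Spanish quotations) are assumptions made for the example, not recommendations from the linked page.

```html
<style type="text/css">
  /* :lang() matches any element whose language is French,
     whether declared on the element or inherited from an ancestor */
  :lang(fr) { font-style: italic; }

  /* The |= attribute selector matches only elements that carry the
     lang attribute themselves, with value "es" or starting with "es-" */
  [lang|="es"] { color: #800000; }

  /* Use guillemets for quotations inside Spanish text */
  q:lang(es) { quotes: "«" "»"; }
</style>

<p lang="fr">Cette phrase est en français.</p>
<p lang="es-MX">Ella dijo <q>hola</q>.</p>
```

Because :lang() follows inheritance, the <q> element inside the es-MX paragraph is treated as Spanish even though it has no lang attribute of its own.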
Some Common Language Codes
Language codes are primarily taken from the list of ISO-639 language codes. Some common codes, including those for all the languages taught at Penn State, are listed below. For the most part, they are based on the native name (e.g. es for Español, i.e. Spanish).
This language code list has recently been expanded from an older two-letter set to a three-letter set (e.g. "eng" for English). Therefore, some languages (particularly ancient languages) may have only a three-letter code listed.
The By Language pages list the codes for each language, but common codes are listed below.
Commonly Taught Languages
- en: English
- es: Spanish
- fr: French
- it: Italian
- pt: Portuguese
- de: German
- ru: Russian
- ar: Arabic
- zh: Chinese (Mandarin)
- he: Hebrew
- ja: Japanese
- ko: Korean
- sw: Swahili
- grc: Ancient Greek (vs. el: Modern Greek)
- la: Latin
- ang: Old English (Anglo-Saxon)
- enm: Middle English
The following codes diverge significantly from the English name of the language.
- sq: Albanian
- hy: Armenian
- eu: Basque
- nl: Dutch
- ka: Georgian
- gd: Scottish Gaelic
- ga: Modern Irish
- fa: Persian (Farsi)
- bo: Tibetan
- cy: Welsh
Note on Screen Reader Support: Only the most recent versions of JAWS and Home Page Reader support the lang attribute for French, Spanish, Portuguese, German and Finnish. To support other languages, it is recommended that users install plug-ins or screen reader software designed for those languages.
Language Codes for Linguistics or Language Archives
For some situations though (e.g. China, different "varieties" of German), you may need to use older codes if you need them to be recognized by more software packages.
Specifying Language Dialects and Varieties
Language codes can be followed by an optional variety code, but note that not all codes are recognized by all vendors and that the line between "language" and "dialect" can be very fuzzy in some situations.
Until recently, the only way most vendors (e.g. Microsoft or Apple) distinguished varieties of a language was by attaching an ISO-3166 country code after the language code. Although some "country codes" can be linguistically inaccurate, they may be the most standardized.
- en-US: American English
- en-GB: British English
- es-ES: Castilian Spanish (Spain)
- es-MX: Mexican Spanish (often standing in for Latin American Spanish)
See also es-419 for Latin American Spanish
- fr-FR: Parisian French (France)
- fr-CA: Canadian French
- pt-BR: Brazilian Portuguese (standard)
- de-DE: Standard German
- de-CH: Swiss German
- zh-CN: Mandarin Chinese (China) (see also zh-Hans for Simplified Chinese)
- zh-TW: Mandarin Chinese (Taiwan) (see also zh-Hant for Traditional Chinese)
- zh-HK: Cantonese (Hong Kong)
RFC 4646 Tag Syntax
Recently, there has been an attempt to codify other types of regional varieties as part of the RFC 4646 project, but it is still a work in progress. Below are some guidelines for forming different types of varieties, but note that not all of them may be registered.
Check Registry First: Before using any subtag, confirm that it has been registered first in the IANA Language Subtag Registry. Otherwise assume it is a tag only you may be using.
If a language can be written in more than one script, you may need to specify which script is in use. Some script subtags are already implemented in modern software systems such as Windows Vista. Common examples (all of which are registered) include:
- az-Arab – Azerbaijani, Arabic script
- az-Cyrl – Azerbaijani, Cyrillic script
- az-Latn – Azerbaijani, Latin script
- bs-Cyrl – Bosnian, Cyrillic script
- bs-Latn – Bosnian, Latin script
- zh-Hans – Simplified Chinese script
- zh-Hant – Traditional Chinese script
Many languages written in multiple scripts have IANA-registered script variants, but not all of them do. If a script variant for your language does not exist, you can still combine the language code with a four-letter script subtag from the ISO 15924 list.
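A minimal sketch of script subtags in markup, using the two registered Chinese tags from the list above. The sample sentences are illustrative ("This is Simplified Chinese." and "This is Traditional Chinese.").

```html
<!-- Same language (Chinese), two different scripts -->
<p lang="zh-Hans">这是简体中文。</p>
<p lang="zh-Hant">這是繁體中文。</p>
```

Software that understands the script subtag can use it to pick an appropriate font for each paragraph. Software that does not will most likely fall back to treating both as plain zh.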
By Numeric World Region
If a regional variety is larger than a country, then it is recommended that region codes from the U.N. Numeric Macroregions List be used. The most prominent example is probably:
- es-419: Pan Latin American Spanish (Registered)
Another theoretical example could be en-021 (American and Canadian English), although this variant is NOT registered.
If you need a code not registered with the IANA, you can create new codes following suggested guidelines, but you may need to add an x- prefix to indicate that it is unregistered.
Anyone can request that a new variant code be registered, but the process is lengthy.
Dialects within a Country
RFC 4646 permits codes to be combined. So if you need to specify the Baltimore dialect of English, you could create a code such as:
- en-US-Baltimore (theoretical)
Please note that no regional varieties from the United States are registered with the IANA (and only three from Britain).
Thus you can either use the code x-en-US-Baltimore to indicate it is not registered or just en-US-Baltimore, depending on your needs. Most software packages would very likely interpret the string as just en-US.
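One caveat on syntax: RFC 4646 limits each subtag to eight characters and places private-use subtags after an x singleton, so a strictly well-formed version of the theoretical Baltimore tag might look like the sketch below. The label baltimor is made up for this example and is not registered anywhere.

```html
<!-- "x" introduces private-use subtags; "baltimor" is a hypothetical label
     shortened to fit the eight-character subtag limit -->
<p lang="en-US-x-baltimor">Sample text in the Baltimore variety of English.</p>
```

Keeping the registered en-US portion at the front means that software which ignores the private-use part can still fall back to American English.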
RFC 4646 does not specify how to indicate time periods within a particular language, but some registered codes indicate the dates when spelling changes in a language were enacted. Some examples include:
- de-1901 – German, traditional spelling
- de-1996 – German, post-1996 spelling reform
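As a small illustration, the two registered German variants could be used to mark the same sentence in both orthographies. The sentence ("I know that it is raining") is chosen only because the word daß/dass is one of the best-known spellings affected by the reform.

```html
<!-- Traditional orthography: "daß" -->
<p lang="de-1901">Ich weiß, daß es regnet.</p>
<!-- Reformed orthography: "dass" -->
<p lang="de-1996">Ich weiß, dass es regnet.</p>
```

A spell checker that recognizes these variants could, in principle, apply the matching rule set to each paragraph.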
ISO-639 Two-Letter Language Codes
Use these codes if they are available for your language.
ISO-639-3 Codes for Linguists and Archivists
Use these codes if you cannot find an appropriate code for your language in the lists above.