When you create a Web page with Unicode characters, it is recommended that you include the following character meta tag:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
...
</head>

If the page is XHTML, the tag must also be self-closing, with a final “/” before the closing bracket.

The idea behind this tag is to force the reader's browser into the correct encoding and prevent the display of Roman character gibberish. Sometimes, though, you can post a properly formatted UTF-8 Web page (meta tag and all) and still see gibberish.

In this case the problem is not you, but the Web server, typically one running Apache. If it’s an American server, Apache is probably set up to deliver ONLY ISO-8859-1 encoding. Even though your file has the UTF-8 data in it, the server labels it as Latin 1 in the HTTP header, and that header takes precedence over the meta tag in the page (hence the Latin 1 gibberish).
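To see where the gibberish comes from, it can help to reproduce it by hand. The following is a minimal sketch in Python, not something from the original setup: the function name `misread_as_latin1` is hypothetical, and it assumes the browser decodes the bytes strictly as ISO-8859-1.

```python
def misread_as_latin1(text: str) -> str:
    """Simulate a UTF-8 page being decoded as Latin 1 (ISO-8859-1).

    The text is encoded to UTF-8 bytes, as stored in the file, and those
    bytes are then decoded with the wrong charset, as a browser would do
    if the server labeled the page ISO-8859-1.
    """
    utf8_bytes = text.encode("utf-8")
    return utf8_bytes.decode("latin-1")


def repair(garbled: str) -> str:
    """Reverse the mistake: recover the bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    garbled = misread_as_latin1("é")
    print(garbled)          # the two-character sequence shown instead of é
    print(repair(garbled))  # the original character again
```

Each non-ASCII character takes two or more bytes in UTF-8, and each of those bytes is shown as a separate Latin 1 character, which is why one accented letter turns into a short run of gibberish.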

There are three possible solutions available when this happens:

Talk to Your Server Admin

And when you do, you can politely suggest changing the httpd.conf file as documented on Seapine Software. You can also comment that most modern Web apps are set to serve UTF-8 data including CMS programs such as Plone, Movable Type and Drupal. Others such as Facebook and Twitter support UTF-8 natively.

I believe this is what a Web service having this issue did recently.
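Before contacting the admin, it may be worth confirming what charset the server actually declares. The sketch below is one way to do that in Python using only the standard library; the function names `declared_charset` and `server_charset` are hypothetical, and the second one assumes the page is reachable over plain HTTP or HTTPS.

```python
from email.message import Message
from typing import Optional
from urllib.request import urlopen


def declared_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lowercase, or None if the header has none.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def server_charset(url: str) -> Optional[str]:
    """Fetch a page and report the charset named in its HTTP header."""
    with urlopen(url) as response:
        # response.headers is an email.message.Message subclass,
        # so the same charset lookup is available on it directly.
        return response.headers.get_content_charset()


if __name__ == "__main__":
    print(declared_charset("text/html; charset=ISO-8859-1"))
```

If `server_charset` reports `iso-8859-1` for a page you saved as UTF-8, the header is the likely culprit, and that result is a concrete detail to pass along to the admin.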

Use an .htaccess file to just configure specific directories and pages

If you’re comfortable enough to mess around with changing your directory preferences, you can try this suggestion from Tex Texin about using AddType statements.

The main proviso here is that an .htaccess file can do some serious damage unless you are careful. It’s possible that you may not be able to upload one into your directory because of this, but it could be a good solution to suggest to a server admin if only your directory is affected and the rest of the site has to be encoded differently.

Unicode Escape Codes

If neither of the above solutions is available, then you can deliver the content within any encoding…if you encode the “exotic” characters as Unicode numeric escape codes.

For example, if your site is Latin 1 but you need to present Russian content, you can change your code from

Русский

to

&#1056;&#1091;&#1089;&#1089;&#1082;&#1080;&#1081;

As you can imagine, this IS an absolute last-resort solution. If you ever need to transfer content between systems, you will have many more problems with escape codes (HTML named entities are not defined in plain XML, and Microsoft Word will not interpret escape codes in pasted text). Not to mention the difficulty of replacing each character with its Unicode numeric equivalent. Escape codes were really only meant for short passages of text.

But…if this is where you are, then you can try either the old Mozilla Composer, which converted anything you typed into escape codes, or maybe another utility. Truthfully, converting raw UTF-8 text to HTML entity codes is an extremely difficult problem these days.
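For readers who have a scripting language at hand, the conversion itself can be sketched in a few lines. This is a minimal Python sketch, not a tool mentioned in this post; the function name `to_numeric_refs` is hypothetical, and it relies on Python's built-in `xmlcharrefreplace` error handler, which replaces every character outside the target encoding with a decimal numeric reference.

```python
def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal escape code.

    ASCII characters pass through unchanged; anything else becomes
    &#NNNN;, where NNNN is the character's Unicode code point in decimal.
    """
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_numeric_refs("Русский"))
    # → &#1056;&#1091;&#1089;&#1089;&#1082;&#1080;&#1081;
```

Note that this only handles the text itself. It does not escape markup characters such as `<` or `&`, so it is meant for passages of content rather than whole HTML files.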

So I emphasize that this is a rare problem and should be easily corrected by your server admin…and if it’s a personal Web site, you may want to think about alternative providers.

Or you could try the ultimate last resort – attack of the angry Unicode expert.

Post Script (Apr 3, 2009)

A student in a recent seminar pointed out a site which does convert a character to a decimal code reference at http://www-atm.physics.ox.ac.uk/user/iwi/charmap.html (from Alan Iwi at the Rutherford Lab at Oxford). Just enter or paste the character and click the Make HTML button to see a decimal entity code. You can also enter an entire string of characters.
