Unicode hasn’t been part of my life much recently, but it did emerge in a very unexpected way this week during a calendar upgrade.

One of the conversion tasks was for us to add group e-mail addresses so we could share calendars with each other efficiently. But when I tried to copy and paste one, I got a “not found” error. Here is one of these addresses (altered for security reasons):

umg–sc.foo.staff@fuyu.ucal.psu.edu

Can you spot the problem? (HINT: Try copying and pasting it into a text file.)

Given up? The problem is the hyphen. In the right font, you will see that it’s not a plain hyphen (U+002D or ASCII #45), but the more elegant and slightly longer en dash, which is U+2013 (not in ASCII). As many of you know, many databases still compare text character by character, so a hyphen is just not the same as an en dash. This means searching is a FAIL.
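To make the difference visible regardless of font, here is a minimal Python sketch that prints the code point and Unicode name of each non-alphanumeric character in a string. The sample strings and the function name describe_chars are my own illustrations, not anything from the calendar system.

```python
import unicodedata

def describe_chars(text):
    """Return (char, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]

# Hypothetical samples: the first uses a plain hyphen, the second an en dash.
for sample in ("umg-sc", "umg\u2013sc"):
    for ch, code, name in describe_chars(sample):
        if not ch.isalnum():
            print(code, name)
# prints:
# U+002D HYPHEN-MINUS
# U+2013 EN DASH
```

The two strings look nearly identical on screen, but the names make it obvious that they are different characters.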

How did the en dash get in there if it’s outside of ASCII? My guess is that it’s the result of an auto-correct feature in Word, which makes some formatting tweaks to enhance visual appeal. One is to change plain hyphens into the slightly longer en dash (more favored by typographers).

Another common change is to convert plain straight quotes (" at U+0022 or ASCII #34) to “Smart Quotes” like (“ at U+201C) and (” at U+201D).
Copying HTML attributes from Word can be similarly dangerous, since HTML recognizes plain quotes around attribute values but NOT fancy double quotes. Most of the time the change is harmless, but when the text is fed to a system that expects the ASCII characters, the reformatting makes a difference in a very annoying way.
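If the text is headed for a system that only understands the ASCII characters, one option is to map the fancy characters back before pasting. This is only a sketch: the table below covers just the three characters mentioned above, and the names ASCII_EQUIVALENTS and to_plain_punctuation are hypothetical.

```python
# Map the "autocorrected" characters discussed above back to ASCII.
# This table is deliberately small; a real one would likely need more entries.
ASCII_EQUIVALENTS = str.maketrans({
    "\u2013": "-",   # en dash -> hyphen (U+002D)
    "\u201C": '"',   # left smart quote -> straight quote (U+0022)
    "\u201D": '"',   # right smart quote -> straight quote (U+0022)
})

def to_plain_punctuation(text):
    """Replace en dashes and smart double quotes with ASCII equivalents."""
    return text.translate(ASCII_EQUIVALENTS)

# Hypothetical HTML fragment as it might come out of Word.
pasted = "<a href=\u201Cpage.html\u201D>"
print(to_plain_punctuation(pasted))
# prints: <a href="page.html">
```

Blindly converting is not always right (body text may really want the en dash), so this is best applied only to things like addresses, identifiers, and code.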

How to catch it? In some cases, you can change the font, but many fonts make the hyphen and en dash appear identical (Arggh!). That leaves the old standby (test, test, test) plus some Unicode awareness (which is increasing among programmers).
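When the font won’t help, a quick script can do the spotting. Here is one possible check, assuming the value is supposed to be pure ASCII (which won’t be true for every system); find_non_ascii is a made-up helper name.

```python
def find_non_ascii(text):
    """Return (index, char, code point) for every non-ASCII character."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# Hypothetical address fragment containing an en dash at position 3.
suspects = find_non_ascii("umg\u2013sc.foo.staff")
for index, ch, code in suspects:
    print(f"position {index}: {code}")
# prints: position 3: U+2013
```

An empty list means the string is safe to paste into an ASCII-only field.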
