Outlook, Unicode, and Copy-Paste

It seemed straightforward: the customer told me they wanted the system to generate an email, and that they would send me the content (a parameterized one-liner). I copy/pasted it into the application source code as a string and parameterized it as specified. I tested the email, and it worked correctly.

Aside: At this point, someone will be saying the email content should have been placed as a resource or template file, and not a string in the source code. Such a claim is correct, but the email content is very short, so it wasn’t worth the extra effort. Also, it would have had no impact on this particular situation.

Having tested it myself, I commit’d the change and sent it to the user for testing. What happened next was a bit … unexpected.

Hi Steve,

It appears we are getting some special characters instead of the – for the Pending Timesheets Subject text. Not sure if we could encode this?

IMPORTANT – Timesheet Not Created – will effect pay

The customer interpreted the issue as a failure to HTML encode the content, except that:

  1. The dash doesn’t need to be HTML-encoded.
  2. I did HTML-encode the message and subject line, although the subject line doesn’t support HTML, so a failed encoding would show up as an explicit HTML entity not a random-seeming sequence of characters.

What it does look like is a Unicode representation issue. Was the dash somehow coming across as a multi-byte Unicode character to a client that didn’t support Unicode?

I tested again, and I wasn’t able to reproduce the problem, but I suspected I knew why it happened. A little investigation confirms that during typing, Outlook autoreplaces hyphen with en-dash when used in a stand-alone way.

Although the hyphen is a normal ASCII symbol (0x2D), the en-dash is represented by multi-byte Unicode code-point (0x2013, to be precise), so if anything non-Unicode compliant receives it, it will render as a sequence of characters.

When someone (the customer or one of his associates) typed the desired phrase “IMPORTANT – Timesheet Not Created … ” into Outlook, Outlook automatically replaced the hyphen with an en-dash, then I just copy/pasted that directly into my code (rather than retyping it, as Visual Studio does no such replacement).

How could I confirm my suspicion, since I couldn’t reproduce the issue myself? I used a hex editor (the 80s are back, baby!) to examine the actual content of the subject line from my source code. You can see in this hex screen the difference between what I copied from Outlook (top) and manually using the hyphen (bottom):

Hex comparison

Anyways the fix is easy: manually replace the en-dash with a regular hyphen.

I sent the email to myself, first with the original subject line (bottom) and then again with the hyphens. If you look closely, you can see that the dash looks different.

Subject line comparison

The hyphens should work fine in any situation.

What I never did figure out is … in this highly homogeneous Outlook-centric corporate environment, who was using a non-Unicode compliant email system and for what purpose?