Accepting user input: Beware of fullwidth characters

If your application accepts user input, you should be ready to handle all kinds of “unexpected” data. This holds especially when the application faces the open internet, so you should always canonicalise (normalise) user input before processing it. Canonicalisation converts all the different representations of a piece of data into one standard form, which simplifies further processing. Usual operations include normalising character case, trimming spaces from the beginning and end, and removing duplicated spaces between words. But there are other aspects we should take into consideration, and one of them is that the same character can have different representations.

Unicode halfwidth and fullwidth forms

If you look through the Unicode characters, you will notice that some of them appear to be duplicated, although they look different (how different depends on the application used to display the text). The lines below are written using halfwidth and fullwidth forms:

Hello World!
Ｈｅｌｌｏ　Ｗｏｒｌｄ！
Hello World！

The second line uses fullwidth forms and you should easily see the difference. The third line, on the other hand, uses only one fullwidth character (！) and the difference is far less noticeable.

The problem with these two forms is that an equality check will treat all three strings as different, which can lead to duplicated data. It becomes even trickier when our data is passed to other systems, such as databases or web services. For example, SQL Server compares strings according to the column's collation, and under a width-insensitive collation it treats the fullwidth and halfwidth forms of a character as equal, so inserting such a “different” string into a uniquely indexed column fails with an error stating that a duplicate entry already exists. Even worse, the data is stored in its original form, so entries in non-canonical form can end up in the database.
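To make the mismatch concrete, here is a minimal C# sketch that compares the plain string with the one that ends in a fullwidth exclamation mark. The class name WidthDemo and the helper LastCodePoint are made up for illustration, and the fullwidth character is written as a \u escape so that it survives copy and paste.

```csharp
using System;

public static class WidthDemo
{
    // Ordinary (ASCII) text.
    public const string Plain = "Hello World!";

    // Same text, but the exclamation mark is the fullwidth form U+FF01.
    public const string FullwidthBang = "Hello World\uFF01";

    // Hypothetical helper: code point of the last character, formatted as U+XXXX.
    public static string LastCodePoint(string s) =>
        "U+" + ((int)s[s.Length - 1]).ToString("X4");

    public static void Main()
    {
        // The == operator compares strings ordinally, so the two values differ.
        Console.WriteLine(Plain == FullwidthBang);        // False

        // Ignoring case does not help either: the width difference is not a case difference.
        Console.WriteLine(string.Equals(Plain, FullwidthBang,
            StringComparison.OrdinalIgnoreCase));         // False

        Console.WriteLine(LastCodePoint(Plain));          // U+0021
        Console.WriteLine(LastCodePoint(FullwidthBang));  // U+FF01
    }
}
```

The two strings look almost identical on screen, yet they end in different code points, which is why the comparison fails.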

If we want to avoid potential problems, our canonicalisation function has to determine whether two strings are the same no matter which form they use.
Fortunately, the String class contains some handy methods to help us with canonicalisation.

String.Normalize() and String.IsNormalized()

To find out whether a given string is in canonical form we can use String.IsNormalized(), and to canonicalise a given string we use the String.Normalize() method. Both methods have overloads accepting a NormalizationForm enumeration value, which allows us to pick the normalisation form we want to use. In our case, for Latin characters, we can use either of the compatibility forms, KC or KD, which replace fullwidth characters with their ordinary (halfwidth) counterparts. In addition, these forms also normalise the different ways a character can be composed: KC produces precomposed characters where possible, while KD leaves them decomposed into a base character plus combining marks.[1]
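The difference between the forms can be seen in a short sketch. The class name NormalisationFormsDemo and the sample string are assumptions chosen for illustration, and the expected results in the comments assume the runtime's default globalization settings.

```csharp
using System;
using System.Text;

public static class NormalisationFormsDemo
{
    // "é" as one precomposed code point (U+00E9), followed by a fullwidth "A" (U+FF21).
    public const string Sample = "\u00E9\uFF21";

    public static void Main()
    {
        // The parameterless overloads use Form C, which leaves fullwidth forms alone.
        Console.WriteLine(Sample.IsNormalized());                          // True
        Console.WriteLine(Sample.IsNormalized(NormalizationForm.FormKC));  // False

        string kc = Sample.Normalize(NormalizationForm.FormKC);
        string kd = Sample.Normalize(NormalizationForm.FormKD);

        // KC: "é" stays composed and the fullwidth letter becomes an ordinary "A".
        Console.WriteLine(kc == "\u00E9A");   // True
        Console.WriteLine(kc.Length);         // 2

        // KD: "é" is split into "e" plus a combining acute accent (U+0301).
        Console.WriteLine(kd == "e\u0301A");  // True
        Console.WriteLine(kd.Length);         // 3
    }
}
```

Either form works for comparisons as long as both sides use the same one. KC tends to give shorter strings, while KD makes it easier to strip combining marks afterwards if that is ever needed.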

You can read more about string normalisation in the “Using Unicode Normalization to Represent Strings” article. Below is a sample method to canonicalise a given text:

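using System.Text;
using System.Text.RegularExpressions;

public string ToCanonical(string text)
{
   // Fold compatibility variants (such as fullwidth forms) into their standard characters.
   var normalisedText = text.IsNormalized(NormalizationForm.FormKD)
      ? text
      : text.Normalize(NormalizationForm.FormKD);

   // Culture-invariant lowercasing keeps the result independent of the server locale.
   var lowercaseTrimmedText = normalisedText.Trim().ToLowerInvariant();

   // Collapse every run of two or more spaces into a single space.
   var spaceDeduplicatedText = Regex.Replace(lowercaseTrimmedText, " {2,}", " ");

   return spaceDeduplicatedText;
}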
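As a usage sketch, the canonical form can serve as a key for spotting duplicates before the data reaches a database. The class name CanonicalKeyDemo and the sample inputs are hypothetical, and the class carries its own stand-alone variant of the canonicalisation steps so that the example compiles by itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class CanonicalKeyDemo
{
    // Stand-alone variant of the canonicalisation steps described above.
    public static string ToCanonical(string text)
    {
        string normalised = text.Normalize(NormalizationForm.FormKD);
        string lowered = normalised.Trim().ToLowerInvariant();
        return Regex.Replace(lowered, " {2,}", " ");
    }

    public static void Main()
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        string[] inputs =
        {
            "Hello World!",
            "  hello   world!  ",
            // "Hello World!" written entirely in fullwidth forms, with U+3000 as the space.
            "\uFF28\uFF45\uFF4C\uFF4C\uFF4F\u3000\uFF37\uFF4F\uFF52\uFF4C\uFF44\uFF01",
        };

        foreach (string input in inputs)
        {
            // Add returns false when the canonical key is already in the set.
            bool isNew = seen.Add(ToCanonical(input));
            Console.WriteLine(isNew);   // True, then False, then False
        }
    }
}
```

All three inputs collapse to the same key, so only the first one counts as new.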
