Stripping Accents from Strings in C#

Unicode defines a concept called normalization (Unicode, Wikipedia), which establishes when the composed and decomposed representations of a character are equivalent.

In .NET, the string.Normalize() method converts strings between normalization forms. If a string is in normalization form NormalizationForm.FormKD (full compatibility decomposition), the combining and modifying marks are stored as separate characters, and their Unicode category can be retrieved by calling the CharUnicodeInfo.GetUnicodeCategory() method.
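To make the decomposition visible, here is a minimal sketch that normalizes a string and reports the Unicode category of each resulting character. The class name NormalizationDemo and the method name Categories are made up for this illustration; only string.Normalize() and CharUnicodeInfo.GetUnicodeCategory() come from the framework.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class NormalizationDemo
{
    // Normalizes s to the given form and returns the Unicode category
    // of every UTF-16 code unit in the normalized string.
    public static UnicodeCategory[] Categories(string s, NormalizationForm form)
    {
        string normalized = s.Normalize(form);
        var result = new UnicodeCategory[normalized.Length];
        for (int i = 0; i < normalized.Length; i++)
        {
            result[i] = CharUnicodeInfo.GetUnicodeCategory(normalized[i]);
        }
        return result;
    }
}
```

For the precomposed character "\u00E9" (é), Categories("\u00E9", NormalizationForm.FormKD) should return two entries, LowercaseLetter for the base letter e and NonSpacingMark for the combining acute accent U+0301, whereas NormalizationForm.FormC keeps the single precomposed character.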

Thus, to strip the accents from the characters of a string, one has to perform the following steps:

  • Normalize the string into full compatibility decomposition
  • Remove the characters belonging to a “Mark” category
  • Return the result

Here is the C# code implementing this function:

using System.Globalization;
using System.Text;

public static string StripAccents(string s)
{
  StringBuilder sb = new StringBuilder(s.Length);

  // Decompose first, so that each accent becomes a separate mark character
  foreach (char c in s.Normalize(NormalizationForm.FormKD))
  {
    switch (CharUnicodeInfo.GetUnicodeCategory(c))
    {
      // Skip characters belonging to one of the "Mark" categories
      case UnicodeCategory.NonSpacingMark:
      case UnicodeCategory.SpacingCombiningMark:
      case UnicodeCategory.EnclosingMark:
        break;

      // Keep everything else
      default:
        sb.Append(c);
        break;
    }
  }

  return sb.ToString();
}
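
As a usage sketch, the variant below follows the same steps as StripAccents but takes the normalization form as a parameter, which makes the effect of compatibility decomposition easy to compare with canonical decomposition. The class name AccentStripper, the method name Strip, and the form parameter are assumptions made for this example and are not part of the original function.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical wrapper class, for illustration only.
public static class AccentStripper
{
    // Decomposes s using the given form (expected to be FormD or FormKD)
    // and drops every character in one of the "Mark" categories.
    public static string Strip(string s, NormalizationForm form = NormalizationForm.FormKD)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s.Normalize(form))
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.NonSpacingMark:
                case UnicodeCategory.SpacingCombiningMark:
                case UnicodeCategory.EnclosingMark:
                    break; // drop the mark

                default:
                    sb.Append(c); // keep base characters
                    break;
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, AccentStripper.Strip("Crème brûlée") should give "Creme brulee". Because FormKD is a compatibility decomposition, it also rewrites compatibility characters: the ligature "\uFB01" (ﬁ) becomes the two letters "fi", while passing NormalizationForm.FormD leaves the ligature untouched. Letters that have no decomposition, such as "ø", pass through unchanged with either form.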

One Response to Stripping Accents from Strings in C#

  1. Sam says:

    Thanks for this. It really helped me. With a lot of confusing stuff out there this made it very simple.
