Unicode defines a concept called normalization (Unicode, Wikipedia) that establishes the equivalence of composed and decomposed representations of characters.
In .NET, the string.Normalize() method can be used to convert strings between normalization forms. If a string is in normalization form NormalizationForm.FormKD (full compatibility decomposition), the combining marks (accents and other diacritics) are stored as separate characters, and their Unicode category can be retrieved by calling the CharUnicodeInfo.GetUnicodeCategory() method.
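To make the decomposition concrete, here is a minimal sketch that normalizes a single precomposed "é" and prints the category of each resulting character. The class name NormalizationDemo and the helper Describe are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class NormalizationDemo
{
    // Hypothetical helper: lists every UTF-16 char of s with its Unicode category.
    public static string Describe(string s)
    {
        var parts = new List<string>();
        foreach (char c in s)
            parts.Add($"U+{(int)c:X4} {CharUnicodeInfo.GetUnicodeCategory(c)}");
        return string.Join(", ", parts);
    }

    static void Main()
    {
        string composed = "\u00E9"; // "é" as a single precomposed character
        string decomposed = composed.Normalize(NormalizationForm.FormKD);

        Console.WriteLine(Describe(composed));   // U+00E9 LowercaseLetter
        Console.WriteLine(Describe(decomposed)); // U+0065 LowercaseLetter, U+0301 NonSpacingMark
    }
}
```

After decomposition the base letter "e" and the combining acute accent are two separate characters, which is what makes it possible to filter the accent out by category.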
Thus, to strip the accents from the characters of a string, one has to perform the following steps:
- Normalize the string into full compatibility decomposition
- Remove the characters belonging to a “Mark” category
- Return the result
Here is the C# code implementing this function:
using System.Text;
using System.Globalization;

public string StripAccents(string s)
{
    StringBuilder sb = new StringBuilder();

    // Decompose first, so that each accent becomes a separate character
    foreach (char c in s.Normalize(NormalizationForm.FormKD))
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            // Skip every character that belongs to a "Mark" category
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.SpacingCombiningMark:
            case UnicodeCategory.EnclosingMark:
                break;

            // Keep everything else
            default:
                sb.Append(c);
                break;
        }

    return sb.ToString();
}
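A short usage sketch may help show what the choice of normalization form implies. The helper StripMarks below is a hypothetical variant of StripAccents that takes the form as a parameter (it is not part of the function above), so that FormKD can be compared with the canonical-only FormD. The class name AccentDemo is likewise an assumption.

```csharp
using System;
using System.Globalization;
using System.Text;

static class AccentDemo
{
    // Hypothetical variant of StripAccents with the decomposition form as a parameter.
    public static string StripMarks(string s, NormalizationForm form)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s.Normalize(form))
        {
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
            bool isMark = cat == UnicodeCategory.NonSpacingMark
                       || cat == UnicodeCategory.SpacingCombiningMark
                       || cat == UnicodeCategory.EnclosingMark;
            if (!isMark)
                sb.Append(c); // keep only characters that are not marks
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // "Crème brûlée"
        Console.WriteLine(StripMarks("Cr\u00E8me br\u00FBl\u00E9e", NormalizationForm.FormKD)); // Creme brulee

        // "ﬁancé" written with the U+FB01 "fi" ligature
        Console.WriteLine(StripMarks("\uFB01anc\u00E9", NormalizationForm.FormKD)); // fiance
        Console.WriteLine(StripMarks("\uFB01anc\u00E9", NormalizationForm.FormD));  // ligature kept, accent removed

        // "Søren": U+00F8 has no decomposition, so it passes through unchanged
        Console.WriteLine(StripMarks("S\u00F8ren", NormalizationForm.FormKD));
    }
}
```

Two points follow from this. FormKD also replaces compatibility characters such as ligatures with their plain equivalents, which may or may not be wanted. Letters like "ø" are not a base letter plus a combining mark in Unicode, so this approach leaves them as they are.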
Thanks for this. It really helped me. With a lot of confusing stuff out there this made it very simple.