What is the Longest String Resulting from a Normalized Single Unicode Code Point

Introduction

This article deals with Unicode Normalization, Composition, and Decomposition. If you are new to the topic, please read the introduction.

Summary

Unicode defines 4 Normalization Forms (source: Unicode):

Form                          Description
Normalization Form D (NFD)    Canonical Decomposition
Normalization Form C (NFC)    Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD)  Compatibility Decomposition
Normalization Form KC (NFKC)  Compatibility Decomposition, followed by Canonical Composition

Depending on the chosen form, normalization may result in different sequences of Unicode code points that are still considered equivalent according to Unicode (source: Wikipedia):

Amélie with its two canonically equivalent Unicode forms (NFC and NFD):

NFC character    A    m    é         l    i    e
NFC code point   0041 006d 00e9      006c 0069 0065
NFD code point   0041 006d 0065 0301 006c 0069 0065
NFD character    A    m    e    ◌́    l    i    e
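
The table can be reproduced in a few lines of C#. This is only a minimal sketch: the class `AmelieDemo` and the helper `ToCodePoints` are names made up for illustration and are not part of any library.

```csharp
using System;
using System.Linq;
using System.Text;

static class AmelieDemo
{
    // Hypothetical helper: formats each UTF-16 code unit as four hex digits.
    // That is sufficient here because all characters involved are in the BMP.
    public static string ToCodePoints(string s) =>
        string.Join(" ", s.Select(c => ((int)c).ToString("x4")));

    static void Main()
    {
        var nfc = "Am\u00e9lie".Normalize(NormalizationForm.FormC);
        var nfd = nfc.Normalize(NormalizationForm.FormD);

        Console.WriteLine("NFC " + ToCodePoints(nfc)); // NFC 0041 006d 00e9 006c 0069 0065
        Console.WriteLine("NFD " + ToCodePoints(nfd)); // NFD 0041 006d 0065 0301 006c 0069 0065

        // The two strings differ code unit by code unit (== is ordinal) ...
        Console.WriteLine(nfc == nfd); // False
        // ... but normalizing both to the same form makes them equal.
        Console.WriteLine(nfc == nfd.Normalize(NormalizationForm.FormC)); // True
    }
}
```

The last two lines show why normalization matters in practice: comparing strings is only meaningful after both sides have been brought into the same form.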

Experiment

Let's iterate through all code points in the BMP and collect the lengths of their normalizations (counted in UTF-16 code units) in all forms:

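  using System;
  using System.Collections.Generic;
  using System.Linq;
  using System.Text;

  foreach (var nf in new[] { NormalizationForm.FormC, NormalizationForm.FormD,
    NormalizationForm.FormKC, NormalizationForm.FormKD })
  {
    // Maps a normalized length to the characters that produce it.
    // Key 0 collects the characters for which Normalize() throws.
    var chars = new Dictionary<int, List<string>>();
    chars.Add(0, new List<string>());

    // Every UTF-16 code unit value, i.e. every code point in the BMP.
    for (var i = 0; i < 65536; i++)
    {
      var s = new string((char)i, 1);

      try
      {
        var l = s.Normalize(nf).Length;

        if (!chars.ContainsKey(l))
          chars.Add(l, new List<string>());

        chars[l].Add(s);
      }
      catch (ArgumentException)
      {
        // Invalid input, such as a lone surrogate, cannot be normalized.
        chars[0].Add(s);
      }
    }

    Console.WriteLine(nf.ToString());
    foreach (var kv in chars.OrderBy(d => d.Key))
    {
      Console.WriteLine("length " + kv.Key + " count " + kv.Value.Count);
    }
  }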

This results in the following output, where length 0 counts the characters for which Normalize threw an exception:

FormC
length 0 count 2082
length 1 count 63376
length 2 count 76
length 3 count 2
FormD
length 0 count 2082
length 1 count 51258
length 2 count 1167
length 3 count 10993
length 4 count 36
FormKC
length 0 count 2082
length 1 count 62290
length 2 count 692
length 3 count 401
length 4 count 52
length 5 count 15
length 6 count 2
length 8 count 1
length 18 count 1
FormKD
length 0 count 2082
length 1 count 50151
length 2 count 1750
length 3 count 11412
length 4 count 108
length 5 count 16
length 6 count 14
length 7 count 1
length 8 count 1
length 18 count 1

Wait, 18??

To find out what’s going on, I rewrote the code as a WinForms application to get support for Unicode fonts.

As it turns out, the longest sequences depending on the selected Normalization Form are:

FormC

nothing truly spectacular, just two code points resulting in length 3

שּׁ (U+fb2c): 3, ש (U+05e9), ּ (U+05bc), ׁ (U+05c1)
שּׂ (U+fb2d): 3, ש (U+05e9), ּ (U+05bc), ׂ (U+05c2)

FormD

length 4 for some characters in the Greek Extended Block.
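
One such character is U+1F82 (GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI), which fully decomposes into alpha followed by three combining marks (U+03B1, U+0313, U+0300, U+0345). The following sketch lists all BMP code points with a given normalized length; the class `LongestNfd` and the method `WithLength` are hypothetical names, and the exact list depends on the Unicode version the runtime supports.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class LongestNfd
{
    // Returns the BMP code points whose normalization in form nf has the
    // given length (in UTF-16 code units). Surrogates are skipped because a
    // lone surrogate is not a valid string.
    public static List<int> WithLength(NormalizationForm nf, int length)
    {
        var result = new List<int>();
        for (var i = 0; i < 0x10000; i++)
        {
            if (char.IsSurrogate((char)i)) continue;
            try
            {
                if (((char)i).ToString().Normalize(nf).Length == length)
                    result.Add(i);
            }
            catch (ArgumentException)
            {
                // Assumption: depending on the runtime, further code points
                // may be rejected as invalid; they are simply ignored here.
            }
        }
        return result;
    }

    static void Main()
    {
        foreach (var cp in WithLength(NormalizationForm.FormD, 4))
            Console.WriteLine("U+" + cp.ToString("x4"));
    }
}
```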

FormKC

length 4, 5, 6: compatibility decomposition of Roman numerals, parenthesized numbers, and Japanese and Latin SQUARE abbreviations (units of measurement and the like)

Ⅷ (U+2167): 4, V (U+0056), I (U+0049), I (U+0049), I (U+0049)
⑽ (U+247d): 4, ( (U+0028), 1 (U+0031), 0 (U+0030), ) (U+0029)
㌳ (U+3333): 4, フ (U+30d5), ィ (U+30a3), ー (U+30fc), ト (U+30c8)

length 8: ‘ARABIC LIGATURE JALLAJALALOUHOU’ (U+FDFB)
ﷻ (U+fdfb): 8, ج (U+062c), ل (U+0644),   (U+0020), ج (U+062c), ل (U+0644), ا (U+0627), ل (U+0644), ه (U+0647)

length 18: ‘ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM’ (U+FDFA)
ﷺ (U+fdfa): 18, ص (U+0635), ل (U+0644), ى (U+0649),   (U+0020), ا (U+0627), ل (U+0644), ل (U+0644), ه (U+0647),   (U+0020), ع (U+0639), ل (U+0644), ي (U+064a), ه (U+0647),   (U+0020), و (U+0648), س (U+0633), ل (U+0644), م (U+0645)
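
Expansions like the ones listed above can also be printed as plain code points on the console, which avoids the need for a Unicode-capable font. This is a minimal sketch and not the WinForms code mentioned earlier; the class `NormalizationDump` and the helper `Describe` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class NormalizationDump
{
    // Hypothetical helper: normalizes a single code point and lists the
    // resulting code points as "U+xxxx", stepping over surrogate pairs.
    public static string Describe(int codePoint, NormalizationForm nf)
    {
        var normalized = char.ConvertFromUtf32(codePoint).Normalize(nf);
        var parts = new List<string>();

        for (var i = 0; i < normalized.Length;
             i += char.IsSurrogatePair(normalized, i) ? 2 : 1)
        {
            parts.Add("U+" + char.ConvertToUtf32(normalized, i).ToString("x4"));
        }

        return "U+" + codePoint.ToString("x4") + ": " + parts.Count + ", "
            + string.Join(", ", parts);
    }

    static void Main()
    {
        // The single BMP code point with NFKC length 18.
        Console.WriteLine(Describe(0xfdfa, NormalizationForm.FormKC));
        // The single BMP code point with NFKC length 8.
        Console.WriteLine(Describe(0xfdfb, NormalizationForm.FormKC));
    }
}
```

Unlike `string.Length`, the helper counts code points rather than UTF-16 code units. For the characters discussed here the two numbers agree, because every code point involved lies in the BMP.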

FormKD

length 4, 5, 6: again Greek, parenthesized numerals and Hangul, and SQUARE Japanese and Latin abbreviations

length 7: ‘PARENTHESIZED KOREAN CHARACTER OJEON’ (U+321D)

length 8 and 18: the same two Arabic ligatures as under FormKC
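
The length 7 for U+321D follows from how Hangul behaves under decomposition. The character expands to an opening parenthesis, the syllables 오 and 전, and a closing parenthesis. The decomposed forms then split each precomposed syllable arithmetically into its jamo (2 for 오, 3 for 전), giving 1 + 2 + 3 + 1 = 7, whereas FormKC recomposes the syllables and ends up at 4. The sketch below implements that arithmetic with the constants from the Hangul syllable decomposition in the Unicode Standard; the class `HangulDemo` and the method `Decompose` are hypothetical names.

```csharp
using System;
using System.Text;

static class HangulDemo
{
    // Constants from the Unicode Standard (Hangul syllable decomposition).
    const int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    const int VCount = 21, TCount = 28, SCount = 19 * VCount * TCount; // 11172

    // Decomposes one precomposed Hangul syllable into 2 or 3 jamo.
    public static string Decompose(char syllable)
    {
        var sIndex = syllable - SBase;
        if (sIndex < 0 || sIndex >= SCount)
            throw new ArgumentException("Not a precomposed Hangul syllable.");

        var l = (char)(LBase + sIndex / (VCount * TCount));
        var v = (char)(VBase + sIndex % (VCount * TCount) / TCount);
        var tIndex = sIndex % TCount;

        // tIndex 0 means "no trailing consonant": 2 jamo instead of 3.
        return tIndex == 0
            ? new string(new[] { l, v })
            : new string(new[] { l, v, (char)(TBase + tIndex) });
    }

    static void Main()
    {
        Console.WriteLine(Decompose('\uC624').Length); // 2 (오)
        Console.WriteLine(Decompose('\uC804').Length); // 3 (전)
        Console.WriteLine("\u321D".Normalize(NormalizationForm.FormKC).Length); // 4
        Console.WriteLine("\u321D".Normalize(NormalizationForm.FormKD).Length); // 7
    }
}
```

The same arithmetic accounts for most of the length 3 results under FormD and FormKD: of the 11172 precomposed syllables, 19 × 21 × 27 = 10773 have a trailing consonant and therefore decompose into three jamo.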

PS

Software often implements more features than described in standards. Have a look at this SO question on the abuse of COMBINING marks to see one result, and my explanation.

ด้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้дด็็็็็้้้้้็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้
