Analyzing Combining Unicode Characters

March 23, 2014

Some scripts supported by the Unicode standard define combining characters, which may cause confusion for people not familiar with a specific script. Several Stack Exchange questions arose from exactly this kind of confusion.

For one of these questions, I even analyzed the character sequence manually. But this analysis is not much fun if you have to perform it repeatedly.

Recently I stumbled upon the SE user name n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳, so I had the idea to write a small program to output the Unicode code points and character names for a given input, based on Unicode’s UnicodeData.txt file.
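
The idea is simple enough to sketch in a few lines of C#. Note that this is not the actual source of the program, and the location of UnicodeData.txt (here: the working directory) is an assumption:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

class Unispell
{
    static void Main(string[] args)
    {
        // UnicodeData.txt is semicolon-separated:
        // field 0 is the code point in hex, field 1 the character name
        var names = new Dictionary<int, string>();
        foreach (var line in File.ReadLines("UnicodeData.txt"))
        {
            var fields = line.Split(';');
            names[int.Parse(fields[0], NumberStyles.HexNumber)] = fields[1];
        }

        // spell out the first command line argument, code point by code point
        var input = args[0];
        for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
        {
            var cp = char.ConvertToUtf32(input, i);
            string name;
            names.TryGetValue(cp, out name);
            Console.WriteLine("U+{0:X4} {1}", cp, name ?? "(unknown)");
        }
    }
}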

The output for the text samples in the questions mentioned above looks like this:

[Screenshots: unispell output for the four text samples and for the user name n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳]

The initial version of this program is available for download here.


Updating YuJisho: a Unicode CJK Character web dictionary

January 17, 2014

I deployed my first version of YuJisho nearly 4 years ago, and, having developed more and more MVC applications since then, I felt it was time to migrate the original ASP.Net application to ASP.Net MVC.

ASP.Net allowed (supported?) really messy code, so the challenges for an MVC migration are:

  • Extract business logic from the presentation layer to the business layer
  • Re-write the markup from asp: controls to native HTML
  • Re-write postbacks as HttpPost actions (both <form> posts and Ajax requests; see the sketch below)
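
To illustrate the last point, here is a hypothetical sketch of such a rewrite; the controller, action, and service names are made up and are not the actual YuJisho code:

using System.Web.Mvc;

// hypothetical service interface standing in for the business layer
public interface IDictionaryService
{
    object FindCharacters(string query);
}

public class LookupController : Controller
{
    private readonly IDictionaryService dictionaryService;

    public LookupController(IDictionaryService dictionaryService)
    {
        this.dictionaryService = dictionaryService;
    }

    [HttpPost]
    public ActionResult Search(string query)
    {
        // the business logic now lives in the business layer, not in the page
        var results = dictionaryService.FindCharacters(query);

        // Ajax requests receive a partial view, regular form posts the full page
        if (Request.IsAjaxRequest())
            return PartialView("_Results", results);

        return View(results);
    }
}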

The layout also got a facelift using basic Bootstrap (version 2) styling, but the UI migration is not yet complete.

The data remains unchanged, containing Unicode 5.2, but an upgrade to Unicode 6.3 and the latest dictionary data is in the pipeline.

Enjoy browsing and searching 😉


Different Strings considered Equal – Depending on your Collation

June 3, 2013

I recently came across this interesting question on SO:

Why does the statement

select case when  N'ܐܪܡܝܐ' = N'አማርኛ' then 1 else 0 end

return 1?

In case you are wondering, the first string seems to be Syriac/Aramaic for Aramaic, and the second string is Amharic for Amharic, as found on Wikipedia.

Obviously, the strings are not equal, but still the result of the comparison is true. But why?

The reason is, of course, that string comparison and sorting in SQL Server depend on the collation in use. And the collation used in the example was SQL_Latin1_General_CP1_CI_AS, which does not distinguish Aramaic and Amharic characters.

If you just want to test whether two strings are equal as in “equal Code Points” or “binary equal”, you need to select a specific collation which distinguishes the code points, such as Latin1_General_BIN:

select case 
    when N'ܐܪܡܝܐ' COLLATE Latin1_General_BIN = N'አማርኛ' COLLATE Latin1_General_BIN 
    then 1 else 0 end

So how can we analyze how collations affect comparison?

First, I created two tables: One stores records about character groups, the second stores the characters in each group that are considered equal by the collation:

CREATE TABLE [dbo].[CollationString](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Collation] [varchar](100) NOT NULL,
    [String] [nvarchar](50) NOT NULL,
    [StringInt] [int] NULL,
 CONSTRAINT [PK_CollationString] PRIMARY KEY CLUSTERED ([Id] ASC))
GO

CREATE TABLE [dbo].[CollationStringEquivalent](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [CollationStringId] [int] NOT NULL,
    [String] [nvarchar](50) NOT NULL,
    [StringInt] [int] NULL,
 CONSTRAINT [PK_CollationStringEquivalent] PRIMARY KEY CLUSTERED ([Id] ASC))
GO

I added a field Collation so that different collations can be compared later on.

Next, we iterate through all characters in the Unicode BMP, look up each character, and insert records according to the result:

set nocount on 

declare @c int = 0
declare @coll nvarchar(50) = 'Latin1_General_CI_AS'
declare @cid int

while @c < 65536 begin

    print @c    -- progress output

    -- look up an existing character group this character compares equal to
    set @cid = null;
    select @cid = id 
        from collationstring 
        where collation collate Latin1_General_CI_AS 
            = @coll collate Latin1_General_CI_AS 
        and string = nchar(@c);
    
    -- no group found: this character starts a new group
    if @cid is null begin
        insert into collationstring(collation, string, stringint) 
            values (@coll, NCHAR(@c), @c);
        select @cid = SCOPE_IDENTITY()
    end
        
    -- record the character as a member of its group
    insert into collationstringequivalent(collationstringid, string, stringint) 
        values (@cid, NCHAR(@c), @c);

    set @c = @c + 1
end

After running this script, we can now query which characters are considered different

SELECT  * FROM [CollationString]

and which characters are equal

SELECT  * 
FROM [CollationStringEquivalent]
ORDER BY [CollationStringId], [StringInt]

To analyze the distribution of characters in one collation, we can start with a query like this

SELECT COUNT([Id]), [CollationStringId], MIN([String]), MIN([StringInt])
FROM [CollationStringEquivalent]
GROUP BY collationstringid
ORDER BY COUNT(id) DESC, MIN(stringint)


What is the Longest String Resulting from a Normalized Single Unicode Code Point?

June 3, 2013

Introduction

This article deals with Unicode Normalization, Composition, and Decomposition. If you are new to the topic, please read the introduction first.

Summary

Unicode defines 4 Normalization Forms (source: Unicode):

Form                            Description
Normalization Form D (NFD)      Canonical Decomposition
Normalization Form C (NFC)      Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD)    Compatibility Decomposition
Normalization Form KC (NFKC)    Compatibility Decomposition, followed by Canonical Composition

which may result in different sequences of Unicode code points, depending on the chosen normalization, that are nevertheless considered equal according to the Unicode standard (source: Wikipedia):

Amélie with its two canonically equivalent Unicode forms (NFC and NFD):

NFC character    A    m    é    l    i    e
NFC code point   0041 006d 00e9 006c 0069 0065
NFD code point   0041 006d 0065 0301 006c 0069 0065
NFD character    A    m    e    ◌́    l    i    e
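
A quick check in C# confirms this equivalence (a minimal sketch; the string literals use the escaped code points from the table above):

using System;
using System.Text;

class CanonicalEquivalence
{
    static void Main()
    {
        var nfc = "Am\u00e9lie";   // é precomposed as U+00E9
        var nfd = "Ame\u0301lie";  // e followed by U+0301 COMBINING ACUTE ACCENT

        Console.WriteLine(nfc == nfd);                                    // False: different code units
        Console.WriteLine(nfc.Normalize(NormalizationForm.FormD) == nfd); // True
        Console.WriteLine(nfd.Normalize(NormalizationForm.FormC) == nfc); // True
    }
}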

Experiment

Let's iterate through all code points in the BMP and collect the lengths of their normalizations in all four forms:

foreach (var nf in new[] { NormalizationForm.FormC, NormalizationForm.FormD,
  NormalizationForm.FormKC, NormalizationForm.FormKD })
{
  // map: length of the normalized string -> characters producing that length;
  // key 0 collects characters that cannot be normalized at all
  var chars = new Dictionary<int, List<string>>();
  chars.Add(0, new List<string>());

  for (var i = 0; i < 65536; i++)
  {
    var s = new string((char)i, 1);

    try
    {
      var l = s.Normalize(nf).Length;

      if (!chars.ContainsKey(l))
        chars.Add(l, new List<string>());

      chars[l].Add(s);
    }
    catch (ArgumentException)
    {
      // Normalize() rejects invalid code points (e.g. unpaired surrogates)
      chars[0].Add(s);
    }
  }

  Console.WriteLine(nf.ToString());
  foreach (var kv in chars.OrderBy(d => d.Key))
  {
    Console.WriteLine("length " + kv.Key + " count " + kv.Value.Count);
  }
}

This results in the output

FormC
length 0 count 2082
length 1 count 63376
length 2 count 76
length 3 count 2
FormD
length 0 count 2082
length 1 count 51258
length 2 count 1167
length 3 count 10993
length 4 count 36
FormKC
length 0 count 2082
length 1 count 62290
length 2 count 692
length 3 count 401
length 4 count 52
length 5 count 15
length 6 count 2
length 8 count 1
length 18 count 1
FormKD
length 0 count 2082
length 1 count 50151
length 2 count 1750
length 3 count 11412
length 4 count 108
length 5 count 16
length 6 count 14
length 7 count 1
length 8 count 1
length 18 count 1

Wait, 18??

To find out what’s going on, I rewrote the code as a WinForms application to get support for Unicode fonts.

As it turns out, the longest sequences depending on the selected Normalization Form are:

FormC

Nothing truly spectacular, just two code points resulting in length 3:

שּׁ (U+fb2c): 3, ש (U+05e9), ּ (U+05bc), ׁ (U+05c1)
שּׂ (U+fb2d): 3, ש (U+05e9), ּ (U+05bc), ׂ (U+05c2)

FormD

Length 4 for some characters in the Greek Extended block.

FormKC

Length 4, 5, 6: decomposition of Roman numerals, parenthesized numbers, and Japanese and Latin SQUARE abbreviations (such as units of measurement):

Ⅷ (U+2167): 4, V (U+0056), I (U+0049), I (U+0049), I (U+0049)
⑽ (U+247d): 4, ( (U+0028), 1 (U+0031), 0 (U+0030), ) (U+0029)
㌳ (U+3333): 4, フ (U+30d5), ィ (U+30a3), ー (U+30fc), ト (U+30c8)

Length 8: ‘ARABIC LIGATURE JALLAJALALOUHOU’ (U+FDFB)
ﷻ (U+fdfb): 8, ج (U+062c), ل (U+0644),   (U+0020), ج (U+062c), ل (U+0644), ا (U+0627), ل (U+0644), ه (U+0647)

Length 18: ‘ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM’ (U+FDFA)
ﷺ (U+fdfa): 18, ص (U+0635), ل (U+0644), ى (U+0649),   (U+0020), ا (U+0627), ل (U+0644), ل (U+0644), ه (U+0647),   (U+0020), ع (U+0639), ل (U+0644), ي (U+064a), ه (U+0647),   (U+0020), و (U+0648), س (U+0633), ل (U+0644), م (U+0645)

FormKD

Length 4, 5, 6: again Greek, parenthesized numerals and Hangul, and SQUARE Japanese and Latin abbreviations

Length 7: ‘PARENTHESIZED KOREAN CHARACTER OJEON’ (U+321D)

Lengths 8 and 18: same as under FormKC

PS

Software often implements more features than described in standards. Have a look at this SO question on the abuse of COMBINING marks to see one result, and my explanation.

ด้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้дด็็็็็้้้้้็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้
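
If you want to check such a string yourself, a small C# sketch (my own illustration, not taken from the linked answer) can count the combining marks:

using System;
using System.Globalization;
using System.Linq;

class CombiningCount
{
    static void Main()
    {
        // paste a suspicious string (e.g. the sample above) as input
        var sample = Console.ReadLine();
        var marks = sample.Count(c =>
            CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark);
        Console.WriteLine("{0} of {1} UTF-16 code units are non-spacing marks",
            marks, sample.Length);
    }
}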


Writing “Funny Characters”? Make sure you use the correct Encoding!

January 14, 2013

A user of SMOscript sent me a bug report:

Some of our stored procedures have apostrophes in comments. If we export our DB and enable the “if not exists” option, our scripts become invalid, as the commented apostrophe ends the string.

He suspected SMOscript might encode the SP code incorrectly, but I made sure that was not the case, so I asked for a (redacted) sample of such a stored procedure.

I added the SP to a database, and in SSMS ran Generate Script on the procedure both with and without the IF NOT EXISTS option. In “plain” mode this simply generates the CREATE PROCEDURE statement, whereas in the IF NOT EXISTS case it will generate the IF NOT EXISTS check, and then create the SP using sp_executesql and dynamic SQL (and some people do not like this very much).

Next, I ran SMOscript with the -i switch, which activates the IF NOT EXISTS mode:

>smoscript.exe -s localhost -d mydatabase -o fooSP -i > fooSP.sql

When I opened the generated file in SSMS, indeed, the lines that originally contained an apostrophe now contained a quote which ended the string. But I also noticed that other quote characters had been correctly escaped using 2 quotes.

Then it struck me: the email mentioned apostrophes, but what I saw here was quotes!

I opened the original file in Notepad++ in hex mode, and there it was: e2 80 99, the UTF-8 encoding for U+2019, formally named RIGHT SINGLE QUOTATION MARK, but apostrophe among friends 😉

Given its code point, it is obvious that this character is not in the ASCII character set, so SMOscript has to generate Unicode or UTF-8 output.

Fortunately, this functionality is already built-in: use -T for Unicode, or -U for UTF-8 encoding:

>smoscript.exe -s localhost -d mydatabase -o fooSP -i -U -f .\fooSP.sql

Note that I used the -f switch to write the output directly to a file, since the console might not be able to display Unicode characters (or it might, and I’m just not smart enough 😉). Anyway, directly creating the file from SMOscript frees you from the hassles of piping “funny” characters.

In hindsight, what happened was this: SMOscript wrote the output to the console (which was then piped to a file), and the console, using its native code page, could not correctly output the apostrophe character and replaced it with the quote character, thus breaking the string parameter for sp_executesql.

You should always be aware that if you translate between code pages, not every character can be mapped to a character in the target code page, and some mappings will cause difficulties.
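
You can observe this kind of lossy mapping directly from C#. The following sketch uses code page 437 merely as an example; the exact substitute character depends on the code page’s best-fit table:

using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        var apostrophe = "\u2019";  // RIGHT SINGLE QUOTATION MARK
        var oem = Encoding.GetEncoding(437);  // an example OEM console code page
        var bytes = oem.GetBytes(apostrophe);
        // the encoder's best-fit mapping substitutes a "close enough" character;
        // for U+2019 this is typically the plain quote 0x27
        Console.WriteLine("0x{0:X2} -> {1}", bytes[0], oem.GetString(bytes));
    }
}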

Fortunately, this problem was quite easy to solve.


“Why can’t I store/display [insert name] language data in SQL Server?”

September 17, 2012

I came across this and similar questions so many times on Stack Overflow:

My guess is that most of these questions are due to a lack of understanding of Unicode and of how the various components of an application must work together to display Unicode text correctly.

Storage

If you store Unicode data in the database, you have to use the Unicode data types NVARCHAR(n), NVARCHAR(MAX), NTEXT (obsolete), or NCHAR (fixed-length strings). Unicode data is stored in SQL Server using the UCS-2 encoding, meaning every character occupies 2 bytes, unless the database is created using a supplementary characters (_SC) collation, which causes the database to use the UTF-16 encoding (the same encoding for the BMP, but with support for all Unicode planes).

Except for the _SC collations, defining a collation does not have an influence on how Unicode data is stored.

In general, a collation defines how text data is compared and sorted. The collation can be set on database level (default for all text columns), on column level, and on operand level (i.e. the value of a comparison or ORDER BY clause) using the COLLATE clause.

Unicode Values

Unicode values can either be variables (or parameters) declared with the Unicode data types listed above, or Unicode string literals using the N'...' notation. Most of the time, the questions above arise from either writing literals without the N prefix, or from believing that the N must somehow be attached to variable names.

Let’s look at some examples:

  • SELECTing string literals

SELECT N'тест', 'тест'

returns

тест    ????

  • Assigning (non-)Unicode values

declare @nvc1 nvarchar(50), @nvc2 nvarchar(50)
declare @vc1 varchar(50), @vc2 varchar(50)

set @nvc1 = N'тест'
set @nvc2 = 'тест'
set @vc1 = N'тест'
set @vc2 = 'тест'

print @nvc1
print @nvc2
print @vc1
print @vc2

prints

тест
????
????
????

The Unicode value only “survives” if every assignment is Unicode-enabled.

  • Storing from C#

.Net by definition supports Unicode. If you store data into or retrieve data from a Unicode-enabled column using either SqlCommand or a typical DAL/ORM component, you should not even notice that you are dealing with Unicode rather than plain ASCII.

Just be aware that in the case of SqlCommand, you should use parameterized statements and SqlParameter rather than concatenating the complete statement including parameter values, as sketched below. (Parameterized statements are also the standard way to avoid SQL injection.)
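
A minimal sketch of such a parameterized insert; the connection string, table, and column names are made up for illustration:

using System;
using System.Data;
using System.Data.SqlClient;

class UnicodeInsert
{
    static void Main()
    {
        // connection string, table and column names are made up for illustration
        var connectionString = "Server=localhost;Database=mydatabase;Integrated Security=true";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "INSERT INTO MyTable (SomeUnicodeColumn) VALUES (@text)", conn))
        {
            // SqlDbType.NVarChar sends the value as Unicode; no N'...' prefix
            // is needed because the value never becomes part of a SQL literal
            cmd.Parameters.Add("@text", SqlDbType.NVarChar, 50).Value = "тест";
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}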

Displaying Unicode data

Next problem: You successfully stored Unicode data, but SELECTing in SSMS only returns lots of question marks or boxes. The reason is simple: SSMS cannot display the characters.

But what is actually necessary to display Unicode data?

  • a Unicode font covering the characters you want to retrieve
  • a font rendering engine as part of the operating system capable of rendering the character string correctly (especially for right-to-left languages, and some South-East Asian languages)
  • an application that makes use of both

I know from my own experience that SSMS 2005 was by default not capable of displaying some Unicode characters (sorry, I can’t remember which) even though the fonts were installed. It probably had to do with the fonts defined for the editor or the result pane.

In fact, SSMS 2008 on Windows 7 does not have any problems in the results pane, and can even display supplementary characters.

If you still have problems displaying Unicode characters, make sure to check whether the data is stored correctly by analyzing the numerical values of the Unicode string using the UNICODE function:

declare @t nvarchar(4000)
select @t = SomeUnicodeColumn from MyTable where ...
select unicode(substring( @t, 1, 1))
select unicode(substring( @t, 2, 1))
select unicode(substring( @t, 3, 1))
select unicode(substring( @t, 4, 1))

If the values are greater than 127 (or 255, depending on the language), your data is stored as Unicode.


Unicode Versions supported in SQL Server (2000-2012)

May 30, 2012

Update to a previous post on Unicode support in SQL Server.

Which version of the Unicode Standard is supported by MS SQL Server?

SQL Server Version    Unicode Version
2000                  3.0
2005                  3.2
2008                  5.0
2008 R2               5.0
2012                  5.0*

* The documentation on SQL Server 2012 does not explicitly refer to a Unicode version number. Even if the supported version did not change, SQL Server 2012 introduced native support for supplementary characters.