Length of UTF-8 VARCHAR in SQL Server

Foreword

For as long as I can remember, a VARCHAR (or CHAR) was always defined as “1 character equals 1 byte”. Different character sets (code pages) were implemented as COLLATIONs, so that you had basic database support for internationalization.

Then came Unicode, and we got NVARCHAR strings (or NCHAR), where the rule was “1 character equals 2 bytes”, and we could store any text from around the world without bothering with code pages, encodings, etc. The .Net framework brought us the string class with similar features and the world was beautiful.

Then, in 2001, came Unicode 3.1 and needed more space:

For the first time, characters are encoded beyond the original 16-bit codespace or Basic Multilingual Plane (BMP or Plane 0). These new characters, encoded at code positions of U+10000 or higher, are synchronized with the forthcoming standard ISO/IEC 10646-2. For further information, see Article IX, Relation to 10646. Unicode 3.1 and 10646-2 define three new supplementary planes.

SQL Server 2012 added support for these additional planes: using an *_SC collation, characters stored in NVARCHARs could be 2 or 4 bytes each.
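The storage effect of the supplementary planes can be illustrated outside SQL Server; here is a minimal Python sketch (standard library only) showing UTF-16 byte counts, which is what NVARCHAR stores per character:

```python
# NVARCHAR stores UTF-16 code units: BMP characters take 2 bytes,
# supplementary-plane characters (U+10000 and above) take a 4-byte surrogate pair.
for ch in ["ö", "囗", "𠀋"]:
    utf16 = ch.encode("utf-16-le")  # little-endian, without BOM
    print(f"U+{ord(ch):04X}: {len(utf16)} bytes")
```

The first two characters are in the BMP and take 2 bytes each; U+2000B takes 4.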

In C#, the StringInfo class handles supplementary planes, but it seems they are still a bit behind:

Starting with the .NET Framework 4.6.2, character classification is based on The Unicode Standard, Version 8.0.0. For the .NET Framework 4 through the .NET Framework 4.6.1, it is based on The Unicode Standard, Version 6.3.0. In .NET Core, it is based on The Unicode Standard, Version 8.0.0.

(For the record, the current Unicode version is 12.1, and 13.0 is going to be released soon)

UTF-8 Collations

So now SQL Server 2019 supports UTF-8-enabled collations.

A question on SO quoted the documentation as

A common misconception is to think that with CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n) the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored

(emphasis mine) which confused me a little bit, and the quote continues

The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n.

(emphasis mine).
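The distinction between n bytes and n characters is easy to demonstrate outside SQL Server; a quick Python sketch of the UTF-8 byte counts for characters from different Unicode ranges:

```python
# In a UTF-8 collation, one character occupies 1 to 4 bytes,
# so varchar(50) means 50 bytes, not necessarily 50 characters.
for ch in ["a", "ö", "囗", "𠀋"]:
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} bytes in UTF-8")
```

These four characters take 1, 2, 3, and 4 bytes respectively.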

This got me investigating, so I had a closer look at the issue. I created a UTF-8-enabled database with a table containing all kinds of VARCHAR and NVARCHAR columns

CREATE DATABASE [test-sc] COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8

CREATE TABLE [dbo].[UTF8Test](
  [Id] [int] IDENTITY(1,1) NOT NULL,
  [VarcharText] [varchar](50) COLLATE Latin1_General_100_CI_AI NULL,
  [VarcharTextSC] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
  [VarcharUTF8] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL,
  [NVarcharText] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS NULL,
  [NVarcharTextSC] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
  [NVarcharUTF8] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL
)

I inserted test data from various Unicode ranges

INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('a','a','a','a','a','a')
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('ö','ö','ö',N'ö',N'ö',N'ö')
-- U+56D7
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'囗',N'囗',N'囗',N'囗',N'囗',N'囗')
-- U+2000B
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋')

then selected the lengths and data lengths of each text field

SELECT TOP (1000) [Id]
    ,[VarcharText],[VarcharTextSC],[VarcharUTF8]
    ,[NVarcharText],[NVarcharTextSC],[NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
    ,LEN([VarcharText]) VT,LEN([VarcharTextSC]) VTSC
    ,LEN([VarcharUTF8]) VU
    ,LEN([NVarcharText]) NVT,LEN([NVarcharTextSC]) NVTSC
    ,LEN([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
    ,DATALENGTH([VarcharText]) VT,DATALENGTH([VarcharTextSC]) VTSC
    ,DATALENGTH([VarcharUTF8]) VU
    ,DATALENGTH([NVarcharText]) NVT,DATALENGTH([NVarcharTextSC]) NVTSC
    ,DATALENGTH([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]

Select Lengths.png

I was surprised to find that the old mantra “a VARCHAR only stores single byte characters” needs to be revised when using UTF8 collations.

Table data only

Note that collations are associated only with table columns, not with T-SQL variables, as you cannot declare a collation on a variable

DECLARE @VarcharText varchar(50), @NVarcharText nvarchar(50)

SELECT @VarcharText = [VarcharText],@NVarcharText = [NVarcharText]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharTextSC], @NVarcharText = [NVarcharTextSC]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharUTF8], @NVarcharText = [NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)

 

Select Variable Lengths.png

 

Updating YuJisho: a Unicode CJK Character web dictionary

My online CJK dictionary YuJisho got a facelift again – this time, from ASP.Net MVC3 to the current MVC 5, and from Bootstrap 2 to Bootstrap 4.

I hope this gets rid of the messages in the Google Search Console 😉 :

  • Text too small to read
  • Clickable elements too close together
  • Viewport not set

There is one little change though: in times of GDPR & co., queries to Wikipedias and Wiktionaries need to be invoked by clicking the “Query Wikipedia” button, rather than being run automatically.


click the button “Query Wikipedia”

results in links to various Wikipedias containing an article

If your browser / operating system fails to display certain Chinese characters, there is now a button “Load Glyphs” which tries to load the unsupported characters’ images as .svg from GlyphWiki.

after load glyphs

after load glyphs u

Please check the About page for more information.

Emoticon Selection in ASP.Net MVC

I had to implement a good/bad/neutral selection UI in an ASP.Net MVC application.

This simple task turned out to be rather cumbersome, given different browsers implementing different things differently, and the MVC framework restricting access to generated markup.

I first thought about a DropDownList (i.e. a <select> element) rendering a set of emoji, only to find

  • you cannot pass a C# emoji literal as SelectListItem.Text:
    new SelectListItem { Text = "\u1F60A" }
  • you cannot add the emoji as an HTML entity (&#128522;) as Text (at least not in Chrome)
  • you cannot add a @class parameter to SelectListItem
  • you cannot add a background-image:url() definition to an <option>
  • you cannot add HTML tags (such as <img>) inside an <option>
However, I found code to overcome the limitations of SelectListItem, either by copying and extending code from the MS MVC framework, or by XML manipulation of the built-in HTML generator.

Maybe the dropdown list was just the wrong path to follow, so I searched for ways to style radio buttons, and found this nice demo on CodePen.

I modified the demo to fit my needs, and voilà, here’s my first codepen:

A radiobutton-based emoji selector

Update: Apparently, IE does not render an <img> inside a <label> clickable, unless you specify pointer-events: none.

Analyzing Combining Unicode Characters

Some scripts supported by the Unicode standard define combining characters, which may cause confusion for people not familiar with a specific script.

For one of these questions, I even analyzed the character sequence manually. But this analysis is not much fun if you have to perform it repeatedly.

Recently I stumbled upon the SE user name n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳, so I had the idea to write a small program to output the Unicode code points and character names for a given input, based on Unicode’s UnicodeData.txt file.
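My program reads UnicodeData.txt directly; a similar lookup is built into Python’s unicodedata module, so a minimal sketch of the same idea looks like this:

```python
import unicodedata

def spell(text):
    """One line per code point: U+XXXX plus the Unicode character name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}" for c in text]

# 'é' written as base letter plus combining accent:
for line in spell("e\u0301"):
    print(line)
```

For the combining sequence above this prints LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.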

The output for the text samples in the above links looks like this:

unispell 1

unispell 2

unispell 3

unispell 4

unispell nhahtdh

The initial version of this program is available for download here.

Updating YuJisho: a Unicode CJK Character web dictionary

I deployed the first version of YuJisho nearly 4 years ago, and, as I have developed more and more MVC applications since then, I felt it was time to migrate the original ASP.Net application to ASP.Net MVC.

ASP.Net allowed (supported?) really messy code, so the challenges for an MVC migration are:

  • Extract business logic from the presentation layer to the business layer
  • Re-write the markup from ASP: controls to use native HTML
  • Re-write postbacks as HttpPost actions (both <form> and Ajax requests)

The layout also got a facelift using basic Bootstrap (version 2) styling, but the UI migration is not yet complete.

The data remains unchanged, containing Unicode 5.2, but an upgrade to Unicode 6.3 and the latest dictionary data is in the pipeline.

Enjoy browsing and searching 😉

ASCII(), UNICODE() and the Replacement Character

This is a follow-up article on my previous post about string equality and collations.

We know, simply from looking at them, that the characters 'ܐ' and 'አ' are different, but the collation you use may treat them as equal.

Well, we can still compare their code point value by using the UNICODE() function, as in

select unicode(N'ܐ'), unicode(N'አ')

returning 1808 and 4768.
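The same comparison works in any Unicode-aware language; in Python, for instance, ord() plays the role of UNICODE():

```python
# ord() returns the code point of a single character,
# just like SQL Server's UNICODE() for the first character of a string.
print(ord("ܐ"), ord("አ"))  # 1808 4768
```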

The reason I write this is because I discovered a fallacy in a comment on SO, resulting from mixing Unicode and non-Unicode literals and functions.

Take the statement

select ascii('ܐ'), ascii('አ')

Note that the Unicode characters are not given as Unicode strings (N'…' notation), but as non-Unicode strings.

Since both characters cannot be mapped onto Latin ASCII (or whatever your collation is), they are replaced by a Replacement Character, which is the question mark ‘?’ in ASCII.

Wikipedia tells us

The question mark character is also often used in place of missing or unknown data. In Unicode, it is encoded at U+003F ? question mark (HTML: &#63;).

and

In many web browsers and other computer programs, “?” is used to show a character not found in the program’s character set. […] Some fonts will instead use the Unicode Replacement Glyph (U+FFFD, �), which is commonly rendered as a white question mark in a black diamond (see replacement character).

So we can see where the question mark comes from, and thus both functions return 63.

In a similar, but nonetheless different case

select ascii(N'ܐ'), ascii(N'አ')

the characters are defined as Unicode strings, but passed to a function that only accepts non-Unicode strings. In this case, the mapping according to the current collation is performed by the ASCII() function, again resulting in the value 63.
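Both cases can be reproduced with an explicit code-page conversion; a Python sketch (using cp1252 as a stand-in for a Latin1 collation’s code page):

```python
# Characters without a mapping in the target code page are replaced by '?',
# which is why ASCII() returns 63 for them.
for ch in ["ܐ", "አ"]:
    replaced = ch.encode("cp1252", errors="replace")
    print(replaced, replaced[0])  # b'?' 63
```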

As for the Unicode Replacement Character, you will encounter it if you decode a byte array to Unicode and the decoder encounters a byte sequence that cannot be converted given the selected encoding.
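In Python, for example, such a failed decode surfaces as U+FFFD when you use the 'replace' error handler:

```python
# 0xFF can never start a valid UTF-8 sequence, so the decoder
# substitutes the Unicode Replacement Character U+FFFD.
decoded = b"\xff".decode("utf-8", errors="replace")
print(decoded, hex(ord(decoded)))  # � 0xfffd
```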

Different Strings considered Equal – Depending on your Collation

I recently came across this interesting question on SO:

Why does the statement

select case when  N'ܐܪܡܝܐ' = N'አማርኛ' then 1 else 0 end

return 1?

In case you are wondering, the first string seems to be Syriac/Aramaic for Aramaic, and the second string is Amharic for Amharic, as found on Wikipedia.

Obviously, the strings are not equal, but still the result of the comparison is true. But why?

The reason is, of course, that string comparison and sorting in SQL Server depend on the collation in use. And the collation used in the example was SQL_Latin1_General_CP1_CI_AS, which does not distinguish Aramaic and Amharic characters.

If you just want to test whether two strings are equal as in “equal Code Points” or “binary equal”, you need to select a specific collation which distinguishes the code points, such as Latin1_General_BIN:

select case 
    when N'ܐܪܡܝܐ' COLLATE Latin1_General_BIN = N'አማርኛ' COLLATE Latin1_General_BIN 
    then 1 else 0 end
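In most programming languages, plain string equality is already code-point equality, i.e. the equivalent of a binary collation. A Python sketch:

```python
# Code-point comparison, the analog of COLLATE Latin1_General_BIN:
s1, s2 = "ܐܪܡܝܐ", "አማርኛ"
print(s1 == s2)  # False - different code points, hence different strings
```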

So how can we analyze how collations affect comparison?

First, I created two tables: One stores records about character groups, the second stores the characters in each group that are considered equal by the collation:

CREATE TABLE [dbo].[CollationString](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Collation] [varchar](100) NOT NULL,
    [String] [nvarchar](50) NOT NULL,
    [StringInt] [int] NULL,
 CONSTRAINT [PK_CollationString] PRIMARY KEY CLUSTERED ([Id] ASC))
GO

CREATE TABLE [dbo].[CollationStringEquivalent](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [CollationStringId] [int] NOT NULL,
    [String] [nvarchar](50) NOT NULL,
    [StringInt] [int] NULL,
 CONSTRAINT [PK_CollationStringEquivalent] PRIMARY KEY CLUSTERED ([Id] ASC))
GO

I added a field Collation so that different collations can be compared later on.

Next, we iterate through all characters in the Unicode BMP, look up the character, and insert according to the result:

set nocount on 

declare @c int = 0
declare @coll nvarchar(50) = 'Latin1_General_CI_AS'
declare @cid int

while @c < 65536 begin

    print @c

    set @cid = null;
    select @cid = id 
        from collationstring 
        where collation collate Latin1_General_CI_AS 
            = @coll collate Latin1_General_CI_AS 
        and string = nchar(@c);
    
    if @cid is null begin
        insert into collationstring(collation, string, stringint) 
            values (@coll, NCHAR(@c), @c);
        select @cid = SCOPE_IDENTITY()
    end
        
    insert into collationstringequivalent(collationstringid, string, stringint) 
        values (@cid, NCHAR(@c), @c);

    set @c = @c + 1
end
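The grouping idea behind the script can be sketched in a few lines of Python, using casefold() as a rough stand-in for a case-insensitive collation (an assumption for illustration only; a real collation compares far more than case):

```python
from collections import defaultdict

# Group characters into equivalence classes by a normalized key,
# mirroring the CollationString / CollationStringEquivalent tables.
classes = defaultdict(list)
for cp in range(0x0020, 0x0080):  # keep the sketch to printable ASCII
    ch = chr(cp)
    classes[ch.casefold()].append(ch)

print(classes["a"])  # ['A', 'a'] - considered equal under this "collation"
```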

After running this script, we can now query which characters are considered different

SELECT  * FROM [CollationString]

and which characters are equal

SELECT  * 
FROM [CollationStringEquivalent]
ORDER BY [CollationStringId], [StringInt]

To analyze the distribution of characters in one collation, we can start with a query like this

SELECT COUNT([Id]), [CollationStringId], MIN([String]), MIN([StringInt])
FROM [CollationStringEquivalent]
GROUP BY collationstringid
ORDER BY COUNT(id) DESC, MIN(stringint)

 

What is the Longest String Resulting from a Normalized Single Unicode Code Point

Introduction

This article deals with Unicode Normalization, Composition, and Decomposition. If you are new to the topic, please read the introduction

Summary

Unicode defines 4 Normalization Forms (source: Unicode)

Form                          Description
Normalization Form D (NFD)    Canonical Decomposition
Normalization Form C (NFC)    Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD)  Compatibility Decomposition
Normalization Form KC (NFKC)  Compatibility Decomposition, followed by Canonical Composition

which may result in different sequences of Unicode code points, depending on the chosen normalization, but which are still considered equal according to the Unicode standard (source: Wikipedia)

Amélie in its two canonically equivalent Unicode forms (NFC and NFD):

NFC character   A    m    é    l    i    e
NFC code point  0041 006d 00e9 006c 0069 0065
NFD code point  0041 006d 0065 0301 006c 0069 0065
NFD character   A    m    e    ◌́    l    i    e
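This example can be reproduced with Python’s unicodedata module:

```python
import unicodedata

composed = unicodedata.normalize("NFC", "Ame\u0301lie")   # e + combining acute -> é
decomposed = unicodedata.normalize("NFD", "Am\u00e9lie")  # é -> e + combining acute
print(len(composed), len(decomposed))  # 6 7
print(" ".join(f"{ord(c):04x}" for c in decomposed))
```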

Experiment

Let’s iterate through all code points in the BMP and collect the lengths of their normalizations in all four forms:

foreach (var nf in new[] { NormalizationForm.FormC, NormalizationForm.FormD,
    NormalizationForm.FormKC, NormalizationForm.FormKD })
{
    var chars = new Dictionary<int, List<string>>();
    chars.Add(0, new List<string>());

    for (var i = 0; i < 65536; i++)
    {
        var s = new string((char)i, 1);

        try
        {
            var l = s.Normalize(nf).Length;

            if (!chars.ContainsKey(l))
                chars.Add(l, new List<string>());

            chars[l].Add(s);
        }
        catch
        {
            // surrogate halves and other invalid code points cannot be normalized
            chars[0].Add(s);
        }
    }

    Console.WriteLine(nf.ToString());
    foreach (var kv in chars.OrderBy(d => d.Key))
    {
        Console.WriteLine("length " + kv.Key + " count " + kv.Value.Count);
    }
}

This results in the output

FormC
length 0 count 2082
length 1 count 63376
length 2 count 76
length 3 count 2
FormD
length 0 count 2082
length 1 count 51258
length 2 count 1167
length 3 count 10993
length 4 count 36
FormKC
length 0 count 2082
length 1 count 62290
length 2 count 692
length 3 count 401
length 4 count 52
length 5 count 15
length 6 count 2
length 8 count 1
length 18 count 1
FormKD
length 0 count 2082
length 1 count 50151
length 2 count 1750
length 3 count 11412
length 4 count 108
length 5 count 16
length 6 count 14
length 7 count 1
length 8 count 1
length 18 count 1

Wait, 18??

To find out what’s going on, I rewrote the code as a WinForms application to get support for Unicode fonts.

As it turns out, the longest sequences depending on the selected Normalization Form are:

FormC

nothing truly spectacular, just two code points resulting in length 3

שּׁ (U+fb2c): 3, ש (U+05e9), ּ (U+05bc), ׁ (U+05c1)
שּׂ (U+fb2d): 3, ש (U+05e9), ּ (U+05bc), ׂ (U+05c2)

FormD

length 4 for some characters in the Greek Extended Block.

FormKC

lengths 4, 5, 6: decomposition of Roman numerals, parenthesized numbers, and Japanese and Latin SQUARE abbreviations (such as units of measurement, etc.)

Ⅷ (U+2167): 4, V (U+0056), I (U+0049), I (U+0049), I (U+0049)
⑽ (U+247d): 4, ( (U+0028), 1 (U+0031), 0 (U+0030), ) (U+0029)
㌳ (U+3333): 4, フ (U+30d5), ィ (U+30a3), ー (U+30fc), ト (U+30c8)

length 8: ‘ARABIC LIGATURE JALLAJALALOUHOU’ (U+FDFB)
ﷻ (U+fdfb): 8, ج (U+062c), ل (U+0644),   (U+0020), ج (U+062c), ل (U+0644), ا (U+0627), ل (U+0644), ه (U+0647)

length 18: ‘ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM’ (U+FDFA)
ﷺ (U+fdfa): 18, ص (U+0635), ل (U+0644), ى (U+0649),   (U+0020), ا (U+0627), ل (U+0644), ل (U+0644), ه (U+0647),   (U+0020), ع (U+0639), ل (U+0644), ي (U+064a), ه (U+0647),   (U+0020), و (U+0648), س (U+0633), ل (U+0644), م (U+0645)

FormKD

length 4, 5, 6: Again Greek, Parenthesized Numerals and Hangul, SQUARE Japanese and Latin abbreviations

length 7: ‘PARENTHESIZED KOREAN CHARACTER OJEON’ (U+321D)

length 8 as under FormKC
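For a quick cross-check outside .NET, Python’s unicodedata reports the same record lengths:

```python
import unicodedata

# The record holders found above:
print(len(unicodedata.normalize("NFKC", "\ufdfa")))  # 18 (U+FDFA)
print(len(unicodedata.normalize("NFC", "\ufb2c")))   # 3 (U+FB2C)
```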

PS

Software often implements more features than described in standards. Have a look at this SO question on the abuse of COMBINING marks to see one result, and my explanation.

ด้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้дด็็็็็้้้้้็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้

Writing “Funny Characters”? Make sure you use the correct Encoding!

A user of SMOscript sent me a bug report:

Some of our stored procedures have apostrophes in comments. If we export our DB and enable the “if not exists” option, our scripts become invalid, as the commented apostrophe ends the string.

He suspected SMOscript might encode the SP code incorrectly, but I made sure that was not the case, so I asked for a (redacted) sample of such a stored procedure.

I added the SP to a database, and in SSMS ran Generate Script on the procedure both with and without the IF NOT EXISTS option. In “plain” mode this simply generates the CREATE PROCEDURE statement, whereas in the IF NOT EXISTS case it will generate the IF NOT EXISTS check, and then create the SP using sp_executesql and dynamic SQL (and some people do not like this very much).

Next, I ran SMOscript with the -i switch, which activates the IF NOT EXISTS mode:

>smoscript.exe -s localhost -d mydatabase -o fooSP -i > fooSP.sql

When I opened the generated file in SSMS, indeed the lines that originally contained an apostrophe now contained a quote which ended the string. But I also noticed that other quote characters got correctly escaped using 2 quotes.

Then it struck me: the email mentioned apostrophes, but what I saw here was quotes!

I opened the original file in Notepad++ in hex mode, and there it was: e2 80 99, the UTF-8 encoding for U+2019, formally named RIGHT SINGLE QUOTATION MARK, but apostrophe among friends 😉
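A Python one-liner confirms the byte sequence:

```python
# U+2019 RIGHT SINGLE QUOTATION MARK is three bytes in UTF-8.
print("\u2019".encode("utf-8").hex(" "))  # e2 80 99
```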

Given its code point, it is obvious that this character is neither in the ASCII character set nor in the ANSI code page, so SMOscript has to generate Unicode or UTF-8 output.

Fortunately, this functionality is already built-in: use -T for Unicode, or -U for UTF-8 encoding:

>smoscript.exe -s localhost -d mydatabase -o fooSP -i -U -f .\fooSP.sql

Note that I use the -f switch to directly write the output to a file, since the console might not be able to display Unicode characters, or it might, and I’m not smart enough 😉 Anyway, directly creating the file from SMOscript frees you from the hassles of piping “funny” characters.

In hindsight, what happened was that SMOscript wrote the output to the console (via piping), and, using the native code page, the console was not able to correctly output the apostrophe character, and replaced it with the quote character, thus breaking the string parameter for sp_executesql.

You should always be aware that if you translate between code pages, not every character can be mapped to a character in the target code page, and some mappings will cause difficulties.

Fortunately, this problem was quite easy to solve.