Length of UTF-8 VARCHAR in SQL Server

Foreword

For as long as I can remember, a VARCHAR (or CHAR) was always defined as “1 character equals 1 byte”. Different character sets (code pages) were implemented as COLLATIONs, so that you had basic database support for internationalization.

Then came Unicode, and we got NVARCHAR strings (or NCHAR), where the rule was “1 character equals 2 bytes”, and we could store any text from around the world without bothering with code pages, encodings, etc. The .NET Framework brought us the string class with similar features, and the world was beautiful.

Then, in 2001, came Unicode 3.1, which needed more space:

For the first time, characters are encoded beyond the original 16-bit codespace or Basic Multilingual Plane (BMP or Plane 0). These new characters, encoded at code positions of U+10000 or higher, are synchronized with the forthcoming standard ISO/IEC 10646-2. For further information, see Article IX, Relation to 10646. Unicode 3.1 and 10646-2 define three new supplementary planes.

SQL Server 2012 added support for these additional planes through the *_SC (supplementary characters) collations: with such a collation, an NVARCHAR character can occupy either 2 or 4 bytes.
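A minimal sketch of what that means (using the *_SC collation that also appears later in this post; the commented values are what I expect from the documentation, not measured output):

-- U+2000B lies outside the BMP and is stored as a surrogate pair
SELECT DATALENGTH(N'𠀋') AS [Bytes],                                  -- 4: two UTF-16 code units
    LEN(N'𠀋') AS [Chars],                                            -- 2 under a non-SC default collation
    LEN(N'𠀋' COLLATE Latin1_General_100_CI_AI_KS_SC) AS [CharsSC]    -- 1: the *_SC collation counts the pair as one character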

In C#, the StringInfo class handles supplementary planes, but it seems they are still a bit behind:

Starting with the .NET Framework 4.6.2, character classification is based on The Unicode Standard, Version 8.0.0. For the .NET Framework 4 through the .NET Framework 4.6.1, it is based on The Unicode Standard, Version 6.3.0. In .NET Core, it is based on The Unicode Standard, Version 8.0.0.

(For the record, the current Unicode version is 12.1, and 13.0 is going to be released soon)

UTF-8 Collations

So now SQL Server 2019 supports UTF-8-enabled collations.

A question on SO quoted the documentation as

A common misconception is to think that CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n) the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored

(emphasis mine) which confused me a little bit, and the quote continues

The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n.

(emphasis mine).

This got me investigating, so I had a closer look. I created a UTF-8-enabled database containing a table with all kinds of VARCHAR and NVARCHAR columns:

CREATE DATABASE [test-sc] COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8
GO
USE [test-sc]
GO

CREATE TABLE [dbo].[UTF8Test](
  [Id] [int] IDENTITY(1,1) NOT NULL,
  [VarcharText] [varchar](50) COLLATE Latin1_General_100_CI_AI NULL,
  [VarcharTextSC] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
  [VarcharUTF8] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL,
  [NVarcharText] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS NULL,
  [NVarcharTextSC] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
  [NVarcharUTF8] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL
)

I inserted test data from various Unicode ranges

INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('a','a','a','a','a','a')
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('ö','ö','ö',N'ö',N'ö',N'ö')
-- U+56D7
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'囗',N'囗',N'囗',N'囗',N'囗',N'囗')
-- U+2000B
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋')

then selected the lengths and data lengths of each text field

SELECT TOP (1000) [Id]
    ,[VarcharText],[VarcharTextSC],[VarcharUTF8]
    ,[NVarcharText],[NVarcharTextSC],[NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
    ,LEN([VarcharText]) VT,LEN([VarcharTextSC]) VTSC
    ,LEN([VarcharUTF8]) VU
    ,LEN([NVarcharText]) NVT,LEN([NVarcharTextSC]) NVTSC
    ,LEN([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
    ,DATALENGTH([VarcharText]) VT,DATALENGTH([VarcharTextSC]) VTSC
    ,DATALENGTH([VarcharUTF8]) VU
    ,DATALENGTH([NVarcharText]) NVT,DATALENGTH([NVarcharTextSC]) NVTSC
    ,DATALENGTH([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]

(Screenshot: Select Lengths.png)

I was surprised to find that the old mantra “a VARCHAR only stores single-byte characters” needs to be revised when using UTF-8 collations.
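Since the screenshot does not reproduce well as text, here is what the UTF-8 and UTF-16 encodings predict for the four test rows (a hedged summary derived from the encoding rules, not re-measured output):

SELECT [Id], DATALENGTH([VarcharUTF8]) AS [VU], DATALENGTH([NVarcharUTF8]) AS [NVU]
FROM [test-sc].[dbo].[UTF8Test]
ORDER BY [Id]
-- expected VU:  1 ('a', U+0061), 2 ('ö', U+00F6), 3 ('囗', U+56D7), 4 ('𠀋', U+2000B)
-- expected NVU: 2, 2, 2, 4; so with a UTF-8 collation a single VARCHAR character can occupy 1 to 4 bytes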

Table data only

Note that collations are associated with table columns, but not with T-SQL variables; you cannot declare a collation on a variable, so a variable simply uses the default collation of the current database:

DECLARE @VarcharText varchar(50), @NVarcharText nvarchar(50)

SELECT @VarcharText = [VarcharText],@NVarcharText = [NVarcharText]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharTextSC], @NVarcharText = [NVarcharTextSC]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharUTF8], @NVarcharText = [NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)

(Screenshot: Select Variable Lengths.png)

Why do I need the [Flags] Attribute?

As a developer, you often take things for granted because they are “accepted best practices”, until somebody asks about the most basic and simplest things (“why do I need [X]?”), and you think, well, that's obvious, only to realize that the (correct) answer was not so obvious at all, especially if the documentation is not accurate.

It happened to me when I came across the SO question, “What does the [Flags] Attribute Really do?”

Time to investigate.

[Flags] can be applied to enums, and enums are (basically) type-safe ints.

In C#, you can apply bitwise operators such as | (or) and & (and) to enum values, so the documentation of the FlagsAttribute class

Indicates that an enumeration can be treated as a bit field; that is, a set of flags.

raises some questions: “can be?”, “as opposed to what?”, “is it really necessary?”

First, the enum values have to be powers of 2 to make the Flags attribute behave as expected.

Let’s take colors as an example: three different color enumerations (yes, we only have 1-bit color depth here 😉 )

enum ColorEnum
{
	None,
	Red,
	Green,
	Blue
}

enum SimpleColor
{
	Red = 1,
	Green = 2,
	Blue = 4
}

[Flags]
enum FlagsColor
{
	Red = 1,
	Green = 2,
	Blue = 4,
	White = 7
}

ColorEnum can only take a single color; SimpleColor (powers of 2) *could* take a combination of colors, but lacks the [Flags] attribute; and FlagsColor has [Flags] and defines White as a combination of Red|Green|Blue.

To test the different enum definitions, we call the enum’s .ToString() method, cast the enum values to int, and Parse() the result of the ToString() (using C#4 in VS2010):

ColorEnum ce = ColorEnum.Red;
Console.WriteLine("ColorEnum Red: " + ce.ToString());
Console.WriteLine("ColorEnum Red: " + ((int)ce).ToString());
Console.WriteLine("  parse: " + 
    Enum.Parse(typeof(ColorEnum), ce.ToString(), true).ToString());

which has the unsurprising result:

ColorEnum Red: Red
ColorEnum Red: 1
  parse: Red

A bit unexpected, but you can use bitwise operators on enums without a [Flags] attribute

ce = ColorEnum.Red | ColorEnum.Green;
Console.WriteLine("ColorEnum Red | Green: " + ce.ToString());
Console.WriteLine("  parse: " + 
    Enum.Parse(typeof(ColorEnum), ce.ToString(), true).ToString());

Since ColorEnum's implicit values are None==0, Red==1, Green==2, and Blue==3, Red | Green equals 3, which is Blue, and the result is

ColorEnum Red | Green: Blue
  parse: Blue

Let’s experiment with SimpleColor, an enum holding power-of-2 values, but without the [Flags] attribute:

SimpleColor sc = SimpleColor.Red;
Console.WriteLine("SimpleColor Red: " + sc.ToString());
Console.WriteLine("  parse: " + 
    Enum.Parse(typeof(SimpleColor), sc.ToString(), true).ToString());

sc = SimpleColor.Red | SimpleColor.Green;
Console.WriteLine("SimpleColor Red | Green: " + sc.ToString());
Console.WriteLine("  parse: " + 
    Enum.Parse(typeof(SimpleColor), sc.ToString(), true).ToString());

Note that Red|Green gives 3, and the enum does not define a symbol for value 3:

SimpleColor Red: Red
  parse: Red
SimpleColor Red | Green: 3
  parse: 3

Finally, the FlagsColor enum with [Flags] attribute:

FlagsColor fc = FlagsColor.Red;
Console.WriteLine("FlagsColor Red: " + fc.ToString());
Console.WriteLine("  parse: " + 
    Enum.Parse(typeof(FlagsColor), fc.ToString(), true).ToString());
fc = FlagsColor.Red | FlagsColor.Green;
Console.WriteLine("FlagsColor Red | Green: " + fc.ToString());
Console.WriteLine("  parse: " + 
    Enum.Parse(typeof(FlagsColor), fc.ToString(), true).ToString());

Console.WriteLine("  parse FlagsColor as SimpleColor: " + 
    Enum.Parse(typeof(SimpleColor), fc.ToString(), true).ToString());
Console.WriteLine("  parse FlagsColor as ColorEnum: " + 
    Enum.Parse(typeof(ColorEnum), fc.ToString(), true).ToString());

The first part produces what is expected:

FlagsColor Red: Red
  parse: Red
FlagsColor Red | Green: Red, Green
  parse: Red, Green

But what about the stringified bit combination “Red, Green” if it is parsed by the other enums?

  parse FlagsColor as SimpleColor: 3
  parse FlagsColor as ColorEnum: Blue

Seems that the Enum.Parse() method ignores the (lack of the) [Flags] attribute!

The .HasFlag() method also works regardless of the [Flags] attribute:

Console.WriteLine("ColorEnum has Red? " + ce.HasFlag(ColorEnum.Red));
Console.WriteLine("SimpleColor has Red? " + sc.HasFlag(SimpleColor.Red));
Console.WriteLine("FlagsColor has Red? " + fc.HasFlag(FlagsColor.Red));

Console.WriteLine("ColorEnum has Blue? " + ce.HasFlag(ColorEnum.Blue));
Console.WriteLine("SimpleColor has Blue? " + sc.HasFlag(SimpleColor.Blue));
Console.WriteLine("FlagsColor has Blue? " + fc.HasFlag(FlagsColor.Blue));

results in:

ColorEnum has Red? True
SimpleColor has Red? True
FlagsColor has Red? True
ColorEnum has Blue? True
SimpleColor has Blue? False
FlagsColor has Blue? False

Finally, testing a composite bit value:

fc = FlagsColor.Red | FlagsColor.Green | FlagsColor.Blue;
Console.WriteLine("FlagsColor RGB: " + fc.ToString());
Console.WriteLine("  parse: " + 
    Enum.Parse(typeof(FlagsColor), "Red, Green, Blue", true).ToString());
Console.WriteLine("FlagsColor has White? " + fc.HasFlag(FlagsColor.White));

results in

FlagsColor RGB: White
  parse: White
FlagsColor has White? True

If you work with VB.Net rather than C#, this entry on social.msdn may be interesting for you:

Although C# happily allows users to perform bit operations on enums without the FlagsAttribute, Visual Basic does not. So if you are exposing types to other languages, then marking enums with the FlagsAttribute is a good idea.

It also states

(The [Flags] attribute) makes it clear that the members of the enum are designed to be used together.

Some things should be described more explicitly in the documentation.

Why do I need NVARCHAR columns?

I came across an (MSSQL) database that contained almost only VARCHAR columns, even though the target audience was expected to come from all over the European Union.

In my projects, I am used to defining NVARCHAR columns for all values that can be entered by users, and to using VARCHAR only for strings that are known to be ASCII, such as program identifiers, URLs, or email addresses.

I tried to show the disadvantage of using VARCHAR by entering accented characters, expecting a default Western collation, only to fail because the characters were displayed correctly. I found out later that the collation had been explicitly set to a character set supporting these accented characters.

I prepared more thoroughly for the next time, and wrote a little T-SQL script to illustrate my point:

create table #t (
	id int identity,
	sA varchar(100),	-- insert varchar
	sN varchar(100),	-- insert nvarchar (result same as varchar)
	sC varchar(100) collate Croatian_CI_AS,	-- insert nvarchar
	n nvarchar(100)
)

declare @i int
declare @s varchar(100)	-- implicit collation of database
declare @n nvarchar(100)
set @s = ''
set @n = ''

set @i = 32
while @i < 1000 begin
	set @s = @s + nchar(@i)
	set @n = @n + nchar(@i)
	set @i = @i + 1

	if (len(@s)=32) begin
		insert into #t (sA, n, sN, sC) values (@s, @n, @n, @n)

		print @s
		print @n

		set @s = ''
		set @n = ''
	end
end

select id as Row, sA as DBCollated, sC as HRCollated, n as Unicoded
from #t
order by id

drop table #t

Table #T contains a couple of text columns:

  • sA varchar with database collation is set by a varchar variable
  • sN varchar with database collation is set by an nvarchar variable
  • sC varchar with explicit collation is also set by an nvarchar variable
  • n nvarchar is set by an nvarchar variable

In the loop, I construct strings of 32 Unicode characters, covering code points 32 through 999 (an arbitrary range), and insert the same value into the four columns.

The findings of this experiment are:

VARCHAR variables have the same collation as the database. You cannot declare a COLLATE clause in the variable definition. Thus sA and sN contain the same values, as would be expected when mapping Unicode characters onto a code page.
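A minimal sketch of that restriction (the variable name is made up; the first statement is commented out because it does not compile):

-- declare @x varchar(100) collate Croatian_CI_AS	-- not allowed: COLLATE cannot appear in a DECLARE
declare @x varchar(100)		-- implicitly uses the database default collation
set @x = 'test'
select @x collate Croatian_CI_AS	-- COLLATE can only be applied to the expression afterwards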

The resulting table looks like this (only 16 characters per row for better readability):

Row  VARCHAR Latin1_General_CI_AS (database collation)  VARCHAR Croatian_CI_AS (explicit collation)  NVARCHAR (Unicode)
1 !”#$%&'()*+,-./ !”#$%&'()*+,-./ !”#$%&'()*+,-./
2 0123456789:;<=>? 0123456789:;<=>? 0123456789:;<=>?
3 @ABCDEFGHIJKLMNO @ABCDEFGHIJKLMNO @ABCDEFGHIJKLMNO
4 PQRSTUVWXYZ[\]^_ PQRSTUVWXYZ[\]^_ PQRSTUVWXYZ[\]^_
5 `abcdefghijklmno `abcdefghijklmno `abcdefghijklmno
6 pqrstuvwxyz{|}~ pqrstuvwxyz{|}~ pqrstuvwxyz{|}~
7 ????????????? ??ƒ????ˆ??????? €‚ƒ„…†‡ˆ‰Š‹ŒŽ
8 ?????????????? ???????˜??????? ‘’“”•–—˜™š›œžŸ
9 ¡¢£¤¥¦§¨©ª«¬­®¯ !cL¤Y¦§¨©a«¬­®— ¡¢£¤¥¦§¨©ª«¬­®¯
10 °±²³´µ¶·¸¹º»¼½¾¿ °±23´µ¶·¸1o»113? °±²³´µ¶·¸¹º»¼½¾¿
11 ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ AÁÂAÄAAÇEÉEËIÍÎI ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
12 ÐÑÒÓÔÕÖרÙÚÛÜÝÞß ?NOÓÔOÖ×OUÚUÜÝ?ß ÐÑÒÓÔÕÖרÙÚÛÜÝÞß
13 àáâãäåæçèéêëìíîï aáâaäaaçeéeëiíîi àáâãäåæçèéêëìíîï
14 ðñòóôõö÷øùúûüýþÿ ?noóôoö÷ouúuüý?y ðñòóôõö÷øùúûüýþÿ
15 AaAaAaCcCcCcCcDd AaĂ㥹ĆćCcCcČčĎď ĀāĂ㥹ĆćĈĉĊċČčĎď
16 ÐdEeEeEeEeEeGgGg ĐđEeEeEeĘęĚěGgGg ĐđĒēĔĕĖėĘęĚěĜĝĞğ
17 GgGgHhHhIiIiIiIi GgGgHhHhIiIiIiIi ĠġĢģĤĥĦħĨĩĪīĬĭĮį
18 Ii??JjKk?LlLlLl? Ii??JjKk?ĹĺLlĽľ? İıIJijĴĵĶķĸĹĺĻļĽľĿ
19 ?LlNnNnNn???OoOo ?ŁłŃńNnŇň???OoOo ŀŁłŃńŅņŇňʼnŊŋŌōŎŏ
20 OoŒœRrRrRrSsSsSs ŐőOoŔŕRrŘřŚśSsŞş ŐőŒœŔŕŖŗŘřŚśŜŝŞş
21 ŠšTtTtTtUuUuUuUu ŠšŢţŤťTtUuUuUuŮů ŠšŢţŤťŦŧŨũŪūŬŭŮů
22 UuUuWwYyŸZzZzŽž? ŰűUuWwYyYŹźŻżŽž? ŰűŲųŴŵŶŷŸŹźŻżŽžſ
23 b????????Ð?????? b????????Đ?????? ƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏ
24 ?ƒƒ????I??l????O ?Ff????I??l????O ƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟ
25 Oo?????????t??TU Oo?????????t??TU ƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯ
26 u?????z????????? u?????z????????? ưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿ
27 |??!?????????AaI |??!?????????AaI ǀǁǂǃDŽDždžLJLjljNJNjnjǍǎǏ
28 iOoUuUuUuUuUu?Aa iOoUuUuUuUuUu?Aa ǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟ
29 ????GgGgKkOoOo?? ????GgGgKkOoOo?? ǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯ
30 j??????????????? j??????????????? ǰDZDzdzǴǵǶǷǸǹǺǻǼǽǾǿ
31 ???????????????? ???????????????? ȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏ
32 ???????????????? ???????????????? ȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟ
33 ???????????????? ???????????????? ȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯ
34 ???????????????? ???????????????? ȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿ
35 ???????????????? ???????????????? ɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏ
36 ???????????????? ???????????????? ɐɑɒɓɔɕɖɗɘəɚɛɜɝɞɟ
37 ?g?????????????? ?g?????????????? ɠɡɢɣɤɥɦɧɨɩɪɫɬɭɮɯ
38 ???????????????? ???????????????? ɰɱɲɳɴɵɶɷɸɹɺɻɼɽɾɿ
39 ???????????????? ???????????????? ʀʁʂʃʄʅʆʇʈʉʊʋʌʍʎʏ
40 ???????????????? ???????????????? ʐʑʒʓʔʕʖʗʘʙʚʛʜʝʞʟ
41 ???????????????? ???????????????? ʠʡʢʣʤʥʦʧʨʩʪʫʬʭʮʯ
42 ?????????'”?’??? ?????????'”‘’??? ʰʱʲʳʴʵʶʷʸʹʺʻʼʽʾʿ
43 ????^?ˆ?’¯´`?_?? ????^?^ˇ’Ż´`?_?? ˀˁ˂˃˄˅ˆˇˈˉˊˋˌˍˎˏ
44 ??????????°?˜??? ????????˘˙°˛~˝?? ːˑ˒˓˔˕˖˗˘˙˚˛˜˝˞˟
45 ???????????????? ???????????????? ˠˡˢˣˤ˥˦˧˨˩˪˫ˬ˭ˮ˯
46 ???????????????? ???????????????? ˰˱˲˳˴˵˶˷˸˹˺˻˼˽˾˿
47 `´^~¯¯??¨?°???”? `´^~ŻŻ˘˙¨?°?ˇ?”? ̀́̂̃̄̅̆̇̈̉̊̋̌̍̎̏
48 ???????????????? ???????????????? ̛̖̗̘̙̜̝̞̟̐̑̒̓̔̕̚
49 ???????¸???????? ???????¸???????? ̡̢̧̨̠̣̤̥̦̩̪̫̬̭̮̯
50 ?__????????????? ?__????????????? ̴̵̶̷̸̰̱̲̳̹̺̻̼̽̾̿
51 ???????????????? ???????????????? ͇͈͉͍͎̀́͂̓̈́͆͊͋͌ͅ͏
52 ???????????????? ???????????????? ͓͔͕͖͙͚͐͑͒͗͛͘͜͟͝͞
53 ???????????????? ???????????????? ͣͤͥͦͧͨͩͪͫͬͭͮͯ͢͠͡
54 ??????????????;? ??????????????;? ͰͱͲͳʹ͵Ͷͷ͸͹ͺͻͼͽ;Ϳ
55 ???????????????? ???????????????? ΀΁΂΃΄΅Ά·ΈΉΊ΋Ό΍ΎΏ
56 ???G????T??????? ???????????????? ΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟ
57 ???S??F??O?????? ???????????????? ΠΡ΢ΣΤΥΦΧΨΩΪΫάέήί
58 ?aß?de??????µ??? ??ß?????????µ??? ΰαβγδεζηθικλμνξο
59 p??st?f????????? ???????????????? πρςστυφχψωϊϋόύώϏ
60 ???????????????? ???????????????? ϐϑϒϓϔϕϖϗϘϙϚϛϜϝϞϟ

One can easily see that the original Unicode values are translated into VARCHAR values according to the COLLATION setting. Some characters not contained in the collation's code page are mapped onto similar characters without accents; others are simply replaced by question marks.
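If you want to demonstrate the mapping on single characters rather than whole ranges, here is a minimal sketch along the same lines (the table variable and test values are made up; the commented results follow from the code-page mappings and, for 'Č', match row 15 of the table above):

declare @t table (
	Latin1 varchar(10) collate Latin1_General_CI_AS,
	Croatian varchar(10) collate Croatian_CI_AS
)
insert into @t values (N'Č', N'Č')	-- U+010C
insert into @t values (N'囗', N'囗')	-- U+56D7
select Latin1, Croatian from @t
-- expected row 1: Latin1 = 'C' (best-fit mapping strips the caron), Croatian = 'Č' (code page 1250 contains it)
-- expected row 2: Latin1 = '?', Croatian = '?' (no mapping in either code page)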

Use this code if you need to explain the differences between VARCHAR and NVARCHAR, and why using NVARCHAR is not a bad idea.