Foreword
For as long as I can remember, a VARCHAR (or CHAR) was always defined as “1 character equals 1 byte”. Different character sets (code pages) were implemented as COLLATIONs, so that you had basic database support for internationalization.
Then came Unicode, and we got NVARCHAR (or NCHAR) strings, where the rule was “1 character equals 2 bytes”, and we could store any text from around the world without bothering with code pages, encodings, etc. The .NET Framework brought us the string class with similar features, and the world was beautiful.
Then, in 2001, came Unicode 3.1, which needed more space:
For the first time, characters are encoded beyond the original 16-bit codespace or Basic Multilingual Plane (BMP or Plane 0). These new characters, encoded at code positions of U+10000 or higher, are synchronized with the forthcoming standard ISO/IEC 10646-2. For further information, see Article IX, Relation to 10646. Unicode 3.1 and 10646-2 define three new supplementary planes.
These supplementary planes were finally supported in SQL Server 2012: using an *_SC (Supplementary Characters) collation, an NVARCHAR character can take 2 or 4 bytes.
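The byte arithmetic behind this can be checked outside SQL Server; here is a small Python sketch (Python strings are sequences of code points, and str.encode gives the byte length under a chosen encoding):

```python
# A BMP character (ö, U+00F6) fits in one UTF-16 code unit (2 bytes);
# a supplementary-plane character (U+2000B) needs a surrogate pair (4 bytes).
for ch in ("\u00F6", "\U0002000B"):
    utf16_bytes = len(ch.encode("utf-16-le"))
    print(f"U+{ord(ch):04X}: {utf16_bytes} bytes in UTF-16")
```

This prints 2 bytes for U+00F6 and 4 bytes for U+2000B, which is exactly the 2-or-4-bytes-per-character situation an *_SC collation has to handle.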
In C#, the StringInfo class handles supplementary planes, but it seems they are still a bit behind:
Starting with the .NET Framework 4.6.2, character classification is based on The Unicode Standard, Version 8.0.0. For the .NET Framework 4 through the .NET Framework 4.6.1, it is based on The Unicode Standard, Version 6.3.0. In .NET Core, it is based on The Unicode Standard, Version 8.0.0.
(For the record, the current Unicode version is 12.1, and 13.0 is going to be released soon)
UTF-8 Collations
So now SQL Server 2019 supports UTF-8-enabled collations.
A question on SO quoted the documentation as
A common misconception is to think that CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n) the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored
(emphasis mine) which confused me a little bit, and the quote continues
The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n.
(emphasis mine).
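The distinction only becomes visible once an encoding uses more than one byte per character. A quick Python illustration of the same point (the character is one of the test values used below):

```python
# Under UTF-8, one character can occupy 1 to 4 bytes, so a VARCHAR(n)
# with a UTF-8 collation holds *up to* n characters, not exactly n.
s = "\u56D7"                   # 囗, a single character
print(len(s))                  # 1 character
print(len(s.encode("utf-8")))  # 3 bytes in UTF-8
```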
This got me investigating, so I created a UTF-8-enabled database with a table containing all kinds of N/VARCHAR columns
CREATE DATABASE [test-sc] COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8

CREATE TABLE [dbo].[UTF8Test](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [VarcharText]    [varchar](50)  COLLATE Latin1_General_100_CI_AI NULL,
    [VarcharTextSC]  [varchar](50)  COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
    [VarcharUTF8]    [varchar](50)  COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL,
    [NVarcharText]   [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS NULL,
    [NVarcharTextSC] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
    [NVarcharUTF8]   [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL
)
I inserted test data from various Unicode ranges
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('a','a','a','a','a','a')
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('ö','ö','ö',N'ö',N'ö',N'ö')
-- U+56D7
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'囗',N'囗',N'囗',N'囗',N'囗',N'囗')
-- U+2000B
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋')
then selected the lengths and data lengths of each text field
SELECT TOP (1000) [Id]
    ,[VarcharText],[VarcharTextSC],[VarcharUTF8]
    ,[NVarcharText],[NVarcharTextSC],[NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]

SELECT TOP (1000) [Id]
    ,LEN([VarcharText]) VT, LEN([VarcharTextSC]) VTSC, LEN([VarcharUTF8]) VU
    ,LEN([NVarcharText]) NVT, LEN([NVarcharTextSC]) NVTSC, LEN([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]

SELECT TOP (1000) [Id]
    ,DATALENGTH([VarcharText]) VT, DATALENGTH([VarcharTextSC]) VTSC, DATALENGTH([VarcharUTF8]) VU
    ,DATALENGTH([NVarcharText]) NVT, DATALENGTH([NVarcharTextSC]) NVTSC, DATALENGTH([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]
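As a cross-check of what DATALENGTH should report, the byte counts of the four test values under UTF-8 (the UTF-8-collated VARCHAR column) and UTF-16 (the NVARCHAR columns) can be computed outside SQL Server. This Python sketch mirrors the encoding arithmetic, not SQL Server itself:

```python
# Expected bytes per value: 'a' -> 1 (UTF-8) / 2 (UTF-16),
# 'ö' -> 2/2, '囗' (U+56D7) -> 3/2, '𠀋' (U+2000B) -> 4/4.
for s in ("a", "\u00F6", "\u56D7", "\U0002000B"):
    print(f"U+{ord(s):04X}: "
          f"UTF-8 = {len(s.encode('utf-8'))} bytes, "
          f"UTF-16 = {len(s.encode('utf-16-le'))} bytes")
```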
I was surprised to find that the old mantra “a VARCHAR only stores single-byte characters” needs to be revised when using UTF-8 collations.
Table data only
Note that collations are associated with table columns only, not with T-SQL variables; you cannot declare a collation on a variable
DECLARE @VarcharText VARCHAR(50), @NVarcharText NVARCHAR(50)

SELECT @VarcharText = [VarcharText], @NVarcharText = [NVarcharText]
FROM [test-sc].[dbo].[UTF8Test] WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharTextSC], @NVarcharText = [NVarcharTextSC]
FROM [test-sc].[dbo].[UTF8Test] WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharUTF8], @NVarcharText = [NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test] WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)
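The LEN discrepancy for supplementary characters has a simple analogue outside T-SQL: without an _SC collation, LEN counts UTF-16 code units; with _SC, it counts characters. A Python sketch of the same distinction (this mirrors the counting rule, it does not reproduce SQL Server's behavior):

```python
s = "\U0002000B"  # the supplementary character from row Id = 4
# Without _SC: count 2-byte UTF-16 code units -> a surrogate pair counts as 2.
utf16_units = len(s.encode("utf-16-le")) // 2
# With _SC: count characters (code points) -> the same string counts as 1.
code_points = len(s)
print(utf16_units, code_points)
```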