I recently came across this interesting question on SO:
Why does the statement
select case when N'ܐܪܡܝܐ' = N'አማርኛ' then 1 else 0 end
Obviously, the strings are not equal, but still the result of the comparison is true. But why?
The reason is, of course, that string comparison and sorting in SQL Server depend on the collation in use. And the collation used in the example was SQL_Latin1_General_CP1_CI_AS, which does not distinguish Aramaic and Amharic characters.
If you just want to test whether two strings are equal as in “equal Code Points” or “binary equal”, you need to select a specific collation which distinguishes the code points, such as Latin1_General_BIN:
select case when N'ܐܪܡܝܐ' COLLATE Latin1_General_BIN = N'አማርኛ' COLLATE Latin1_General_BIN then 1 else 0 end
So how can we analyze how collations affect comparison?
First, I created two tables: One stores records about character groups, the second stores the characters in each group that are considered equal by the collation:
CREATE TABLE [dbo].[CollationString]( [Id] [int] IDENTITY(1,1) NOT NULL, [Collation] [varchar](100) NOT NULL, [String] [nvarchar](50) NOT NULL, [StringInt] [int] NULL, CONSTRAINT [PK_CollationString] PRIMARY KEY CLUSTERED ([Id] ASC)) GO CREATE TABLE [dbo].[CollationStringEquivalent]( [Id] [int] IDENTITY(1,1) NOT NULL, [CollationStringId] [int] NOT NULL, [String] [nvarchar](50) NOT NULL, [StringInt] [int] NULL, CONSTRAINT [PK_CollationStringEquivalent] PRIMARY KEY CLUSTERED ([Id] ASC)) GO
I added a field Collation so that different collations can be compared later on.
Next, we iterate through all characters in the Unicode BMP, look up the character, and insert according to the result:
set nocount on declare @c int = 0 declare @coll nvarchar(50) = 'Latin1_General_CI_AS' declare @cid int while @c < 65536 begin print @c set @cid = null; select @cid = id from collationstring where collation collate Latin1_General_CI_AS = @coll collate Latin1_General_CI_AS and string = nchar(@c); if @cid is null begin insert into collationstring(collation, string, stringint) values (@coll, NCHAR(@c), @c); select @cid = SCOPE_IDENTITY() end insert into collationstringequivalent(collationstringid, string, stringint) values (@cid, NCHAR(@c), @c); set @c = @c + 1 end
After running this script, we can now query which characters are considered different
SELECT * FROM [CollationString]
and which characters are equal
SELECT * FROM [CollationStringEquivalent] ORDER BY [CollationStringId], [StringInt]
To analyze the distribution of characters in one collation, we can start with a query like this
SELECT COUNT([Id]), [CollationStringId], MIN([String]), MIN([StringInt]) FROM [CollationStringEquivalent] GROUP BY collationstringid ORDER BY COUNT(id) DESC, MIN(stringint)