Different Strings considered Equal – Depending on your Collation

I recently came across this interesting question on SO:

Why does the statement

select case when  N'ܐܪܡܝܐ' = N'አማርኛ' then 1 else 0 end

return 1?

In case you are wondering, the first string seems to be Syriac/Aramaic for Aramaic, and the second string is Amharic for Amharic, as found on Wikipedia.

Obviously, the strings are not equal, but still the result of the comparison is true. But why?

The reason is, of course, that string comparison and sorting in SQL Server depend on the collation in use. And the collation used in the example was SQL_Latin1_General_CP1_CI_AS, which does not distinguish Aramaic and Amharic characters.

If you just want to test whether two strings are equal as in “equal Code Points” or “binary equal”, you need to select a specific collation which distinguishes the code points, such as Latin1_General_BIN:

select case 
    when N'ܐܪܡܝܐ' COLLATE Latin1_General_BIN = N'አማርኛ' COLLATE Latin1_General_BIN 
    then 1 else 0 end

So how can we analyze how collations affect comparison?

First, I created two tables: One stores records about character groups, the second stores the characters in each group that are considered equal by the collation:

CREATE TABLE [dbo].[CollationString](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Collation] [varchar](100) NOT NULL,
    [String] [nvarchar](50) NOT NULL,
    [StringInt] [int] NULL,
 CONSTRAINT [PK_CollationString] PRIMARY KEY CLUSTERED ([Id] ASC))
GO

CREATE TABLE [dbo].[CollationStringEquivalent](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [CollationStringId] [int] NOT NULL,
    [String] [nvarchar](50) NOT NULL,
    [StringInt] [int] NULL,
 CONSTRAINT [PK_CollationStringEquivalent] PRIMARY KEY CLUSTERED ([Id] ASC))
GO

I added a field Collation so that different collations can be compared later on.

Next, we iterate through all characters in the Unicode BMP, look up the character, and insert according to the result:

set nocount on 

declare @c int = 0
declare @coll nvarchar(50) = 'Latin1_General_CI_AS'
declare @cid int

while @c < 65536 begin

    print @c

    set @cid = null;
    select @cid = id 
        from collationstring 
        where collation collate Latin1_General_CI_AS 
            = @coll collate Latin1_General_CI_AS 
        and string = nchar(@c);
    
    if @cid is null begin
        insert into collationstring(collation, string, stringint) 
            values (@coll, NCHAR(@c), @c);
        select @cid = SCOPE_IDENTITY()
    end
        
    insert into collationstringequivalent(collationstringid, string, stringint) 
        values (@cid, NCHAR(@c), @c);

    set @c = @c + 1
end

After running this script, we can now query which characters are considered different

SELECT  * FROM [CollationString]

and which characters are equal

SELECT  * 
FROM [CollationStringEquivalent]
ORDER BY [CollationStringId], [StringInt]

To analyze the distribution of characters in one collation, we can start with a query like this

SELECT COUNT([Id]), [CollationStringId], MIN([String]), MIN([StringInt])
FROM [CollationStringEquivalent]
GROUP BY collationstringid
ORDER BY COUNT(id) DESC, MIN(stringint)

 

Advertisements

One Response to Different Strings considered Equal – Depending on your Collation

  1. […] This is a follow-up article on my previous post about string equality and collations. […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: