Unicode Versions supported in SQL Server

December 13, 2009

Update to a previous post on SQL Server and Unicode.

Which version of the Unicode Standard is supported by MS SQL Server?

SQL Server Version Unicode Version
2000 3.0
2005 3.2
2008 5.0

Transliterating Cyrillic into Latin alphabet with T-SQL

December 13, 2009

After sketching my CTE-based multi-REPLACE procedure, I wanted to test in on a real-life scenario. First example is a translation table to convert the cyrillic alphabet into Latin.

First we need to define the data and the translation tables using a case-sensitive cyrillic collation with COLLATE Cyrillic_General_CS_AS:

CREATE TABLE #T (
    level int identity(1,1),
    old    NVARCHAR(50) COLLATE Cyrillic_General_CS_AS,
    new    NVARCHAR(50) COLLATE Cyrillic_General_CS_AS
)

CREATE TABLE #Data (
    data    NVARCHAR(50) COLLATE Cyrillic_General_CS_AS
)

Next, we define the transliterations. See here for a list of standardized transliterations.

INSERT INTO #T VALUES(N'ый', 'y')

INSERT INTO #T VALUES(N'ЫЙ', 'Y')

INSERT INTO #T VALUES(N'а', 'a')
INSERT INTO #T VALUES(N'б', 'b')
INSERT INTO #T VALUES(N'в', 'v')
INSERT INTO #T VALUES(N'г', 'g')
INSERT INTO #T VALUES(N'д', 'd')
INSERT INTO #T VALUES(N'е', 'e')
INSERT INTO #T VALUES(N'ё', 'yo')
INSERT INTO #T VALUES(N'ж', 'zh')
INSERT INTO #T VALUES(N'з', 'z')
INSERT INTO #T VALUES(N'и', 'i')
INSERT INTO #T VALUES(N'й', 'y')
INSERT INTO #T VALUES(N'к', 'k')
INSERT INTO #T VALUES(N'л', 'l')
INSERT INTO #T VALUES(N'м', 'm')
INSERT INTO #T VALUES(N'н', 'n')
INSERT INTO #T VALUES(N'о', 'o')
INSERT INTO #T VALUES(N'п', 'p')
INSERT INTO #T VALUES(N'р', 'r')
INSERT INTO #T VALUES(N'с', 's')
INSERT INTO #T VALUES(N'т', 't')
INSERT INTO #T VALUES(N'у', 'u')
INSERT INTO #T VALUES(N'ф', 'f')
INSERT INTO #T VALUES(N'х', 'kh')
INSERT INTO #T VALUES(N'ц', 'c')
INSERT INTO #T VALUES(N'ч', 'ch')
INSERT INTO #T VALUES(N'ш', 'sh')
INSERT INTO #T VALUES(N'щ', 'shch')
INSERT INTO #T VALUES(N'ъ', '')
INSERT INTO #T VALUES(N'ы', 'y')
INSERT INTO #T VALUES(N'ь', '''')
INSERT INTO #T VALUES(N'э', 'e')
INSERT INTO #T VALUES(N'ю', 'yu')
INSERT INTO #T VALUES(N'я', 'ya')

INSERT INTO #T VALUES(N'А', 'A')
INSERT INTO #T VALUES(N'Б', 'B')
INSERT INTO #T VALUES(N'В', 'V')
INSERT INTO #T VALUES(N'Г', 'G')
INSERT INTO #T VALUES(N'Д', 'D')
INSERT INTO #T VALUES(N'Е', 'E')
INSERT INTO #T VALUES(N'Ё', 'YO')
INSERT INTO #T VALUES(N'Ж', 'ZH')
INSERT INTO #T VALUES(N'З', 'Z')
INSERT INTO #T VALUES(N'И', 'I')
INSERT INTO #T VALUES(N'Й', 'Y')
INSERT INTO #T VALUES(N'К', 'K')
INSERT INTO #T VALUES(N'Л', 'L')
INSERT INTO #T VALUES(N'М', 'M')
INSERT INTO #T VALUES(N'Н', 'N')
INSERT INTO #T VALUES(N'О', 'O')
INSERT INTO #T VALUES(N'П', 'P')
INSERT INTO #T VALUES(N'Р', 'R')
INSERT INTO #T VALUES(N'С', 'S')
INSERT INTO #T VALUES(N'Т', 'T')
INSERT INTO #T VALUES(N'У', 'U')
INSERT INTO #T VALUES(N'Ф', 'F')
INSERT INTO #T VALUES(N'Х', 'KH')
INSERT INTO #T VALUES(N'Ц', 'C')
INSERT INTO #T VALUES(N'Ч', 'CH')
INSERT INTO #T VALUES(N'Ш', 'SH')
INSERT INTO #T VALUES(N'Щ', 'SHCH')
INSERT INTO #T VALUES(N'Ъ', '')
INSERT INTO #T VALUES(N'Ы', 'Y')
INSERT INTO #T VALUES(N'Ь', '''')
INSERT INTO #T VALUES(N'Э', 'E')
INSERT INTO #T VALUES(N'Ю', 'YU')
INSERT INTO #T VALUES(N'Я', 'YA')

Note that the order of INSERT statements is relevant to the transliterated result. That’s we multi-character replacements are stated first.

I chose some random words from the Russian Wikipedia as test cases:

INSERT INTO #Data (data) VALUES (N'Добро пожаловать')
INSERT INTO #Data (data) VALUES (N'Википедия')

INSERT INTO #Data (data) VALUES (N'День Конституции Российской Федерации')
INSERT INTO #Data (data) VALUES (N'География')
INSERT INTO #Data (data) VALUES (N'Америка: Канада (Квебек) — Мексика')
INSERT INTO #Data (data) VALUES (N'Философия')

Using the transliteration CTE, we get the following results:

Америка: Канада (Квебек) — Мексика Amerika: Kanada (Kvebek) — Meksika
Википедия Vikipediya
География Geografiya
День Конституции Российской Федерации Den’ Konstitucii Rossiyskoy Federacii
Добро пожаловать Dobro pozhalovat’
Философия Filosofiya

Multiple String REPLACEs with T-SQL CTEs (2)

December 13, 2009

My algorithm on replacing multiple strings in an SQL Server table works in principle, but soon after publishing the article, I noticed that it has terrible performance once you define more than just 2 or 3 text translations.

Originally I thought it would be a good idea to calculate all permutations of replacements, so that the “best” permutation of an original string would win, but as it turned it: it is not.

The core of the problem is that the LIKE operation creates a recursion to the order of n factorial! I guess that’s one of the least efficient solutions I’ve ever written.

New approach: the translation table contains a sequence identifier, and its translations are thus executed in the given order.

CREATE TABLE #T (
    level INT IDENTITY(1,1),
    old NVARCHAR(50),
    new NVARCHAR(50)
)

CREATE TABLE #Data (
	data NVARCHAR(50)
)

I named the column “level” because it is joined with the level counter of the CTE. The Identity clause causes the translations to the performed in the order of the INSERT statements.

The new CTE with ordered translations:

WITH CTE (org, calc, data, level) AS
(
    SELECT data, data, data, 1
    FROM	#Data
	
    UNION ALL
	
    SELECT CTE.org, CTE.data, 
        CONVERT(NVARCHAR(50), REPLACE(CTE.data, #T.old, #T.new)), 
        CTE.level + 1
    FROM	CTE
    INNER JOIN #T ON #T.level = CTE.level
)
SELECT DISTINCT org, data
FROM CTE
WHERE level = 
    (SELECT MAX(LEVEL) FROM CTE c WHERE CTE.org = c.org)

Multiple String REPLACEs with T-SQL CTEs

December 11, 2009

I had to solve the problem to update data in a table of NVARCHAR strings where several substrings had to be replaced with their respective counterparts, or translations.

Of course the easy way is to apply as many REPLACE functions as you need, but why not try a more generic way?

The string table in its simplest form looks like this:

CREATE TABLE #Data (
	data NVARCHAR(50)
)

INSERT INTO #Data (data) VALUES ('banana')
INSERT INTO #Data (data) VALUES ('apples')

and these are the translations:

CREATE TABLE #T (
    old	NVARCHAR(50),
    new NVARCHAR(50)
)

INSERT INTO #T VALUES('p', 'Q')
INSERT INTO #T VALUES('s', 'Z')
INSERT INTO #T VALUES('a', 'A')

I came up with the following CTE statement:

The CTE stores the original strings, their intermediate translations, and their final translations. A level counter is used to find the entry with the highest count of translations for each string, indicating the final result.

The CTE starts with the original data in all three columns, and level set to zero.

From the original data, all REPLACE results of all translatable texts are calculated. Replacing strings are compared using a case-sensitive collation:

WITH CTE (org, calc, DATA, level) AS
(
    SELECT data, data, data, 0
    FROM	#Data

    UNION ALL

    SELECT CTE.org, CTE.data,
        CONVERT(NVARCHAR(50), REPLACE(CTE.data, #T.old, #T.new)), CTE.level + 1
    FROM	CTE
    INNER JOIN #T ON CTE.data LIKE '%' + #T.old + '%'
        COLLATE Latin1_General_CS_AS
)
SELECT DISTINCT org, data, level
FROM CTE
WHERE level =
    (SELECT MAX(level) FROM CTE c WHERE CTE.org = c.org)

Finally, clean up the temp tables:

DROP TABLE #Data
DROP TABLE #T

This code returns the expected result:

org data level
apples AQQleZ 3
banana bAnAnA 1

100.000 Views

December 10, 2009

According to WordPress Blog Stats.

;)


Listing ON DELETE CASCADE Constraints in MS SQL Server

December 4, 2009

To check which tables have foreign key constraints with referential actions on other tables, use the sys.foreign_keys system view like this:

DECLARE @t NVARCHAR(128)
SET @t = 'My_Table_Name'

SELECT p.name, fk.delete_referential_action_desc, t.name
FROM sys.foreign_keys fk
INNER JOIN sys.objects t ON fk.parent_object_id = t.object_id
INNER JOIN sys.objects p ON fk.referenced_object_id = p.object_id
WHERE p.name = @t
ORDER BY 1, 3

If you want not only master-detail, but also grandparent-parent-child relations, you need a join over 3 tables and 2 foreign keys:

SELECT DISTINCT p.name, fk.delete_referential_action_desc, t.name, 
   fk2.delete_referential_action_desc, tt.name
FROM sys.foreign_keys fk
INNER JOIN sys.objects t ON fk.parent_object_id = t.object_id
INNER JOIN sys.objects p ON fk.referenced_object_id = p.object_id
INNER JOIN sys.foreign_keys fk2 ON fk2.referenced_object_id = t.object_id
INNER JOIN sys.objects tt ON fk2.parent_object_id = tt.object_id
WHERE p.name = @t
AND t.object_id <> p.object_id
ORDER BY 1, 3, 5

The filter on t.object_id <> p.object_id exludes circular relations in the first level which show up in the second level anyway.


Free Unicode Fonts

December 1, 2009

This seems to be a re-occurring topic on Stack Overflow: where can you get free fonts for Unicode characters?

There may be some misunderstandings on Unicode and fonts and Unicode fonts which cause the question, and makes the answer more difficult than you would expect:

From the Unicode website:

The Unicode Standard is a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world.

The standard defines 107,361 characters or code points as of version 5.2. Unicode also defines a set of properties for each character, and algorithms such as line breaking.

Fonts that “support Unicode” promise to the operating system or to the application that uses them that they provide an accurate graphical representation (glyph) of the code points they implement. A “Unicode font” typically does not cover each and every Unicode code point, but rather only subsets of the standard (Unicode Blocks).

Typically it is the operating system’s responsibility to find a matching font when displaying Unicode text.

So how can you find a free Unicode font supporting most of the characters? Here’s a list of links:

Wikipedia

Alphabetical scripts

CJK fonts

  • Hanazono (52,809 CJK characters, full Ext A, Ext C, partial Ext B)

Complete

Font Lists


Follow

Get every new post delivered to your Inbox.

Join 72 other followers