More Sample Databases and Free Datasets

October 12, 2011

As I prepared my previous post on SQL Server sample databases, I also came across sample databases for other SQL databases.

PgFoundry provides a couple of database samples for PostgreSql. Real-world applications such as OpenNMS also use PostgreSql.

For MySql, I found the AdventureWorks database for MySql on SourceForge. This question on SO also covers sample databases for MySql.

Of course, a lot of real-world applications use MySql as their datastore, such as Bugzilla, MediaWiki, WordPress, or Joomla.

One question on SO asked for freely available datasets, and one answer points to an amazing list of datasets available on the web.

Makes me wonder whether there exists a comprehensive list of freely available data, all tagged with hierarchical categories?


Calculating the Length of the Longest Common Subsequence in TSQL

September 16, 2010

When trying to calculate the similarity of strings, the Levenshtein Distance comes up as one way to solve this problem, as it counts the number of additions, deletions and substitutions of characters to transform one string into the other.

The inverse solution is to count the number of characters that are the same in both strings in the same order, known as the Longest Common Subsequence problem (LCS).

The code presented here uses the same technique as the Levenshtein algorithm in storing a two-dimensional array of integers as an NVARCHAR(MAX), i.e. a string of Unicode characters encoding the integers in a matrix.

First, we need a function Max2() to retrieve the maximum of two integers:

CREATE FUNCTION [dbo].[Max2](@a int, @b int)
RETURNS INT AS
BEGIN
  IF @a > @b
    RETURN @a
  RETURN @b
END

and then the T-SQL version of the algorithm:

CREATE FUNCTION [dbo].[LCS]( @s NVARCHAR(MAX), @t NVARCHAR(MAX) )
RETURNS INT AS
BEGIN
  DECLARE @d NVARCHAR(MAX), @LD INT, @m INT, @n INT, @i INT, @j INT, 
    @s_i NCHAR(1), @t_j NCHAR(1)

  SET @n = LEN(@s)
  IF @n = 0 RETURN 0

  SET @m = LEN(@t)
  IF @m = 0 RETURN 0

  SET @d = REPLICATE(NCHAR(0),(@n+1)*(@m+1))

  SET @i = 1
  WHILE @i <= @n BEGIN
    SET @s_i = SUBSTRING(@s,@i,1)

    SET @j = 1
    WHILE @j <= @m BEGIN
      SET @t_j = SUBSTRING(@t,@j,1)

      IF @s_i = @t_j

        SET @d = STUFF(@d,@j*(@n+1)+@i+1,1,
          NCHAR(UNICODE(
            SUBSTRING(@d, (@j-1)*(@n+1)+@i-1+1, 1)
            )+1))

      ELSE

        SET @d = STUFF(@d,@j*(@n+1)+@i+1,1,NCHAR(dbo.Max2(
          UNICODE(SUBSTRING(@d,@j*(@n+1)+@i-1+1,1)),
          UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i+1,1)))))

      SET @j = @j+1
    END
    SET @i = @i+1
  END      

  SET @LD = UNICODE(SUBSTRING(@d,@n*(@m+1)+@m+1,1))
  RETURN @LD
END

Searching TSQL Stored Procedures (and other Modules)

September 10, 2010

Every now and then I come across the question, “How do I search my stored procedures for a certain table or column name in MS SQL Server?”. You might have, too…

SQL Server up to version 2000 provided the view syscomments, which is still implemented in versions 2005 and higher as sys.syscomments. (All system catalog views have been moved to the sys schema in 2005)

The major drawback with syscomments is that its text column containing the stored procedure code is defined as NVARCHAR(4000), the longest possible string value in SQL Server 2000 (apart from NTEXT). So if the procedure (or function) code is longer than 4000 characters, the code is sliced into several 4000 character records. And as it happens, the string you are searching for may end up right at the boundary of two records, complicating the search algorithm.

SQL Server 2005 introduced the sys.sql_modules view which has a Definition NVARCHAR(MAX) column holding code, thus resolving this issue of split code. (check the remarks if you want to search DEFAULT and CHECK expressions, too)

So now that you happily accepted that search sql_modules is better than searching syscomments, here is the code to perform the search in TSQL:

DECLARE @Search NVARCHAR(MAX)

SET @Search = 'MyOldColumnName'

SELECT sm.object_id, OBJECT_NAME(sm.object_id) AS object_name, -- o.type,
       o.type_desc   -- , sm.definition
FROM sys.sql_modules AS sm
JOIN sys.objects AS o ON sm.object_id = o.object_id
WHERE sm.definition LIKE N'%' + @Search + N'%'
ORDER BY o.type, OBJECT_NAME(sm.object_id);

As you see, the search is executed using the LIKE operation. But what if you want to search for wildcards as literals, such as _ and %, or the regex characters [, ], -  (because, for example, your database object names may contain underscores) ?

Simply escape the characters using the [] notation:

SET @Search = REPLACE(@Search, '[', '[[]')
SET @Search = REPLACE(@Search, '%', '[%]')
SET @Search = REPLACE(@Search, '_', '[_]')

and run the query after the REPLACE operations.

But what if you want to search for whole words only? Use PATINDEX using a regex which excludes alphabetical characters and underscores before and after the search string:

SELECT sm.object_id, OBJECT_NAME(sm.object_id) AS object_name, o.type, 
    o.type_desc, sm.definition
FROM sys.sql_modules AS sm
JOIN sys.objects AS o ON sm.object_id = o.object_id
WHERE PATINDEX(N'%[^a-zA-Z_]' + @Search + N'[^a-zA-Z_]%', sm.definition) > 0
ORDER BY o.type, OBJECT_NAME(sm.object_id);

TSQL String Functions

September 7, 2010

While searching for appropriate string comparison functions in TSQL, I came across these pages implementing a couple of

string functions: Capitalize, Center, Count Substring in String, EndsWith, ExpandTabs, IsAlnum, IsAlpha,
IsDigit, IsLower, IsTitle, IsSpace, LJust, LStrip, RFind, RJust, RStrip, Strip, SwapCase, Title, Zfill

and string tuple functions: Split, SplitLines, Within, EndsWith, StartsWith, Contains, Join, Parts, Partition, RPartition

See here for more TSQL articles by the authors.


Calculating Levenshtein Distance in TSQL

September 7, 2010

I tried to find a method to compare strings according to their similarity, and first came across the Levenshtein distance which defines the distance (or degree of similarity) between two strings as the minimum number of additions, deletions, and substitutions of single characters needed to transform one string into the other.

I found this implementation of the Levenshtein algorithm in T-SQL, but noted a couple of errors:

First, the function returns a VARCHAR result, where you would expect an INT.

Next, due to the restriction of parameters and variables to VARCHAR(50) and VARCHAR(100), only strings with a limited number of characters could be compared. (The code may have been written before the introduction of VARCHAR(MAX)).

Furthermore, the distance matrix is stored in an array of CHAR, which only allows for a maximum difference of 255 characters to be handled correctly.

The Levenshtein algorithm requires a function to find the minimum of 3 integers:

create function [dbo].[min3](@a int, @b int, @c int)
returns int as
begin
  declare @min int
  set @min = @a
  if @b < @min set @min = @b
  if @c < @min set @min = @c
  return @min
end

And this is the code:

CREATE FUNCTION [dbo].[LEVENSHTEIN]( @s NVARCHAR(MAX), @t NVARCHAR(MAX) )
/*
Levenshtein Distance Algorithm: TSQL Implementation
by Joseph Gama

http://www.merriampark.com/ldtsql.htm

Returns the Levenshtein Distance between strings s1 and s2.
Original developer: Michael Gilleland http://www.merriampark.com/ld.htm
Translated to TSQL by Joseph Gama

Fixed by Herbert Oppolzer / devio
as described in http://devio.wordpress.com/2010/09/07/calculating-levenshtein-distance-in-tsql
*/
RETURNS INT AS
BEGIN
  DECLARE @d NVARCHAR(MAX), @LD INT, @m INT, @n INT, @i INT, @j INT,
    @s_i NCHAR(1), @t_j NCHAR(1),@cost INT

  --Step 1
  SET @n = LEN(@s)
  SET @m = LEN(@t)
  SET @d = REPLICATE(NCHAR(0),(@n+1)*(@m+1))
  IF @n = 0
  BEGIN
    SET @LD = @m
   GOTO done
  END
  IF @m = 0
  BEGIN
    SET @LD = @n
    GOTO done
  END

  --Step 2
  SET @i = 0
  WHILE @i <= @n BEGIN
    SET @d = STUFF(@d,@i+1,1,NCHAR(@i))        --d(i, 0) = i
    SET @i = @i+1
  END

  SET @i = 0
  WHILE @i <= @m BEGIN
    SET @d = STUFF(@d,@i*(@n+1)+1,1,NCHAR(@i))    --d(0, j) = j
    SET @i = @i+1
  END

  --Step 3
  SET @i = 1
  WHILE @i <= @n BEGIN
    SET @s_i = SUBSTRING(@s,@i,1)

    --Step 4
    SET @j = 1
    WHILE @j <= @m BEGIN
      SET @t_j = SUBSTRING(@t,@j,1)
      --Step 5
      IF @s_i = @t_j
        SET @cost = 0
      ELSE
        SET @cost = 1
      --Step 6
      SET @d = STUFF(@d,@j*(@n+1)+@i+1,1,
        NCHAR(dbo.MIN3(
          UNICODE(SUBSTRING(@d,@j*(@n+1)+@i-1+1,1))+1,
          UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i+1,1))+1,
          UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i-1+1,1))+@cost)
        ))
      SET @j = @j+1
    END
    SET @i = @i+1
  END      

  --Step 7
  SET @LD = UNICODE(SUBSTRING(@d,@n*(@m+1)+@m+1,1))

done:
  RETURN @LD
END

Verifying Foreign Keys in MS SQL Server

February 18, 2010

During a recent data check we noticed that foreign key columns contained values which were not stored in the referenced tables. The foreign key constraints were there, but they had been deactivated with the WITH NOCHECK clause.

This blog post was quite helpful to understand disabled and trusted foreign key constraints.

Using this information, I wrote the following script to verify foreign key constraints

DECLARE @t sysname, @fk sysname, @s NVARCHAR(1000)

DECLARE c CURSOR FOR
SELECT t.name, fk.name
FROM sys.foreign_keys fk
INNER JOIN sys.objects t ON t.object_id = fk.parent_object_id
ORDER BY 1, 2

OPEN c 
FETCH c INTO @t, @fk

WHILE @@FETCH_STATUS = 0 BEGIN
	
	SET @s = 'ALTER TABLE ' + @t + 
		' WITH CHECK CHECK CONSTRAINT ' + @fk
	PRINT @s
	
	BEGIN TRY
	EXEC (@s)
	END TRY
	BEGIN CATCH
		PRINT 'Conflict in ' + @t + N'.' + @fk
		DBCC CHECKCONSTRAINTS (@t)
	END CATCH
	
	FETCH c INTO @t, @fk
END

CLOSE c
DEALLOCATE c

SELECT t.name, fk.name, fk.is_disabled, fk.is_not_trusted
FROM sys.foreign_keys fk
INNER JOIN sys.objects t ON t.object_id = fk.parent_object_id
WHERE fk.is_not_trusted = 1
ORDER BY 1, 2

The script tries to enable each foreign key constraint it finds in the database using the WITH CHECK CHECK CONSTRAINT clause.

If the check fails, it calls DBCC CHECKCONSTRAINTS which returns the set of recordings violating the foreign key constraint.

Finally, the list of all foreign keys is selected which are set to “not trusted” (see linked blog above).

The result of the script execution shows you which tables and records need to be cleaned up.

Run the script repeatedly after data cleanup until a single empty result set is returned.


Listing ON DELETE CASCADE Constraints in MS SQL Server 2000

January 23, 2010

I needed to run my checks on ON DELETE CASCADE Constraints in MS SQL Server on SQL Server 2000, and needed to adjust the system table names and columns accordingly.

These are the queries adjusted for SQL Server 2000:

DECLARE @t NVARCHAR(128)
SET @t = 'MyTable'

SELECT p.name,
    objectproperty(fk.constid, 'CnstIsDeleteCascade') OnDelete,
    t.name
FROM dbo.sysforeignkeys fk
INNER JOIN dbo.sysobjects t ON fk.fkeyid = t.id
INNER JOIN dbo.sysobjects p ON fk.rkeyid = p.id
WHERE p.name = @t
ORDER BY 1, 3

SELECT DISTINCT p.name,
    objectproperty(fk.constid, 'CnstIsDeleteCascade') OnDelete,
    t.name,
    objectproperty(fk2.constid, 'CnstIsDeleteCascade') OnDelete,
    tt.name
FROM sysforeignkeys fk
INNER JOIN sysobjects t ON fk.fkeyid = t.id
INNER JOIN sysobjects p ON fk.rkeyid = p.id
INNER JOIN sysforeignkeys fk2 ON fk2.rkeyid = t.id
INNER JOIN sysobjects tt ON fk2.fkeyid = tt.id
WHERE p.name = @t
AND t.id <> p.id
ORDER BY 1, 3, 5

Splitting long text columns into multiple NVARCHAR records

December 22, 2009

Management Studio truncates strings displayed in the result grid which keeps you from viewing the whole string stored in a text column.

A question on Stack Overflow made me create this SELECT statement to split long strings into several records:

SELECT texts.id, 
       SUBSTRING(texts.text_column, number*100+1, 100)
FROM
    (SELECT texts.id, texts.text_column, 
        (99+LEN(texts.text_column))/100 AS l
     FROM texts) AS texts
OUTER APPLY
    (SELECT TOP(l) number
     FROM master.dbo.spt_values val
     WHERE name IS NULL
     ORDER BY number) n

In this example, the resulting strings are 100 characters long. The built-in table master.dbo.spt_values contains integer values from 0 to 2047.


Transliterating Cyrillic into Latin alphabet with T-SQL

December 13, 2009

After sketching my CTE-based multi-REPLACE procedure, I wanted to test in on a real-life scenario. First example is a translation table to convert the cyrillic alphabet into Latin.

First we need to define the data and the translation tables using a case-sensitive cyrillic collation with COLLATE Cyrillic_General_CS_AS:

CREATE TABLE #T (
    level int identity(1,1),
    old    NVARCHAR(50) COLLATE Cyrillic_General_CS_AS,
    new    NVARCHAR(50) COLLATE Cyrillic_General_CS_AS
)

CREATE TABLE #Data (
    data    NVARCHAR(50) COLLATE Cyrillic_General_CS_AS
)

Next, we define the transliterations. See here for a list of standardized transliterations.

INSERT INTO #T VALUES(N'ый', 'y')

INSERT INTO #T VALUES(N'ЫЙ', 'Y')

INSERT INTO #T VALUES(N'а', 'a')
INSERT INTO #T VALUES(N'б', 'b')
INSERT INTO #T VALUES(N'в', 'v')
INSERT INTO #T VALUES(N'г', 'g')
INSERT INTO #T VALUES(N'д', 'd')
INSERT INTO #T VALUES(N'е', 'e')
INSERT INTO #T VALUES(N'ё', 'yo')
INSERT INTO #T VALUES(N'ж', 'zh')
INSERT INTO #T VALUES(N'з', 'z')
INSERT INTO #T VALUES(N'и', 'i')
INSERT INTO #T VALUES(N'й', 'y')
INSERT INTO #T VALUES(N'к', 'k')
INSERT INTO #T VALUES(N'л', 'l')
INSERT INTO #T VALUES(N'м', 'm')
INSERT INTO #T VALUES(N'н', 'n')
INSERT INTO #T VALUES(N'о', 'o')
INSERT INTO #T VALUES(N'п', 'p')
INSERT INTO #T VALUES(N'р', 'r')
INSERT INTO #T VALUES(N'с', 's')
INSERT INTO #T VALUES(N'т', 't')
INSERT INTO #T VALUES(N'у', 'u')
INSERT INTO #T VALUES(N'ф', 'f')
INSERT INTO #T VALUES(N'х', 'kh')
INSERT INTO #T VALUES(N'ц', 'c')
INSERT INTO #T VALUES(N'ч', 'ch')
INSERT INTO #T VALUES(N'ш', 'sh')
INSERT INTO #T VALUES(N'щ', 'shch')
INSERT INTO #T VALUES(N'ъ', '')
INSERT INTO #T VALUES(N'ы', 'y')
INSERT INTO #T VALUES(N'ь', '''')
INSERT INTO #T VALUES(N'э', 'e')
INSERT INTO #T VALUES(N'ю', 'yu')
INSERT INTO #T VALUES(N'я', 'ya')

INSERT INTO #T VALUES(N'А', 'A')
INSERT INTO #T VALUES(N'Б', 'B')
INSERT INTO #T VALUES(N'В', 'V')
INSERT INTO #T VALUES(N'Г', 'G')
INSERT INTO #T VALUES(N'Д', 'D')
INSERT INTO #T VALUES(N'Е', 'E')
INSERT INTO #T VALUES(N'Ё', 'YO')
INSERT INTO #T VALUES(N'Ж', 'ZH')
INSERT INTO #T VALUES(N'З', 'Z')
INSERT INTO #T VALUES(N'И', 'I')
INSERT INTO #T VALUES(N'Й', 'Y')
INSERT INTO #T VALUES(N'К', 'K')
INSERT INTO #T VALUES(N'Л', 'L')
INSERT INTO #T VALUES(N'М', 'M')
INSERT INTO #T VALUES(N'Н', 'N')
INSERT INTO #T VALUES(N'О', 'O')
INSERT INTO #T VALUES(N'П', 'P')
INSERT INTO #T VALUES(N'Р', 'R')
INSERT INTO #T VALUES(N'С', 'S')
INSERT INTO #T VALUES(N'Т', 'T')
INSERT INTO #T VALUES(N'У', 'U')
INSERT INTO #T VALUES(N'Ф', 'F')
INSERT INTO #T VALUES(N'Х', 'KH')
INSERT INTO #T VALUES(N'Ц', 'C')
INSERT INTO #T VALUES(N'Ч', 'CH')
INSERT INTO #T VALUES(N'Ш', 'SH')
INSERT INTO #T VALUES(N'Щ', 'SHCH')
INSERT INTO #T VALUES(N'Ъ', '')
INSERT INTO #T VALUES(N'Ы', 'Y')
INSERT INTO #T VALUES(N'Ь', '''')
INSERT INTO #T VALUES(N'Э', 'E')
INSERT INTO #T VALUES(N'Ю', 'YU')
INSERT INTO #T VALUES(N'Я', 'YA')

Note that the order of INSERT statements is relevant to the transliterated result. That’s we multi-character replacements are stated first.

I chose some random words from the Russian Wikipedia as test cases:

INSERT INTO #Data (data) VALUES (N'Добро пожаловать')
INSERT INTO #Data (data) VALUES (N'Википедия')

INSERT INTO #Data (data) VALUES (N'День Конституции Российской Федерации')
INSERT INTO #Data (data) VALUES (N'География')
INSERT INTO #Data (data) VALUES (N'Америка: Канада (Квебек) — Мексика')
INSERT INTO #Data (data) VALUES (N'Философия')

Using the transliteration CTE, we get the following results:

Америка: Канада (Квебек) — Мексика Amerika: Kanada (Kvebek) — Meksika
Википедия Vikipediya
География Geografiya
День Конституции Российской Федерации Den’ Konstitucii Rossiyskoy Federacii
Добро пожаловать Dobro pozhalovat’
Философия Filosofiya

Multiple String REPLACEs with T-SQL CTEs (2)

December 13, 2009

My algorithm on replacing multiple strings in an SQL Server table works in principle, but soon after publishing the article, I noticed that it has terrible performance once you define more than just 2 or 3 text translations.

Originally I thought it would be a good idea to calculate all permutations of replacements, so that the “best” permutation of an original string would win, but as it turned it: it is not.

The core of the problem is that the LIKE operation creates a recursion to the order of n factorial! I guess that’s one of the least efficient solutions I’ve ever written.

New approach: the translation table contains a sequence identifier, and its translations are thus executed in the given order.

CREATE TABLE #T (
    level INT IDENTITY(1,1),
    old NVARCHAR(50),
    new NVARCHAR(50)
)

CREATE TABLE #Data (
	data NVARCHAR(50)
)

I named the column “level” because it is joined with the level counter of the CTE. The Identity clause causes the translations to the performed in the order of the INSERT statements.

The new CTE with ordered translations:

WITH CTE (org, calc, data, level) AS
(
    SELECT data, data, data, 1
    FROM	#Data
	
    UNION ALL
	
    SELECT CTE.org, CTE.data, 
        CONVERT(NVARCHAR(50), REPLACE(CTE.data, #T.old, #T.new)), 
        CTE.level + 1
    FROM	CTE
    INNER JOIN #T ON #T.level = CTE.level
)
SELECT DISTINCT org, data
FROM CTE
WHERE level = 
    (SELECT MAX(LEVEL) FROM CTE c WHERE CTE.org = c.org)

Follow

Get every new post delivered to your Inbox.