Calculating the Length of the Longest Common Subsequence in TSQL

When trying to calculate the similarity of strings, the Levenshtein Distance comes up as one way to solve this problem, as it counts the number of additions, deletions and substitutions of characters to transform one string into the other.

The inverse solution is to count the number of characters that are the same in both strings in the same order, known as the Longest Common Subsequence problem (LCS).

The code presented here uses the same technique as the Levenshtein algorithm in storing a two-dimensional array of integers as an NVARCHAR(MAX), i.e. a string of Unicode characters encoding the integers in a matrix.

First, we need a function Max2() to retrieve the maximum of two integers:

CREATE FUNCTION [dbo].[Max2](@a int, @b int)
RETURNS INT AS
BEGIN
IF @a > @b
RETURN @a
RETURN @b
END

and then the T-SQL version of the algorithm:

CREATE FUNCTION [dbo].[LCS]( @s NVARCHAR(MAX), @t NVARCHAR(MAX) )
RETURNS INT AS
BEGIN
DECLARE @d NVARCHAR(MAX), @LD INT, @m INT, @n INT, @i INT, @j INT,
@s_i NCHAR(1), @t_j NCHAR(1)
SET @n = LEN(@s)
IF @n = 0 RETURN 0
SET @m = LEN(@t)
IF @m = 0 RETURN 0
SET @d = REPLICATE(NCHAR(0),(@n+1)*(@m+1))
SET @i = 1
WHILE @i <= @n BEGIN
SET @s_i = SUBSTRING(@s,@i,1)
SET @j = 1
WHILE @j <= @m BEGIN
SET @t_j = SUBSTRING(@t,@j,1)
IF @s_i = @t_j
SET @d = STUFF(@d,@j*(@n+1)+@i+1,1,
NCHAR(UNICODE(
SUBSTRING(@d, (@j-1)*(@n+1)+@i-1+1, 1)
)+1))
ELSE
SET @d = STUFF(@d,@j*(@n+1)+@i+1,1,NCHAR(dbo.Max2(
UNICODE(SUBSTRING(@d,@j*(@n+1)+@i-1+1,1)),
UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i+1,1)))))
SET @j = @j+1
END
SET @i = @i+1
END
SET @LD = UNICODE(SUBSTRING(@d,@n*(@m+1)+@m+1,1))
RETURN @LD
END

This entry was posted on Thursday, September 16th, 2010 at 16:03 and is filed under Algorithms, SQL, SQL Server. You can follow any responses to this entry through the RSS 2.0 feed.
You can leave a response, or trackback from your own site.