When trying to calculate the similarity of strings, the Levenshtein Distance comes up as one way to solve this problem, as it counts the number of additions, deletions and substitutions of characters to transform one string into the other.
The inverse solution is to count the number of characters that are the same in both strings in the same order, known as the Longest Common Subsequence problem (LCS).
The code presented here uses the same technique as the Levenshtein algorithm in storing a two-dimensional array of integers as an NVARCHAR(MAX), i.e. a string of Unicode characters encoding the integers in a matrix.
First, we need a function Max2() to retrieve the maximum of two integers:
CREATE FUNCTION [dbo].[Max2](@a int, @b int) RETURNS INT AS BEGIN IF @a > @b RETURN @a RETURN @b END
and then the T-SQL version of the algorithm:
CREATE FUNCTION [dbo].[LCS]( @s NVARCHAR(MAX), @t NVARCHAR(MAX) ) RETURNS INT AS BEGIN DECLARE @d NVARCHAR(MAX), @LD INT, @m INT, @n INT, @i INT, @j INT, @s_i NCHAR(1), @t_j NCHAR(1) SET @n = LEN(@s) IF @n = 0 RETURN 0 SET @m = LEN(@t) IF @m = 0 RETURN 0 SET @d = REPLICATE(NCHAR(0),(@n+1)*(@m+1)) SET @i = 1 WHILE @i <= @n BEGIN SET @s_i = SUBSTRING(@s,@i,1) SET @j = 1 WHILE @j <= @m BEGIN SET @t_j = SUBSTRING(@t,@j,1) IF @s_i = @t_j SET @d = STUFF(@d,@j*(@n+1)+@i+1,1, NCHAR(UNICODE( SUBSTRING(@d, (@j-1)*(@n+1)+@i-1+1, 1) )+1)) ELSE SET @d = STUFF(@d,@j*(@n+1)+@i+1,1,NCHAR(dbo.Max2( UNICODE(SUBSTRING(@d,@j*(@n+1)+@i-1+1,1)), UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i+1,1))))) SET @j = @j+1 END SET @i = @i+1 END SET @LD = UNICODE(SUBSTRING(@d,@n*(@m+1)+@m+1,1)) RETURN @LD END