Calculating Levenshtein Distance in TSQL

I tried to find a method to compare strings according to their similarity, and first came across the Levenshtein distance which defines the distance (or degree of similarity) between two strings as the minimum number of additions, deletions, and substitutions of single characters needed to transform one string into the other.

I found this implementation of the Levenshtein algorithm in T-SQL, but noted a couple of errors:

First, the function returns a VARCHAR result, where you would expect an INT.

Next, due to the restriction of parameters and variables to VARCHAR(50) and VARCHAR(100), only strings with a limited number of characters could be compared. (The code may have been written before the introduction of VARCHAR(MAX)).

Furthermore, the distance matrix is stored in an array of CHAR, which only allows for a maximum difference of 255 characters to be handled correctly.

The Levenshtein algorithm requires a function to find the minimum of 3 integers:

create function [dbo].[min3](@a int, @b int, @c int)
returns int as
begin
  declare @min int
  set @min = @a
  if @b < @min set @min = @b
  if @c < @min set @min = @c
  return @min
end

And this is the code:

CREATE FUNCTION [dbo].[LEVENSHTEIN]( @s NVARCHAR(MAX), @t NVARCHAR(MAX) )
/*
Levenshtein Distance Algorithm: TSQL Implementation
by Joseph Gama
http://www.merriampark.com/ldtsql.htm

Returns the Levenshtein Distance between strings s1 and s2.
Original developer: Michael Gilleland http://www.merriampark.com/ld.htm
Translated to TSQL by Joseph Gama

Fixed by Herbert Oppolzer / devio
as described in https://devio.wordpress.com/2010/09/07/calculating-levenshtein-distance-in-tsql
*/
RETURNS INT AS
BEGIN
  DECLARE @d NVARCHAR(MAX), @LD INT, @m INT, @n INT, @i INT, @j INT,
    @s_i NCHAR(1), @t_j NCHAR(1),@cost INT

  --Step 1
  SET @n = LEN(@s)
  SET @m = LEN(@t)
  SET @d = REPLICATE(NCHAR(0),(@n+1)*(@m+1))
  IF @n = 0
  BEGIN
    SET @LD = @m
   GOTO done
  END
  IF @m = 0
  BEGIN
    SET @LD = @n
    GOTO done
  END

  --Step 2
  SET @i = 0
  WHILE @i <= @n BEGIN
    SET @d = STUFF(@d,@i+1,1,NCHAR(@i))        --d(i, 0) = i
    SET @i = @i+1
  END

  SET @i = 0
  WHILE @i <= @m BEGIN
    SET @d = STUFF(@d,@i*(@n+1)+1,1,NCHAR(@i))    --d(0, j) = j
    SET @i = @i+1
  END

  --Step 3
  SET @i = 1
  WHILE @i <= @n BEGIN
    SET @s_i = SUBSTRING(@s,@i,1)

    --Step 4
    SET @j = 1
    WHILE @j <= @m BEGIN
      SET @t_j = SUBSTRING(@t,@j,1)
      --Step 5
      IF @s_i = @t_j
        SET @cost = 0
      ELSE
        SET @cost = 1
      --Step 6
      SET @d = STUFF(@d,@j*(@n+1)+@i+1,1,
        NCHAR(dbo.MIN3(
          UNICODE(SUBSTRING(@d,@j*(@n+1)+@i-1+1,1))+1,
          UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i+1,1))+1,
          UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i-1+1,1))+@cost)
        ))
      SET @j = @j+1
    END
    SET @i = @i+1
  END      

  --Step 7
  SET @LD = UNICODE(SUBSTRING(@d,@n*(@m+1)+@m+1,1))

done:
  RETURN @LD
END

4 thoughts on “Calculating Levenshtein Distance in TSQL

  1. Pingback: TSQL String Functions « devioblog

  2. Pingback: Calculating the Length of the Longest Common Subsequence in TSQL « devioblog

  3. Pingback: string distance dengan levenshtein di SQL SERVER | new BlogException();

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.