My algorithm on replacing multiple strings in an SQL Server table works in principle, but soon after publishing the article, I noticed that it has terrible performance once you define more than just 2 or 3 text translations.
Originally I thought it would be a good idea to calculate all permutations of replacements, so that the “best” permutation of an original string would win, but as it turned it: it is not.
The core of the problem is that the LIKE operation creates a recursion to the order of n factorial! I guess that’s one of the least efficient solutions I’ve ever written.
New approach: the translation table contains a sequence identifier, and its translations are thus executed in the given order.
CREATE TABLE #T ( level INT IDENTITY(1,1), old NVARCHAR(50), new NVARCHAR(50) ) CREATE TABLE #Data ( data NVARCHAR(50) )
I named the column “level” because it is joined with the level counter of the CTE. The Identity clause causes the translations to the performed in the order of the INSERT statements.
The new CTE with ordered translations:
WITH CTE (org, calc, data, level) AS ( SELECT data, data, data, 1 FROM #Data UNION ALL SELECT CTE.org, CTE.data, CONVERT(NVARCHAR(50), REPLACE(CTE.data, #T.old, #T.new)), CTE.level + 1 FROM CTE INNER JOIN #T ON #T.level = CTE.level ) SELECT DISTINCT org, data FROM CTE WHERE level = (SELECT MAX(LEVEL) FROM CTE c WHERE CTE.org = c.org)
Pingback: Transliterating Cyrillic into Latin alphabet with T-SQL « devioblog
Pingback: Transliterating Kana into Latin alphabet using T-SQL « devioblog