Transliterating Kana into Latin alphabet using T-SQL

One more example of replacing multiple strings using a CTE. After Russian / Cyrillic, it’s time for the Japanese syllabary.

SQL Server provides a couple of Japanese collations of which I found Japanese_Unicode_CI_AS useful: It is insensitive to the distinction between hiragana and katakana, but accent-sensitive to nigori and handakuten.

Let’s start with the table definitions:

CREATE TABLE #T (
    level int identity(1,1),
    old    NVARCHAR(50) COLLATE Japanese_Unicode_CI_AS,
    new    NVARCHAR(50) COLLATE Japanese_Unicode_CI_AS
)

CREATE TABLE #Data (
    data    NVARCHAR(50) COLLATE Japanese_Unicode_CI_AS
)

and the data to be translated, taken randomly from the Japanese Wikipedia:

INSERT INTO #Data (data) VALUES (N'ひらがな')

INSERT INTO #Data (data) VALUES (N'ヒラガナ')
INSERT INTO #Data (data) VALUES (N'とうきょう')
INSERT INTO #Data (data) VALUES (N'せんせい')
INSERT INTO #Data (data) VALUES (N'ウィキペディア')

INSERT INTO #Data (data) VALUES (N'ウィキポータル')
INSERT INTO #Data (data) VALUES (N'ヨーロッパ')
INSERT INTO #Data (data) VALUES (N'フランス')
INSERT INTO #Data (data) VALUES (N'イタリア')
INSERT INTO #Data (data) VALUES (N'ブルガリア')
INSERT INTO #Data (data) VALUES (N'アメリカ')
INSERT INTO #Data (data) VALUES (N'アジア')

The CTE as described in the previous posts:

;
WITH CTE (org, calc, data, level) AS
(
    SELECT data, data, data, 1
    FROM	#Data

    UNION ALL

    SELECT CTE.org, CTE.data, CONVERT(NVARCHAR(50),
        REPLACE(CTE.data, #T.old, #T.new)), CTE.level + 1
    FROM	CTE
    INNER JOIN #T ON #T.level = CTE.level
)
SELECT DISTINCT org, data
FROM CTE
WHERE level = (SELECT MAX(LEVEL) FROM CTE c WHERE CTE.org = c.org)
OPTION (MAXRECURSION 1000)

DROP TABLE #Data
DROP TABLE #T

Since the transliteration table has more than 100 records (the default maximum recursion level is 100), the option (MAXRECURSION 1000) has to be added.

This results in the following transliteration of some Japanese kana words:

アジア ajia
アメリカ amerika
イタリア itaria
ウィキペディア wikipedia
ウィキポータル wikipootaru
さん san
せんせい sensei
とうきょう toukyou
ヒラガナ hiragana
フランス furansu
ブルガリア burugaria
ヨーロッパ yooroppa

This is the contents of the transliteration table:

We start with 2-character transliterations

INSERT INTO #T VALUES(N'きゃ', 'kya')
INSERT INTO #T VALUES(N'きゅ', 'kyu')
INSERT INTO #T VALUES(N'きょ', 'kyo')
INSERT INTO #T VALUES(N'しゃ', 'sha')
INSERT INTO #T VALUES(N'しゅ', 'shu')
INSERT INTO #T VALUES(N'しょ', 'sho')
INSERT INTO #T VALUES(N'ちゃ', 'cha')
INSERT INTO #T VALUES(N'ちゅ', 'chu')
INSERT INTO #T VALUES(N'ちょ', 'cho')

INSERT INTO #T VALUES(N'てぃ', 'ti')
INSERT INTO #T VALUES(N'デぃ', 'di')

INSERT INTO #T VALUES(N'にゃ', 'nya')
INSERT INTO #T VALUES(N'にゅ', 'nyu')
INSERT INTO #T VALUES(N'にょ', 'nyo')
INSERT INTO #T VALUES(N'ひゃ', 'hya')
INSERT INTO #T VALUES(N'ひゅ', 'hyu')
INSERT INTO #T VALUES(N'ひょ', 'hyo')
INSERT INTO #T VALUES(N'みゃ', 'mya')
INSERT INTO #T VALUES(N'みゅ', 'myu')
INSERT INTO #T VALUES(N'みょ', 'myo')
INSERT INTO #T VALUES(N'りゃ', 'rya')
INSERT INTO #T VALUES(N'りゅ', 'ryu')
INSERT INTO #T VALUES(N'りょ', 'ryo')

INSERT INTO #T VALUES(N'ぎゃ', 'gya')
INSERT INTO #T VALUES(N'ぎゅ', 'gyu')
INSERT INTO #T VALUES(N'ぎょ', 'gyo')
INSERT INTO #T VALUES(N'じゃ', 'ja')
INSERT INTO #T VALUES(N'じゅ', 'ju')
INSERT INTO #T VALUES(N'じょ', 'jo')
INSERT INTO #T VALUES(N'ぢゃ', 'dja')
INSERT INTO #T VALUES(N'ぢゅ', 'dju')
INSERT INTO #T VALUES(N'ぢょ', 'djo')

INSERT INTO #T VALUES(N'びゃ', 'bya')
INSERT INTO #T VALUES(N'びゅ', 'byu')
INSERT INTO #T VALUES(N'びょ', 'byo')

INSERT INTO #T VALUES(N'ぴゃ', 'pya')
INSERT INTO #T VALUES(N'ぴゅ', 'pyu')
INSERT INTO #T VALUES(N'ぴょ', 'pyo')

INSERT INTO #T VALUES(N'ウィ', 'wi')

Now the regular transliteration of the syllables:

INSERT INTO #T VALUES(N'あ', 'a')
INSERT INTO #T VALUES(N'い', 'i')
INSERT INTO #T VALUES(N'う', 'u')
INSERT INTO #T VALUES(N'え', 'e')
INSERT INTO #T VALUES(N'お', 'o')
INSERT INTO #T VALUES(N'か', 'ka')
INSERT INTO #T VALUES(N'き', 'ki')
INSERT INTO #T VALUES(N'く', 'ku')
INSERT INTO #T VALUES(N'け', 'ke')
INSERT INTO #T VALUES(N'こ', 'ki')
INSERT INTO #T VALUES(N'さ', 'sa')
INSERT INTO #T VALUES(N'し', 'shi')
INSERT INTO #T VALUES(N'す', 'su')
INSERT INTO #T VALUES(N'せ', 'se')
INSERT INTO #T VALUES(N'そ', 'so')
INSERT INTO #T VALUES(N'た', 'ta')
INSERT INTO #T VALUES(N'ち', 'chi')
INSERT INTO #T VALUES(N'つ', 'tsu')
INSERT INTO #T VALUES(N'て', 'te')
INSERT INTO #T VALUES(N'と', 'to')
INSERT INTO #T VALUES(N'な', 'na')
INSERT INTO #T VALUES(N'に', 'ni')
INSERT INTO #T VALUES(N'ぬ', 'nu')
INSERT INTO #T VALUES(N'ね', 'ne')
INSERT INTO #T VALUES(N'の', 'no')
INSERT INTO #T VALUES(N'は', 'ha')
INSERT INTO #T VALUES(N'ひ', 'hi')
INSERT INTO #T VALUES(N'ふ', 'fu')
INSERT INTO #T VALUES(N'へ', 'he')
INSERT INTO #T VALUES(N'ほ', 'ho')

INSERT INTO #T VALUES(N'ま', 'ma')
INSERT INTO #T VALUES(N'み', 'mi')
INSERT INTO #T VALUES(N'む', 'mu')
INSERT INTO #T VALUES(N'め', 'me')
INSERT INTO #T VALUES(N'も', 'mo')

INSERT INTO #T VALUES(N'や', 'ya')
INSERT INTO #T VALUES(N'ゆ', 'yu')
INSERT INTO #T VALUES(N'よ', 'yo')

INSERT INTO #T VALUES(N'ら', 'ra')
INSERT INTO #T VALUES(N'り', 'ri')
INSERT INTO #T VALUES(N'る', 'ru')
INSERT INTO #T VALUES(N'れ', 're')
INSERT INTO #T VALUES(N'ろ', 'ro')

INSERT INTO #T VALUES(N'わ', 'wa')
INSERT INTO #T VALUES(N'ゐ', 'wi')
INSERT INTO #T VALUES(N'ゑ', 'we')
INSERT INTO #T VALUES(N'を', 'wo')

INSERT INTO #T VALUES(N'ん', 'n')

INSERT INTO #T VALUES(N'が', 'ga')
INSERT INTO #T VALUES(N'ぎ', 'gi')
INSERT INTO #T VALUES(N'ぐ', 'gu')
INSERT INTO #T VALUES(N'げ', 'ge')
INSERT INTO #T VALUES(N'ご', 'go')

INSERT INTO #T VALUES(N'ざ', 'za')
INSERT INTO #T VALUES(N'じ', 'ji')
INSERT INTO #T VALUES(N'ず', 'zu')
INSERT INTO #T VALUES(N'ぜ', 'ze')
INSERT INTO #T VALUES(N'ぞ', 'zo')
INSERT INTO #T VALUES(N'だ', 'da')
INSERT INTO #T VALUES(N'ぢ', 'dji')
INSERT INTO #T VALUES(N'づ', 'dzu')
INSERT INTO #T VALUES(N'で', 'de')
INSERT INTO #T VALUES(N'ど', 'do')

INSERT INTO #T VALUES(N'ば', 'ba')
INSERT INTO #T VALUES(N'び', 'bi')
INSERT INTO #T VALUES(N'ぶ', 'bu')
INSERT INTO #T VALUES(N'べ', 'be')
INSERT INTO #T VALUES(N'ぼ', 'bo')

INSERT INTO #T VALUES(N'ぱ', 'pa')
INSERT INTO #T VALUES(N'ぴ', 'pi')
INSERT INTO #T VALUES(N'ぷ', 'pu')
INSERT INTO #T VALUES(N'ぺ', 'pe')
INSERT INTO #T VALUES(N'ぽ', 'po')

And finally we add support for small tsu and katakana vowel lengthening:

INSERT INTO #T VALUES(N'っk', 'kk')
INSERT INTO #T VALUES(N'っp', 'pp')
INSERT INTO #T VALUES(N'っs', 'ss')
INSERT INTO #T VALUES(N'っt', 'tt')
INSERT INTO #T VALUES(N'っc', 'tc')

INSERT INTO #T VALUES(N'aー', 'aa')
INSERT INTO #T VALUES(N'eー', 'ee')
INSERT INTO #T VALUES(N'iー', 'ii')
INSERT INTO #T VALUES(N'oー', 'oo')
INSERT INTO #T VALUES(N'uー', 'uu')

One Response to Transliterating Kana into Latin alphabet using T-SQL

  1. Bensn says:

    The same here! Thank you!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: