Chinese (or Japanese) characters have fascinated me ever since I first learned about them in the early ’90s, and I immediately started some small programming projects on the topic, among them a Kanji flash card application, one of my first Windows (3.1) programs.
Every now and again, I visited the websites of Jim Breen and Unicode, downloaded fonts, built a vocabulary trainer, and so on. One of the latest activities was an analysis of the Unicode Han Database.
There are a number of CJK dictionaries on the web, and my main objection to most of them is that you not only need to specify what you are looking for, but also need to tell the site where to look (e.g. English, Japanese, Romaji, transcription method, etc.).
I wanted a single input line and nothing else, and every query should return some kind of result.
Of course, I had to deal with performance-tuning the search algorithm, and I think it performs pretty well now.
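To illustrate the idea of a single input line, here is a minimal sketch of how a query could be classified by the characters it contains, so that the site rather than the user decides where to look. This is not YuJisho's actual code: the function name `classify_query`, the category names, and the choice of Unicode ranges are assumptions made for this example.

```python
import re

# Hypothetical helper: guess what kind of lookup a query calls for.
# The categories and the ranges checked are illustrative assumptions.
def classify_query(text: str) -> str:
    text = text.strip()
    if not text:
        return "empty"
    # An explicit code point such as "U+6F22"
    if re.fullmatch(r"[Uu]\+[0-9A-Fa-f]{4,6}", text):
        return "codepoint"
    for ch in text:
        cp = ord(ch)
        # CJK Unified Ideographs, Extension A, and the supplementary
        # ideographic plane (U+20000..U+2FFFF)
        if (0x4E00 <= cp <= 0x9FFF
                or 0x3400 <= cp <= 0x4DBF
                or 0x20000 <= cp <= 0x2FFFF):
            return "han"
    for ch in text:
        # Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF)
        if 0x3040 <= ord(ch) <= 0x30FF:
            return "kana"
    # Anything else: treat as English, Romaji, Pinyin, etc.
    return "text"
```

A query such as `classify_query("dictionary")` falls through to `"text"`, which a search could then try against every Latin-script column in turn, so that there is always some kind of result.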
I came across a couple of problems when storing East Asian scripts and Latin text in the same SQL Server table:
When you look for a CJK character in an NVARCHAR column using the Latin1_General_CI_AS collation, the character may match any other character in that column. Switching to a collation supporting CJK, such as Chinese_PRC_90_CI_AI, solved the problem.
SQL Server 2000 did not handle surrogate pairs well with the available collation Chinese_PRC_CI_AI. According to a blog post by Qingsong Yao, Chinese_PRC_90_CI_AI and the related collations introduced in SQL Server 2005 solve the surrogate pair problem.
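For readers unfamiliar with surrogate pairs: NVARCHAR stores text as 16-bit code units, so any character beyond U+FFFF (which includes many of the rarer Han characters) takes up two units instead of one. The short Python sketch below shows the difference; the helper name `utf16_code_units` is made up for this example, and it only demonstrates the encoding, not SQL Server's behaviour.

```python
# Hypothetical helper: list the UTF-16 code units of a string.
def utf16_code_units(s: str) -> list[int]:
    data = s.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]

# U+6F22 lies in the Basic Multilingual Plane: one code unit.
bmp = utf16_code_units("\u6f22")

# U+20000 (CJK Extension B) lies outside it: a surrogate pair.
ext_b = utf16_code_units("\U00020000")

print([hex(u) for u in bmp])    # ['0x6f22']
print([hex(u) for u in ext_b])  # ['0xd840', '0xdc00']
```

A collation that is not aware of surrogates sees the second case as two separate "characters", which is where comparisons and string functions can go wrong.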
That said, here is my online character dictionary, YuJisho. The name is a combination of the U in Unicode and the Japanese word for “dictionary”.
Any feedback is welcome 😉