Updating YuJisho: a Unicode CJK Character web dictionary

January 17, 2014

I deployed by first version of YuJisho nearly 4 years ago, and, as I developed more and more MVC applications since then, I felt it was time to migrate the original ASP.Net application to ASP.Net MVC.

ASP.Net allowed (supported?) really messy code, so the challenges for an MVC migration are:

  • Extract business logic from the presentation layer to the business layer
  • Re-write the markup from ASP: controls to use native HTML
  • Re-write postbacks as HttpPost actions (both <form> and Ajax requests)

The layout also got a facelift using basic Bootstrap (version 2) styling, but the UI migration is not yet complete.

The data remains unchanged, containing Unicode 5.2, but an upgrade to Unicode 6.3 and the latest dictionary data is in the pipeline.

Enjoy browsing and searching πŸ˜‰

Accessing MediaWiki via JSON API

April 18, 2010

In its first version, YuJisho provided a web search interface to a collection of freely available dictionaries. The obvious extension to that principle is to include other encyclopedias and online dictionaries as well.

MediaWiki wikis not only display their contents in the /wiki/ root directory, but also provide a Query API via the /w/api.php URL. This API provides results in various formats, among them JSON, which is typically used by JavaScript clients.

JavaScript code can query this API to search for article titles in a given wiki. jQuery implements the getJSON() method to asynchronously retrieve results. If more than one request is to be executed, the ajax() method has to used with the parameter mode set to ‘queue’.

Out of all available Wikimedia projects, wikipedia.org and wiktionary.org languages have been selected that are most closely related to CJK characters (Chinese, Japanese, Korean) or for which most translations exist in the data (English, German, French, Russian).

So from now on, if you search on YuJisho (for example: 東京 (Tokyo), εŒ—δΊ¬ (Beijing)), every result page will automatically perform a JavaScript search in various wikis, and provide links to the relevant wiki pages.

YuJisho: a Unicode CJK Character web dictionary

April 8, 2010

Chinese (or Japanese) characters have been fascinating me since I first learned about them in the early 90’s, and I immediately started some small programming projects dealing with this topic, among them a Kanji flash card application, one of my first Windows (3.1) programs.

Every now and again, I visited the websites of Jim Breen and Unicode, downloaded fonts, built a vocabulary trainer, and so on. One of the latest activities was an analysis of the Unicode Han Database.

There are a number of CJK dictionaries on the web, and the main objection I find with most of these websites is that you not only need to specify what you are looking for, but also need to tell the site where to look (e.g English, Japanese, Romaji, transcription method, etc).

I wanted to have a single input line with nothing else, and there should always be some kind of result.

Of course, I had to deal with performance-tuning the search algorithm, and I think it performs pretty well now.

A couple of problems I came across dealing with Far East scripts and Latin in the same SQL Server table:

When you look for a CJK character in an NVARCHAR column using the Latin1_General_CI_AS collation, the character may match any other character in that column. Switching to a collation supporting CJK, such as Chinese_PRC_90_CI_AI, solved the problem.

SQL Server 2000 did not handle surrogate pairs well with the available collation Chinese_PRC_CI_AI. According to this blog by Qingsong Yao, the collation Chinese_PRC_90_CI_AI and related collations of SQL Server 2005 solve the surrogate pair problem.

That all said, here is my online character dictionary, YuJisho. The name is a combination of the U in Unicode and the Japanese word for “dictionary”.

Any feedback is welcome πŸ˜‰