After updating the web application of YuJisho, it was high time to provide the dictionary with current data.
The following data have been imported as of February 2020:
You can browse my online CJK dictionary YuJisho here.
I hope this gets rid of the messages in the Google Search Console 😉 :
There is a little change though: in times of GDPR & co., queries to Wikipedias and Wiktionaries need to be invoked by clicking the “Query Wikipedia” button, rather than being run automatically.
If your browser / operating system fails to display certain Chinese characters, there is now a button “Load Glyphs” which tries to load the unsupported characters’ images as .svg from GlyphWiki.
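As a rough illustration of what such a lookup might involve, here is a minimal Python sketch that builds an SVG address for a single character. The URL pattern and the glyph naming scheme ("u" followed by the lowercase hexadecimal code point) are assumptions made for this example, and `glyphwiki_svg_url` is a hypothetical helper, not the code YuJisho actually uses.

```python
def glyphwiki_svg_url(char: str) -> str:
    """Return an assumed GlyphWiki SVG URL for exactly one character."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    # Assumption: glyph names are "u" + lowercase hex code point,
    # e.g. U+6F22 -> "u6f22", U+20000 -> "u20000".
    glyph_name = f"u{ord(char):04x}"
    # Assumption: SVG images are served under /glyph/<name>.svg.
    return f"https://glyphwiki.org/glyph/{glyph_name}.svg"


print(glyphwiki_svg_url("漢"))
```

Because Python strings are indexed by code point, the same helper also works for characters outside the Basic Multilingual Plane without any surrogate handling.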
Please check the About page for more information.
I deployed my first version of YuJisho nearly 4 years ago, and, as I have developed more and more MVC applications since then, I felt it was time to migrate the original ASP.Net application to ASP.Net MVC.
ASP.Net allowed (supported?) really messy code, so the challenges for an MVC migration are:
The data remains unchanged, containing Unicode 5.2, but an upgrade to Unicode 6.3 and the latest dictionary data is in the pipeline.
Enjoy browsing and searching 😉
In its first version, YuJisho provided a web search interface to a collection of freely available dictionaries. The obvious extension to that principle is to include other encyclopedias and online dictionaries as well.
Out of all available Wikimedia projects, wikipedia.org and wiktionary.org languages have been selected that are most closely related to CJK characters (Chinese, Japanese, Korean) or for which most translations exist in the data (English, German, French, Russian).
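To make the idea concrete, here is a minimal Python sketch that builds one query URL per selected language for a given search term. It uses the standard MediaWiki `api.php` endpoint with `action=query`; the exact parameters, the language codes, and the helper name `wiki_query_urls` are assumptions for illustration, not a description of how YuJisho issues its requests.

```python
from urllib.parse import urlencode

# Assumed language codes for the languages named above.
LANGUAGES = ["zh", "ja", "ko", "en", "de", "fr", "ru"]


def wiki_query_urls(title: str, project: str = "wikipedia") -> dict:
    """Build one MediaWiki API URL per language for the given page title."""
    # urlencode percent-encodes the title as UTF-8.
    params = urlencode({"action": "query", "titles": title, "format": "json"})
    return {
        lang: f"https://{lang}.{project}.org/w/api.php?{params}"
        for lang in LANGUAGES
    }


urls = wiki_query_urls("漢")
print(urls["ja"])
```

Passing `project="wiktionary"` yields the corresponding Wiktionary URLs; the sketch only builds the addresses and does not send any request.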
Chinese (or Japanese) characters have fascinated me since I first learned about them in the early '90s, and I immediately started some small programming projects dealing with this topic, among them a Kanji flash card application, one of my first Windows (3.1) programs.
Every now and again, I visited the websites of Jim Breen and Unicode, downloaded fonts, built a vocabulary trainer, and so on. One of the latest activities was an analysis of the Unicode Han Database.
There are a number of CJK dictionaries on the web, and my main objection to most of these websites is that you not only need to specify what you are looking for, but also need to tell the site where to look (e.g. English, Japanese, Romaji, transcription method, etc.).
I wanted to have a single input line with nothing else, and there should always be some kind of result.
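One way to get by with a single input line is to guess the kind of query from the characters it contains and search the matching data accordingly. The Python sketch below illustrates that idea with a few Unicode block ranges; the category names and the function `classify_query` are hypothetical, and this is not necessarily how YuJisho decides where to look.

```python
def classify_query(query: str) -> str:
    """Guess the script of a search term from its code points."""
    has_kana = has_hangul = has_han = False
    for ch in query:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:            # Hiragana and Katakana
            has_kana = True
        elif 0xAC00 <= cp <= 0xD7A3:          # Hangul syllables
            has_hangul = True
        elif (0x4E00 <= cp <= 0x9FFF          # CJK Unified Ideographs
              or 0x3400 <= cp <= 0x4DBF       # Extension A
              or 0x20000 <= cp <= 0x2A6DF):   # Extension B
            has_han = True
    if has_kana:
        return "japanese"   # kana only occurs in Japanese text
    if has_hangul:
        return "korean"
    if has_han:
        return "cjk"        # ideographs alone do not identify a language
    return "non_cjk"        # e.g. English or German words, Romaji, Pinyin


for term in ["dictionary", "辞書", "じしょ", "사전"]:
    print(term, classify_query(term))
```

A query classified as "non_cjk" would still need to be tried against several fields (translations, transcriptions), which is where the performance tuning comes in.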
Of course, I had to deal with performance-tuning the search algorithm, and I think it performs pretty well now.
A couple of problems I came across dealing with Far East scripts and Latin in the same SQL Server table:
When you look for a CJK character in an NVARCHAR column using the Latin1_General_CI_AS collation, the character may match any other character in that column. Switching to a collation supporting CJK, such as Chinese_PRC_90_CI_AI, solved the problem.
SQL Server 2000 did not handle surrogate pairs well with the available collation Chinese_PRC_CI_AI. According to this blog post by Qingsong Yao, the collation Chinese_PRC_90_CI_AI and the related collations of SQL Server 2005 solve the surrogate pair problem.
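For readers unfamiliar with surrogate pairs: NVARCHAR columns store text as 16-bit UTF-16 code units, so a character beyond U+FFFF (such as the ideographs of CJK Extension B) occupies two units rather than one. The following Python sketch only demonstrates the encoding itself, not SQL Server behavior; `utf16_code_units` is a helper written for this example.

```python
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the UTF-16 code units that encode the given text."""
    data = text.encode("utf-16-le")
    # Every code unit is two bytes in little-endian order.
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


bmp_char = "漢"          # U+6F22, inside the Basic Multilingual Plane
ext_b_char = "\U00020000"  # U+20000, first character of CJK Extension B

print([hex(u) for u in utf16_code_units(bmp_char)])    # ['0x6f22']
print([hex(u) for u in utf16_code_units(ext_b_char)])  # ['0xd840', '0xdc00']
```

A collation that compares code units without recognizing the pair sees two separate "characters" where there is only one, which is the kind of situation the newer collations are meant to handle.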
That all said, here is my online character dictionary, YuJisho. The name is a combination of the U in Unicode and the Japanese word for “dictionary”.
Any feedback is welcome 😉
As of version 5.1, Unicode contains 71,234 CJK characters and a total of 1.1 million character field values.
The Unihan database groups character fields into Field Types. For each field type below, the fields, the (assumed) language, and the number of characters having a value for that field are listed.
The English gloss