Updating YuJisho Data Sources

February 25, 2020

After updating the web application of YuJisho, it was high time to provide the dictionary with current data.

The following data have been imported as of February 2020:

You can browse my online CJK dictionary YuJisho here.


Length of UTF-8 VARCHAR in SQL Server

February 21, 2020

Foreword

For as long as I can remember, a VARCHAR (or CHAR) was always defined as “1 character equals 1 byte”. Different character sets (code pages) were implemented as COLLATIONs, so that you had basic database support for internationalization.

Then came Unicode, and we got NVARCHAR strings (or NCHAR), where the rule was “1 character equals 2 bytes”, and we could store any text from around the world without bothering with code pages, encodings, etc. The .Net framework brought us the string class with similar features and the world was beautiful.

Then, in 2001, came Unicode 3.1 and needed more space:

For the first time, characters are encoded beyond the original 16-bit codespace or Basic Multilingual Plane (BMP or Plane 0). These new characters, encoded at code positions of U+10000 or higher, are synchronized with the forthcoming standard ISO/IEC 10646-2. For further information, see Article IX, Relation to 10646. Unicode 3.1 and 10646-2 define three new supplementary planes.

These additional planes are supported in SQL Server since SQL Server 2012: using an *_SC collation, NVARCHAR characters can take 2 or 4 bytes each.

In C#, the StringInfo class handles supplementary planes, but it seems they are still a bit behind:

Starting with the .NET Framework 4.6.2, character classification is based on The Unicode Standard, Version 8.0.0. For the .NET Framework 4 through the .NET Framework 4.6.1, it is based on The Unicode Standard, Version 6.3.0. In .NET Core, it is based on The Unicode Standard, Version 8.0.0.

(For the record, the current Unicode version is 12.1, and 13.0 is going to be released soon)
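To see what “supplementary” means on the .NET side, here is a minimal C# sketch (my own illustration, not from the quoted documentation): a character beyond the BMP, such as U+2000B, is stored as a surrogate pair of two UTF-16 code units, so string.Length and StringInfo.LengthInTextElements disagree.

using System;
using System.Globalization;

class SupplementaryDemo
{
    static void Main()
    {
        // U+2000B (𠀋) lies outside the BMP and becomes a surrogate pair in UTF-16
        string s = char.ConvertFromUtf32(0x2000B);

        Console.WriteLine(s.Length);                               // 2 (UTF-16 code units)
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1 (perceived character)
        Console.WriteLine(char.IsSurrogatePair(s[0], s[1]));       // True
    }
}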

UTF-8 Collations

So now SQL Server 2019 supports UTF-8-enabled collations.

A question on SO quoted the documentation as

A common misconception is to think that CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n) the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored

(emphasis mine) which confused me a little bit, and the quote continues

The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n.

(emphasis mine).

This got me investigating, and I had a closer look into the issue. I created a UTF-8-enabled database containing a table with all kinds of N/VARCHAR columns

CREATE DATABASE [test-sc] COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8

CREATE TABLE [dbo].[UTF8Test](
  [Id] [int] IDENTITY(1,1) NOT NULL,
  [VarcharText] [varchar](50) COLLATE Latin1_General_100_CI_AI NULL,
  [VarcharTextSC] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
  [VarcharUTF8] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL,
  [NVarcharText] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS NULL,
  [NVarcharTextSC] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
  [NVarcharUTF8] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL
)

I inserted test data from various Unicode ranges

INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('a','a','a','a','a','a')
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('ö','ö','ö',N'ö',N'ö',N'ö')
-- U+56D7
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'囗',N'囗',N'囗',N'囗',N'囗',N'囗')
-- U+2000B
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋')

then selected the lengths and data lengths of each text field

SELECT TOP (1000) [Id]
    ,[VarcharText],[VarcharTextSC],[VarcharUTF8]
    ,[NVarcharText],[NVarcharTextSC],[NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
    ,LEN([VarcharText]) VT,LEN([VarcharTextSC]) VTSC
    ,LEN([VarcharUTF8]) VU
    ,LEN([NVarcharText]) NVT,LEN([NVarcharTextSC]) NVTSC
    ,LEN([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
    ,DATALENGTH([VarcharText]) VT,DATALENGTH([VarcharTextSC]) VTSC
    ,DATALENGTH([VarcharUTF8]) VU
    ,DATALENGTH([NVarcharText]) NVT,DATALENGTH([NVarcharTextSC]) NVTSC
    ,DATALENGTH([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]

[Screenshot: the selected values with their LEN() and DATALENGTH() results]

I was surprised to find that the old mantra “a VARCHAR only stores single byte characters” needs to be revised when using UTF8 collations.
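The DATALENGTH values follow directly from the encodings: UTF-8 stores a code point in 1 to 4 bytes, whereas NVARCHAR (UTF-16) uses 2 bytes, or 4 for a surrogate pair. A small C# sketch (my own illustration) computes the byte counts for the four test characters, which should correspond to the storage of the UTF-8 VARCHAR and the NVARCHAR columns respectively:

using System;
using System.Text;

class ByteCountDemo
{
    static void Main()
    {
        // the four test characters from the INSERT statements above
        string[] samples = { "a", "ö", "囗", char.ConvertFromUtf32(0x2000B) };

        foreach (string s in samples)
        {
            Console.WriteLine($"{s}: UTF-8 = {Encoding.UTF8.GetByteCount(s)} byte(s), " +
                              $"UTF-16 = {Encoding.Unicode.GetByteCount(s)} byte(s)");
        }
        // a : UTF-8 = 1, UTF-16 = 2
        // ö : UTF-8 = 2, UTF-16 = 2
        // 囗: UTF-8 = 3, UTF-16 = 2
        // 𠀋: UTF-8 = 4, UTF-16 = 4
    }
}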

Table data only

Note that only table columns are associated with collations, not T-SQL variables, as you cannot declare a collation on a variable

DECLARE @VarcharText varchar(50), @NVarcharText nvarchar(50)

SELECT @VarcharText = [VarcharText], @NVarcharText = [NVarcharText]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharTextSC], @NVarcharText = [NVarcharTextSC]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharUTF8], @NVarcharText = [NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)

 

[Screenshot: LEN() and DATALENGTH() of the variables]


Updating YuJisho: a Unicode CJK Character web dictionary

February 17, 2020

My online CJK dictionary YuJisho got a facelift again – this time, from ASP.Net MVC3 to the current MVC 5, and from Bootstrap 2 to Bootstrap 4.

I hope this gets rid of the messages in the Google Search Console 😉 :

  •  Text too small to read
  • Clickable elements too close together
  • Viewport not set

There is a little change though: in times of GDPR & co., queries to the Wikipedias and Wiktionaries need to be invoked by clicking the “Query Wikipedia” button, rather than running automatically.

[Screenshot: the “Query Wikipedia” button]

[Screenshot: the resulting links to the various Wikipedias containing an article]

If your browser / operating system fails to display certain Chinese characters, there is now a button “Load Glyphs” which tries to load the unsupported characters’ images as .svg from GlyphWiki.

[Screenshots: glyph images loaded from GlyphWiki after clicking “Load Glyphs”]

Please check the About page for more information.


Emoticon Selection in ASP.Net MVC

March 30, 2017

I had to implement a good/bad/neutral selection UI in an ASP.Net MVC application.

This simple task turned out to be rather cumbersome, given different browsers implementing different things differently, and the MVC framework restricting access to generated markup.

I first thought about a DropDownList (i.e. a <select> element) rendering a set of emoji, only to find

  • you cannot pass a C# emoji as SelectListItem.Text
    new SelectListItem { Text = "\U0001F60A" }   // U+1F60A 😊 needs the 8-digit \U escape
  • you cannot add the emoji HTML entity 😊 as text (at least not in Chrome)
  • you cannot add a @class parameter to SelectListItem
  • you cannot add a background-image:url() definition to an <option>
  • you cannot add HTML tags inside an <option>

However, I found code to overcome the limitations of SelectListItem, either by copying and extending code from the MS MVC framework, or by XML manipulation of the built-in HTML generator.

Maybe the dropdown list was just the wrong path to follow, so I searched for ways to style radiobuttons, and found this nice demo on codepen.

I modified the demo to fit my needs, and voilà, here’s my first codepen:

A radiobutton-based emoji selector

Update: Apparently, IE does not make an <img> inside a <label> clickable, unless you specify pointer-events: none.


Analyzing Combining Unicode Characters

March 23, 2014

Some scripts supported by the Unicode standard define combining characters, which may cause confusion for people not familiar with a specific script:

For one of these questions, I even analyzed the character sequence manually. But this analysis is not much fun if you try to perform it repeatedly.

Recently I stumbled upon the SE user name n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳, so I had the idea to write a small program to output the Unicode code points and character names for a given input, based on Unicode’s UnicodeData.txt file.
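The core of such a program can be quite small. Here is a minimal C# sketch of the idea (not the actual implementation; the UnicodeData.txt path and the fallback input are assumptions of mine):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class UnicodeSpell
{
    static void Main(string[] args)
    {
        // build a "code point -> character name" lookup from UnicodeData.txt
        // (semicolon-separated: field 0 = hex code point, field 1 = name)
        Dictionary<int, string> names = File.ReadLines("UnicodeData.txt")
            .Select(line => line.Split(';'))
            .ToDictionary(fields => Convert.ToInt32(fields[0], 16), fields => fields[1]);

        string input = args.Length > 0 ? args[0] : "n\u0334\u0316h";

        // walk the string code point by code point, honoring surrogate pairs
        for (int i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
        {
            int cp = char.ConvertToUtf32(input, i);
            string name = names.TryGetValue(cp, out string n) ? n : "(unknown)";
            Console.WriteLine($"U+{cp:X4}  {name}");
        }
    }
}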

The output for the text samples in the above links looks like this:

[Screenshots: unispell output for the sample texts and for the user name n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳]

The initial version of this program is available for download here.


Updating YuJisho: a Unicode CJK Character web dictionary

January 17, 2014

I deployed my first version of YuJisho nearly 4 years ago, and, having developed more and more MVC applications since then, I felt it was time to migrate the original ASP.Net application to ASP.Net MVC.

ASP.Net allowed (supported?) really messy code, so the challenges for an MVC migration are:

  • Extract business logic from the presentation layer to the business layer
  • Re-write the markup from asp: server controls to native HTML
  • Re-write postbacks as HttpPost actions (both <form> and Ajax requests), as sketched below
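For the last point, the pattern is roughly the following (a minimal sketch with made-up controller, action, and view names, not the actual YuJisho code):

using System.Web.Mvc;

public class DictionaryController : Controller
{
    // GET: renders the search form
    public ActionResult Search()
    {
        return View();
    }

    // POST: replaces the Web Forms postback; also serves the Ajax requests
    [HttpPost]
    public ActionResult Search(string term)
    {
        // hypothetical call into the business layer
        var results = new { Term = term };

        if (Request.IsAjaxRequest())
            return PartialView("_SearchResults", results);

        return View(results);
    }
}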

The layout also got a facelift using basic Bootstrap (version 2) styling, but the UI migration is not yet complete.

The data remains unchanged, containing Unicode 5.2, but an upgrade to Unicode 6.3 and the latest dictionary data is in the pipeline.

Enjoy browsing and searching 😉


ASCII(), UNICODE() and the Replacement Character

June 3, 2013

This is a follow-up article on my previous post about string equality and collations.

We know, simply from looking at them, that the characters 'ܐ' and 'አ' are different, but the collation you use may treat them as equal.

Well, we can still compare their code point value by using the UNICODE() function, as in

select unicode(N'ܐ'), unicode(N'አ')

returning 1808 and 4768.

The reason I write this is because I discovered a fallacy in a comment on SO, resulting from mixing Unicode and non-Unicode literals and functions.

Take the statement

select ascii('ܐ'), ascii('አ')

Note that the Unicode characters are not given as Unicode strings (the N'...' notation), but as non-Unicode strings.

Since both characters cannot be mapped onto Latin ASCII (or whatever your collation is), they are replaced by a Replacement Character, which is the question mark ‘?’ in ASCII.

Wikipedia tells us

The question mark character is also often used in place of missing or unknown data. In Unicode, it is encoded at U+003F ? question mark (HTML: &#63;).

and

In many web browsers and other computer programs, “?” is used to show a character not found in the program’s character set. […] Some fonts will instead use the Unicode Replacement Glyph (U+FFFD, �), which is commonly rendered as a white question mark in a black diamond (see replacement character).

So we can see where the question mark comes from, and thus both functions return 63.

In a similar, but nonetheless different case

select ascii(N'ܐ'), ascii(N'አ')

the characters are defined as Unicode strings, but passed to a function that only accepts non-Unicode strings. In this case, the mapping according to the current collation is performed by the ASCII() function, again resulting in the value 63.
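The same substitution can be reproduced outside SQL Server; in C#, for example (a small sketch of my own), encoding the two characters to ASCII also falls back to ‘?’:

using System;
using System.Text;

class AsciiFallbackDemo
{
    static void Main()
    {
        // Encoding.ASCII replaces characters it cannot represent with '?'
        byte[] bytes = Encoding.ASCII.GetBytes("ܐአ");
        Console.WriteLine(string.Join(",", bytes));   // 63,63
    }
}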

As for the Unicode Replacement Character (U+FFFD), you’ll encounter it if you decode a byte array and the decoder encounters a byte sequence that cannot be converted to Unicode under the selected encoding.
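In C#, for instance (again a sketch of my own), a truncated UTF-8 sequence decodes to U+FFFD:

using System;
using System.Text;

class ReplacementCharDemo
{
    static void Main()
    {
        // 0xC3 starts a two-byte UTF-8 sequence, but the second byte is missing
        byte[] bytes = { 0x61, 0xC3 };
        string s = Encoding.UTF8.GetString(bytes);

        Console.WriteLine(s);              // "a�"
        Console.WriteLine((int)s[1]);      // 65533 = U+FFFD REPLACEMENT CHARACTER
    }
}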