After updating the web application of YuJisho, it was high time to provide the dictionary with current data.
The following data have been imported as of February 2020:
You can browse my online CJK dictionary YuJisho here.
For as long as I can remember, a VARCHAR (or CHAR) was always defined as “1 character equals 1 byte”. Different character sets (code pages) were implemented as COLLATIONs, so that you had basic database support for internationalization.
Then came Unicode, and we got NVARCHAR strings (or NCHAR), where the rule was “1 character equals 2 bytes”, and we could store any text from around the world without bothering with code pages, encodings, etc. The .Net framework brought us the string class with similar features and the world was beautiful.
Then, in 2001, came Unicode 3.1 and needed more space:
For the first time, characters are encoded beyond the original 16-bit codespace or Basic Multilingual Plane (BMP or Plane 0). These new characters, encoded at code positions of U+10000 or higher, are synchronized with the forthcoming standard ISO/IEC 10646-2. For further information, see Article IX, Relation to 10646. Unicode 3.1 and 10646-2 define three new supplementary planes.
These additional planes were eventually supported in SQL Server 2012: using an *_SC collation, NVARCHAR characters could take 2 or 4 bytes per character.
In C#, the StringInfo class handles supplementary planes, but it seems .NET is still a bit behind:
Starting with the .NET Framework 4.6.2, character classification is based on The Unicode Standard, Version 8.0.0. For the .NET Framework 4 through the .NET Framework 4.6.1, it is based on The Unicode Standard, Version 6.3.0. In .NET Core, it is based on The Unicode Standard, Version 8.0.0.
(For the record, the current Unicode version is 12.1, and 13.0 is going to be released soon)
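To see the difference in practice, here is a minimal C# sketch (my own illustration, not from the original post) showing how a supplementary-plane character is counted by string.Length versus StringInfo:

using System;
using System.Globalization;

class SupplementaryPlaneDemo
{
    static void Main()
    {
        // U+2000B lies outside the BMP and is stored as a surrogate pair in UTF-16
        string s = "\U0002000B";

        Console.WriteLine(s.Length);                                // 2 UTF-16 code units
        Console.WriteLine(new StringInfo(s).LengthInTextElements);  // 1 text element
        Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 2000B
    }
}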
So now SQL Server 2019 supports UTF-8-enabled collations.
A question on SO quoted the documentation as
A common misconception is to think that CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n) the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored
(emphasis mine) which confused me a little bit, and the quote continues
The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n.
(emphasis mine).
This got me investigating, so I had a closer look at the issue. I created a UTF-8-enabled database with a table containing all kinds of N/VARCHAR columns:
CREATE DATABASE [test-sc] COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8

CREATE TABLE [dbo].[UTF8Test](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [VarcharText] [varchar](50) COLLATE Latin1_General_100_CI_AI NULL,
    [VarcharTextSC] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
    [VarcharUTF8] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL,
    [NVarcharText] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS NULL,
    [NVarcharTextSC] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
    [NVarcharUTF8] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL
)
I inserted test data from various Unicode ranges
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('a','a','a','a','a','a')
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('ö','ö','ö',N'ö',N'ö',N'ö')
-- U+56D7
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'囗',N'囗',N'囗',N'囗',N'囗',N'囗')
-- U+2000B
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋')
then selected the lengths and data lengths of each text field
SELECT TOP (1000) [Id]
    ,[VarcharText],[VarcharTextSC],[VarcharUTF8]
    ,[NVarcharText],[NVarcharTextSC],[NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]

SELECT TOP (1000) [Id]
    ,LEN([VarcharText]) VT,LEN([VarcharTextSC]) VTSC,LEN([VarcharUTF8]) VU
    ,LEN([NVarcharText]) NVT,LEN([NVarcharTextSC]) NVTSC,LEN([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]

SELECT TOP (1000) [Id]
    ,DATALENGTH([VarcharText]) VT,DATALENGTH([VarcharTextSC]) VTSC,DATALENGTH([VarcharUTF8]) VU
    ,DATALENGTH([NVarcharText]) NVT,DATALENGTH([NVarcharTextSC]) NVTSC,DATALENGTH([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]
I was surprised to find that the old mantra “a VARCHAR only stores single byte characters” needs to be revised when using UTF8 collations.
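To illustrate why the mantra breaks down, here is a small C# sketch (my own illustration, not taken from the post) that computes the UTF-8 and UTF-16 byte counts of the same test characters; they correspond to the DATALENGTH values of the UTF8-collated VARCHAR columns and the NVARCHAR columns, respectively:

using System;
using System.Text;

class Utf8ByteCounts
{
    static void Main()
    {
        // the characters used in the INSERT statements above: a, ö, 囗 (U+56D7), 𠀋 (U+2000B)
        foreach (string s in new[] { "a", "\u00F6", "\u56D7", "\U0002000B" })
        {
            Console.WriteLine("{0}: UTF-8 = {1} byte(s), UTF-16 = {2} byte(s)",
                s, Encoding.UTF8.GetByteCount(s), Encoding.Unicode.GetByteCount(s));
        }
        // output: 1/2, 2/2, 3/2, 4/4 bytes
    }
}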
Note that collations are associated only with table columns, not with T-SQL variables, as you cannot declare a collation on a variable:
-- variable declarations assumed to match the column definitions
DECLARE @VarcharText varchar(50)
DECLARE @NVarcharText nvarchar(50)

SELECT @VarcharText = [VarcharText], @NVarcharText = [NVarcharText]
FROM [test-sc].[dbo].[UTF8Test] WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharTextSC], @NVarcharText = [NVarcharTextSC]
FROM [test-sc].[dbo].[UTF8Test] WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharUTF8], @NVarcharText = [NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test] WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)
My online CJK dictionary YuJisho got a facelift again – this time, from ASP.Net MVC3 to the current MVC 5, and from Bootstrap 2 to Bootstrap 4.
I hope this gets rid of the messages in the Google Search Console 😉 :
There is a little change though: in times of GDPR & Co., queries to Wikipedias and Wiktionaries need to be invoked by clicking the “Query Wikipedia” button, rather than being issued automatically.
Clicking the “Query Wikipedia” button results in links to the various Wikipedias containing an article.
If your browser / operating system fails to display certain Chinese characters, there is now a button “Load Glyphs” which tries to load the unsupported characters’ images as .svg from GlyphWiki.
Please check the About page for more information.
I had to implement a good/bad/neutral selection UI in an ASP.Net MVC application.
This simple task turned out to be rather cumbersome, given different browsers implementing different things differently, and the MVC framework restricting access to generated markup.
I first thought about a DropDownList (i.e. an HTML <select> element) rendering a set of emoji, only to run into the limitations of SelectListItem:
new SelectListItem { Text = "\U0001F60A" }
However, I found code to overcome these limitations, either by copying and extending code from the MS MVC framework, or by XML manipulation of the markup produced by the built-in HTML generator.
Maybe the dropdown list was just the wrong path to follow, so I searched for ways to style radiobuttons, and found this nice demo on codepen.
I modified the demo to fit my needs, and voilà, here’s my first codepen:
A radiobutton-based emoji selector
Update: Apparently, IE does not render an <img> inside a <label> clickable, unless you specify pointer-events: none.
Some scripts supported by the Unicode standard define combining characters, which may cause confusion for people not familiar with a specific script:
For one of these questions, I even analyzed the character sequence manually. But this analysis is not much fun if you have to perform it repeatedly.
Recently I stumbled upon the SE user name n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳, so I had the idea to write a small program to output the Unicode code points and character names for a given input, based on Unicode’s UnicodeData.txt file.
The output for the text samples in the above links looks like this:
The initial version of this program is available for download here.
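The core of such a program is just an iteration over the code points of the input. Here is a rough C# sketch of the idea (the name lookup against UnicodeData.txt is omitted, and the downloadable version certainly differs in detail):

using System;
using System.Globalization;

class DumpCodePoints
{
    static void Main(string[] args)
    {
        string input = args.Length > 0 ? args[0] : "a\u0301";   // sample: a + COMBINING ACUTE ACCENT

        int i = 0;
        while (i < input.Length)
        {
            int cp = char.ConvertToUtf32(input, i);             // handles surrogate pairs
            Console.WriteLine("U+{0:X4} {1}", cp, CharUnicodeInfo.GetUnicodeCategory(input, i));
            i += char.IsSurrogatePair(input, i) ? 2 : 1;
        }
    }
}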
I deployed my first version of YuJisho nearly 4 years ago, and, as I developed more and more MVC applications since then, I felt it was time to migrate the original ASP.Net application to ASP.Net MVC.
ASP.Net allowed (supported?) really messy code, so the challenges for an MVC migration are:
The layout also got a facelift using basic Bootstrap (version 2) styling, but the UI migration is not yet complete.
The data remains unchanged, containing Unicode 5.2, but an upgrade to Unicode 6.3 and the latest dictionary data is in the pipeline.
Enjoy browsing and searching 😉
This is a follow-up article on my previous post about string equality and collations.
We know, simply from looking at them, that the characters 'ܐ'
and 'አ'
are different, but the collation you use may treat them as equal.
Well, we can still compare their code point value by using the UNICODE() function, as in
select unicode(N'ܐ'), unicode(N'አ')
returning 1808 and 4768.
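For comparison, the same code point values can be obtained in C# by simply casting the characters to int (my own cross-check, not part of the original post):

using System;

class CodePointValues
{
    static void Main()
    {
        Console.WriteLine((int)'ܐ');   // 1808 = U+0710 SYRIAC LETTER ALAPH
        Console.WriteLine((int)'አ');   // 4768 = U+12A0 ETHIOPIC SYLLABLE GLOTTAL A
    }
}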
The reason I write this is that I discovered a fallacy in a comment on SO, caused by mixing Unicode and non-Unicode literals and functions.
Take the statement
select ascii('ܐ'), ascii('አ')
Note that the Unicode characters are not given as Unicode string literals (N'...' notation), but as non-Unicode strings.
Since neither character can be mapped onto Latin ASCII (or whatever your collation's character set is), they are replaced by a replacement character, which in ASCII is the question mark ‘?’.
Wikipedia tells us
The question mark character is also often used in place of missing or unknown data. In Unicode, it is encoded at U+003F ? question mark (HTML: &#63;).
and
In many web browsers and other computer programs, “?” is used to show a character not found in the program’s character set. […] Some fonts will instead use the Unicode Replacement Glyph (U+FFFD, �), which is commonly rendered as a white question mark in a black diamond (see replacement character).
So we can see where the question mark comes from, and thus both functions return 63.
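The same substitution can be reproduced outside SQL Server. For example, a short C# sketch (my illustration) encodes the two characters to ASCII and gets the ‘?’ byte 0x3F back:

using System;
using System.Text;

class QuestionMarkFallback
{
    static void Main()
    {
        // neither character exists in ASCII, so the default encoder fallback substitutes '?'
        byte[] bytes = Encoding.ASCII.GetBytes("ܐአ");
        Console.WriteLine(BitConverter.ToString(bytes));   // 3F-3F
    }
}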
In a similar, but nonetheless different case
select ascii(N'ܐ'), ascii(N'አ')
the characters are defined as Unicode strings, but passed to a function that only accepts non-Unicode strings. In this case, the mapping according to the current collation is performed by the ASCII() function, again resulting in the value 63.
As for the Unicode Replacement Character, you’ll encounter it if you decode a byte array to Unicode, and the decoder encounters a byte sequence that cannot be converted to Unicode given the selected encoding.
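Again as a quick C# illustration of mine (not from the post): decoding a byte sequence that is invalid in the chosen encoding yields U+FFFD:

using System;
using System.Text;

class ReplacementCharacter
{
    static void Main()
    {
        byte[] invalid = { 0x61, 0xFF };                 // 0xFF never occurs in valid UTF-8
        string decoded = Encoding.UTF8.GetString(invalid);

        foreach (char c in decoded)
            Console.Write("U+{0:X4} ", (int)c);          // U+0061 U+FFFD
    }
}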
I recently came across this interesting question on SO:
Why does the statement
select case when N'ܐܪܡܝܐ' = N'አማርኛ' then 1 else 0 end
return 1?
In case you are wondering, the first string seems to be Syriac/Aramaic for Aramaic, and the second string is Amharic for Amharic, as found on Wikipedia.
Obviously, the strings are not equal, but still the result of the comparison is true. But why?
The reason is, of course, that string comparison and sorting in SQL Server depend on the collation in use. And the collation used in the example was SQL_Latin1_General_CP1_CI_AS, which does not distinguish Aramaic and Amharic characters.
If you just want to test whether two strings are equal as in “equal Code Points” or “binary equal”, you need to select a specific collation which distinguishes the code points, such as Latin1_General_BIN:
select case
when N'ܐܪܡܝܐ' COLLATE Latin1_General_BIN = N'አማርኛ' COLLATE Latin1_General_BIN
then 1 else 0 end
So how can we analyze how collations affect comparison?
First, I created two tables: One stores records about character groups, the second stores the characters in each group that are considered equal by the collation:
CREATE TABLE [dbo].[CollationString](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Collation] [varchar](100) NOT NULL,
    [String] [nvarchar](50) NOT NULL,
    [StringInt] [int] NULL,
    CONSTRAINT [PK_CollationString] PRIMARY KEY CLUSTERED ([Id] ASC))
GO

CREATE TABLE [dbo].[CollationStringEquivalent](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [CollationStringId] [int] NOT NULL,
    [String] [nvarchar](50) NOT NULL,
    [StringInt] [int] NULL,
    CONSTRAINT [PK_CollationStringEquivalent] PRIMARY KEY CLUSTERED ([Id] ASC))
GO
I added a field Collation so that different collations can be compared later on.
Next, we iterate through all characters in the Unicode BMP, look up the character, and insert according to the result:
set nocount on

declare @c int = 0
declare @coll nvarchar(50) = 'Latin1_General_CI_AS'
declare @cid int

while @c < 65536
begin
    print @c
    set @cid = null;

    select @cid = id from collationstring
    where collation collate Latin1_General_CI_AS = @coll collate Latin1_General_CI_AS
        and string = nchar(@c);

    if @cid is null
    begin
        insert into collationstring(collation, string, stringint) values (@coll, NCHAR(@c), @c);
        select @cid = SCOPE_IDENTITY()
    end

    insert into collationstringequivalent(collationstringid, string, stringint) values (@cid, NCHAR(@c), @c);

    set @c = @c + 1
end
After running this script, we can now query which characters are considered different
SELECT * FROM [CollationString]
and which characters are equal
SELECT * FROM [CollationStringEquivalent] ORDER BY [CollationStringId], [StringInt]
To analyze the distribution of characters in one collation, we can start with a query like this
SELECT COUNT([Id]), [CollationStringId], MIN([String]), MIN([StringInt]) FROM [CollationStringEquivalent] GROUP BY collationstringid ORDER BY COUNT(id) DESC, MIN(stringint)
This article deals with Unicode Normalization, Composition, and Decomposition. If you are new to the topic, please read the introduction.
Unicode defines 4 Normalization Forms (source: Unicode):

| Form | Description |
|---|---|
| Normalization Form D (NFD) | Canonical Decomposition |
| Normalization Form C (NFC) | Canonical Decomposition, followed by Canonical Composition |
| Normalization Form KD (NFKD) | Compatibility Decomposition |
| Normalization Form KC (NFKC) | Compatibility Decomposition, followed by Canonical Composition |
which may result in different sequences of Unicode code points, depending on the chosen normalization, that are nevertheless considered equal according to the Unicode standard (source: Wikipedia):
| NFC character | A | m | é | | l | i | e |
|---|---|---|---|---|---|---|---|
| NFC code point | 0041 | 006d | 00e9 | | 006c | 0069 | 0065 |
| NFD code point | 0041 | 006d | 0065 | 0301 | 006c | 0069 | 0065 |
| NFD character | A | m | e | ◌́ | l | i | e |
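The table can be verified with a few lines of C# (my own check, assuming the precomposed spelling of “Amélie”):

using System;
using System.Text;

class AmelieNormalization
{
    static void Main()
    {
        string nfc = "Am\u00E9lie";                                  // é as one precomposed code point
        string nfd = nfc.Normalize(NormalizationForm.FormD);

        Console.WriteLine("{0} -> {1} code units", nfc.Length, nfd.Length);   // 6 -> 7

        foreach (char c in nfd)
            Console.Write("{0:X4} ", (int)c);   // 0041 006D 0065 0301 006C 0069 0065
    }
}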
Let’s iterate through all code points in the BMP and collect the lengths of their normalizations in all forms:
foreach (var nf in new[] { NormalizationForm.FormC, NormalizationForm.FormD, NormalizationForm.FormKC, NormalizationForm.FormKD })
{
    var chars = new Dictionary<int, List<string>>();
    chars.Add(0, new List<string>());

    for (var i = 0; i < 65536; i++)
    {
        var s = new string((char)i, 1);
        try
        {
            var l = s.Normalize(nf).Length;
            if (!chars.ContainsKey(l))
                chars.Add(l, new List<string>());
            chars[l].Add(s);
        }
        catch
        {
            chars[0].Add(s);
        }
    }

    Console.WriteLine(nf.ToString());
    foreach (var kv in chars.OrderBy(d => d.Key))
    {
        Console.WriteLine("length " + kv.Key + " count " + kv.Value.Count);
    }
}
This results in the output
FormC
length 0 count 2082
length 1 count 63376
length 2 count 76
length 3 count 2
FormD
length 0 count 2082
length 1 count 51258
length 2 count 1167
length 3 count 10993
length 4 count 36
FormKC
length 0 count 2082
length 1 count 62290
length 2 count 692
length 3 count 401
length 4 count 52
length 5 count 15
length 6 count 2
length 8 count 1
length 18 count 1
FormKD
length 0 count 2082
length 1 count 50151
length 2 count 1750
length 3 count 11412
length 4 count 108
length 5 count 16
length 6 count 14
length 7 count 1
length 8 count 1
length 18 count 1
Wait, 18??
To find out what’s going on, I rewrote the code as a WinForms application to get support for Unicode fonts.
As it turns out, the longest sequences depending on the selected Normalization Form are:
FormC: nothing truly spectacular, just two code points resulting in length 3:
שּׁ (U+fb2c): 3, ש (U+05e9), ּ (U+05bc), ׁ (U+05c1)
שּׂ (U+fb2d): 3, ש (U+05e9), ּ (U+05bc), ׂ (U+05c2)
FormD: length 4 for some characters in the Greek Extended block.
FormKC: lengths 4, 5, 6: decomposition of Roman numerals, parenthesized numbers, and Japanese and Latin SQUARE abbreviations (such as units of measurement):
Ⅷ (U+2167): 4, V (U+0056), I (U+0049), I (U+0049), I (U+0049)
⑽ (U+247d): 4, ( (U+0028), 1 (U+0031), 0 (U+0030), ) (U+0029)
㌳ (U+3333): 4, フ (U+30d5), ィ (U+30a3), ー (U+30fc), ト (U+30c8)
length 8: ‘ARABIC LIGATURE JALLAJALALOUHOU’ (U+FDFB)
ﷻ (U+fdfb): 8, ج (U+062c), ل (U+0644), (U+0020), ج (U+062c), ل (U+0644), ا (U+0627), ل (U+0644), ه (U+0647)
length 18: ‘ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM’ (U+FDFA)
ﷺ (U+fdfa): 18, ص (U+0635), ل (U+0644), ى (U+0649), (U+0020), ا (U+0627), ل (U+0644), ل (U+0644), ه (U+0647), (U+0020), ع (U+0639), ل (U+0644), ي (U+064a), ه (U+0647), (U+0020), و (U+0648), س (U+0633), ل (U+0644), م (U+0645)
FormKD: lengths 4, 5, 6: again Greek, parenthesized numerals and Hangul, and SQUARE Japanese and Latin abbreviations.
length 7: ‘PARENTHESIZED KOREAN CHARACTER OJEON’ (U+321D)
length 8 as under FormKC
Software often implements more features than described in standards. Have a look at this SO question on the abuse of COMBINING marks to see one result, and my explanation.
ด้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้дด็็็็็้้้้้็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้
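To make the abuse visible, one can count how many combining marks are stacked on each base character. A minimal C# sketch (my own, using a short constructed sample rather than the full user name) relies on StringInfo grouping combining marks with their base character:

using System;
using System.Globalization;

class CombiningMarkCount
{
    static void Main()
    {
        // a base character with deliberately stacked combining marks, plus a normal 'á'
        string s = "d\u0334\u0316\u030B\u0352a\u0301";

        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            string element = e.GetTextElement();
            Console.WriteLine("base '{0}' carries {1} combining mark(s)", element[0], element.Length - 1);
        }
    }
}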
A user of SMOscript sent me a bug report:
Some of our stored procedures have apostrophes in comments. If we export our DB and enable the “if not exists” option, our scripts become invalid, as the commented apostrophe ends the string.
He suspected SMOscript might encode the SP code incorrectly, but I made sure that was not the case, so I asked for a (redacted) sample of such a stored procedure.
I added the SP to a database, and in SSMS ran Generate Script on the procedure both with and without the IF NOT EXISTS option. In “plain” mode this simply generates the CREATE PROCEDURE statement, whereas in the IF NOT EXISTS case it will generate the IF NOT EXISTS check, and then create the SP using sp_executesql and dynamic SQL (and some people do not like this very much).
Next, I ran SMOscript with the -i switch, which activates the IF NOT EXISTS mode:
>smoscript.exe -s localhost -d mydatabase -o fooSP -i > fooSP.sql
When I opened the generated file in SSMS, I saw that, indeed, the lines that originally contained an apostrophe now contained a quote character which ended the string. But I also noted that other quote characters got correctly escaped using 2 quotes.
Then it struck me: the email mentioned apostrophes, but what I saw here was quotes!
I opened the original file in Notepad++ in hex mode, and there it was: e2 80 99
, the UTF-8 encoding for U+2019, formally named RIGHT SINGLE QUOTATION MARK, but apostrophe among friends 😉
Given its code point, it is obvious that this character is not contained in the ASCII character set, so SMOscript has to generate Unicode or UTF-8 encoded output.
Fortunately, this functionality is already built-in: use -T for Unicode, or -U for UTF-8 encoding:
>smoscript.exe -s localhost -d mydatabase -o fooSP -i -U -f .\fooSP.sql
Note that I use the -f switch to directly write the output to a file, since the console might not be able to display Unicode characters, or it might, and I’m not smart enough 😉 Anyway, directly creating the file from SMOscript frees you from the hassles of piping “funny” characters.
In hindsight, what happened was that SMOscript wrote the output to the console (via piping), and, using the native code page, the console was not able to correctly output the apostrophe character, and replaced it with the quote character, thus breaking the string parameter for sp_executesql.
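The effect can be simulated in C# (a sketch under assumptions: code page 850 as a typical Western OEM console code page; whether the substitute is a plain straight quote via best-fit mapping or a question mark depends on the code page and the runtime's mapping tables):

using System;
using System.Text;

class ConsoleCodePageDemo
{
    static void Main()
    {
        // on .NET Core / .NET 5+, legacy code pages require the CodePages encoding provider
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string text = "it\u2019s";                    // U+2019 RIGHT SINGLE QUOTATION MARK
        Encoding oem = Encoding.GetEncoding(850);     // assumed console code page

        byte[] bytes = oem.GetBytes(text);
        Console.WriteLine(BitConverter.ToString(bytes));
        Console.WriteLine(oem.GetString(bytes));      // the typographic apostrophe does not survive the round trip
    }
}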
You should always be aware that if you translate between code pages, not every character can be mapped to a character in the target code page, and some mappings will cause difficulties.
Fortunately, this problem was quite easy to solve.