Writing “Funny Characters”? Make sure you use the correct Encoding!

January 14, 2013

A user of SMOscript sent me a bug report:

Some of our stored procedures have apostrophes in comments. If we export our DB and enable the “if not exists” option, our scripts become invalid, as the commented apostrophe ends the string.

He suspected SMOscript might encode the SP code incorrectly, but I made sure that was not the case, so I asked for a (redacted) sample of such a stored procedure.

I added the SP to a database, and in SSMS ran Generate Script on the procedure both with and without the IF NOT EXISTS option. In “plain” mode this simply generates the CREATE PROCEDURE statement, whereas in the IF NOT EXISTS case it will generate the IF NOT EXISTS check, and then create the SP using sp_executesql and dynamic SQL (and some people do not like this very much).

Next, I ran SMOscript with the -i switch, which activates the IF NOT EXISTS mode:

>smoscript.exe -s localhost -d mydatabase -o fooSP -i > fooSP.sql

When I opened the generated file in SSMS, and indeed, the lines that originally contained an apostrophe not contained a quote which ended the string. But I also noted that other quote characters got correctly escaped using 2 quotes.

Then it struck me: the email mentioned apostrophes, but what I saw here was quotes!

I opened the original file in Notepad++ in hex mode, and there it was: e2 80 99, the UTF-8 encoding for U+2019, formally named RIGHT SINGLE QUOTATION MARK, but apostrophe among friends ;)

Given its code point, it is obvious that this character is neither in the ASCII nor in the ASCII character set, so SMOscript has to generate Unicode or UTF-8 encoding.

Fortunately, this functionality is already built-in: use -T for Unicode, or -U for UTF-8 encoding:

>smoscript.exe -s localhost -d mydatabase -o fooSP -i -U -f .\fooSP.sql

Note that I use the -f switch to directly write the output to a file, since the console might not be able to display Unicode characters, or it might, and I’m not smart enough ;) Anyway, directly creating the file from SMOscript frees you from the hassles of piping “funny” characters.

In hindsight, what happened was that SMOscript wrote the output to the console (via piping), and, using the native code page, the console was not able to correctly output the apostrophe character, and replaced it with the quote character, thus breaking the string parameter for sp_executesql.

You should always be aware that if you translate between code pages, not every character can be mapped to a character in the target code page, and some mappings will cause difficulties.

Fortunately, this problem was quite easy to solve.


“Why can’t I store/display [insert name] language data in SQL Server?”

September 17, 2012

I came across this and similar questions so many times on Stack Overflow:

My guess is that most of these question are due to a lack of understanding of Unicode and how the various components of an application must work together to display Unicode texts correctly.

Storage

If you store Unicode data in the database, you have to use the Unicode data types NVARCHAR(n), NVARCHAR(MAX), NTEXT (obsolete) or NCHAR (fixed-length strings). Unicode data is stored in SQL Server using the UCS-2 encoding, meaning every character uses 2 bytes to store, unless the database is created using an Supplementary Characters collation, causing the database to use the UTF-16 encoding (same encoding for the BMP, but support for all Unicode planes).

Except for the _SC collations, defining a collation does not have an influence on how Unicode data is stored.

In general, a collation defines how text data is compared and sorted. The collation can be set on database level (default for all text columns), on column level, and on operand level (i.e. the value of a comparison or ORDER BY clause) using the COLLATE clause.

Unicode Values

Unicode values can either be variables (or parameters) declared with the Unicode data types listed above, or a Unicode string literals using the N” notation. Most of the times, the questions above arise from either writing literals without the N”, or from believing that variable names must be attached the N somehow.

Let’s look at some examples:

  • SELECTing string literals
SELECT N'тест', 'тест'
тест    ????
  • Assigning (non-)Unicode values
declare @nvc1 nvarchar(50), @nvc2 nvarchar(50)
declare @vc1 varchar(50), @vc2 varchar(50)

set @nvc1 = N'тест'
set @nvc2 = 'тест'
set @vc1 = N'тест'
set @vc2 = 'тест'

print @nvc1
print @nvc2
print @vc1
print @vc2
тест
????
????
????

The Unicode value only “survives” if every assignment is Unicode-enabled.

  • Storing from C#

.Net by definition supports Unicode. If you store data into or retrieve data from a Unicode-enabled column using either SqlCommand or a typical DAL/ORM component, you should not even notice that you are dealing with Unicode rather than plain ASCII.

Just be aware that in case of SqlCommand, you better use parameterized statements and SqlParameter rather than concatenating the complete statement including parameter values. (This approach is also most common to avoid SQL injections)

Displaying Unicode data

Next problem: You successfully stored Unicode data, but SELECTing in SSMS only returns lots of question marks or boxes. The reason is simple: SSMS cannot display the characters.

But what is required actually necessary to display Unicode data:

  • a Unicode font covering the characters you want to retrieve
  • a font rendering engine as part of the operating system capable of rendering the character string correctly (especially for right-to-left languages, and some South-East Asian languages)
  • an application that makes use of both

I know from my experience that SSMS 2005 was not capable by default to display some Unicode characters (sorry, can’t remember which) even though fonts were installed. It probably had to do with the fonts defined for the editor or the result pane.

As a fact, SSMS 2008 on Windows 7 does not have any problems in the results pane, and can even display supplementary characters.

If you still have problems display Unicode characters, make sure to check whether the data is stored correctly by analyzing the numerical values of the Unicode string using the UNICODE function:

declare @t nvarchar(4000)
select @t = SomeUnicodeColumn from MyTable where ...
select unicode(substring( @t, 1, 1))
select unicode(substring( @t, 2, 2))
select unicode(substring( @t, 3, 3))
select unicode(substring( @t, 4, 4))

If the values are greater than 128 (or 256, depending on language), your data is stored as Unicode.


Unicode Versions supported in SQL Server (2000-2012)

May 30, 2012

Update to a previous post on Unicode support in SQL Server.

Which version of the Unicode Standard is supported by MS SQL Server?

SQL Server Version Unicode Version
2000 3.0
2005 3.2
2008 5.0
2008 R2 5.0
2012 5.0*

* The documentation on SQL Server 2012 does not explicitly refer to a Unicode version number. Even if there was no change, SQL Server 2012 introduced native support for supplementary characters.


Supplementary Characters in C#

May 20, 2012

Inspired by my posting about supplementary characters in SQL Server 2012, I checked the capabilities of .Net to support supplementary characters:

string s143743 = "𣅿";
Console.WriteLine(s143743.Length.ToString());
Console.WriteLine(new StringInfo(s143743).LengthInTextElements.ToString());
string s131518 = "𠆾𠇀𠇃";
Console.WriteLine(s131518.Length.ToString());
Console.WriteLine(new StringInfo(s131518).LengthInTextElements.ToString());

The results are

“𣅿” 
Length: 2
LengthInTextElements: 1
“𠆾𠇀𠇃”
Length: 6 
LengthInTextElements: 3

As you can see, the string.Length property returns the number of 16-bit words (= .Net char structure) the string contains. To handle supplementary characters correctly in .Net, you need to use the StringInfo class.


Supplementary Characters in SQL Server

May 13, 2012

I originally intended to write an update to my post Collation Support in SQL Server Versions to include coverage of SQL Server 2012 collations. I was surprised to find that the number of supported collation names (reduced by the various suffixes such as _CI, _CS, etc.) did not change from SQL Server 2008.

What did change though was that there is a new collation suffix named “_SC” which stands for Supplementary Character support.

Once you declare a database with an _SC collation, the various string functions in SQL Server are not only Unicode-aware, but also Supplementary Character-aware (NCHAR, UNICODE, LEN).

Compare these statements in SQL Server 2012

CREATE DATABASE test_sc COLLATE  Latin1_General_100_CI_AI_SC; 
-- 2012 collation

DECLARE @d nvarchar(10)  = N'𣅿' ;
SELECT NCHAR(0xD84C) + NCHAR(0xDD7F); 
       -- Old style method using surrogate characters
SELECT NCHAR(143743), 
       UNICODE(@d), NCHAR(UNICODE(@d)), LEN(@d), DATALENGTH(@d)

SET @d = N'𠆾𠇀𠇃';
SELECT @d, UNICODE(@d), NCHAR(UNICODE(@d)), LEN(@d), DATALENGTH(@d)

resulting in

𣅿

𣅿       143743    𣅿    1    4

𠆾𠇀𠇃    131518    𠆾    3    12

with the results on a non-SC database in SQL Server 2012 or earlier

𣅿

NULL     55372    �    2    4

𠆾𠇀𠇃    55360    �    6    12

Example data taken from MSDN NCHAR() and SolidQ.

(Note that Management Studio 2008 does not display the characters in the query window, but does correctly show them in the results window)

You find that the functions NCHAR, UNICODE, and LEN support characters for code-points outside the Unicode BMP, whereas non-SC collations handle supplementary characters using the surrogate characters. For a list of changes in semantics, see MSDN.


The Unicode Video

April 14, 2011

I just came across this video on YouTube showing every displayable (with some restrictions) character in the Unicode BMP. (1 character per frame)

Next, the Unicode code point of the day with Wikipedia links to corresponding alphabet or language, which of course reminds me of the series Every character has a story.

Update 11/04/18:

BabelStone comments on the video and hosts a page called Unicode 6.0 Slide Show implemented in JavaScript. Warning: Since the browser displays the Unicode characters, you need to have the required fonts installed on your machine.


Setting Console Encoding to Unicode

April 4, 2011

When you try to set console (input or output) encoding to Unicode or UTF32

Console.OutputEncoding = Encoding.Unicode;

you receive an error message

The parameter is incorrect

To set the console’s encoding, you need to retrieve stdin or stdout, create a new StreamReader with the Encoding, and assign it to the console using SetIn() or SetOut():

var s = Console.OpenStandardInput();
Console.SetIn(new StreamReader(s, inenc));

StreamWriter requires the AutoFlush property set to true:

var s = Console.OpenStandardOutput();
Console.SetOut(new StreamWriter(s, outenc) { AutoFlush = true });

Accessing MediaWiki via JSON API

April 18, 2010

In its first version, YuJisho provided a web search interface to a collection of freely available dictionaries. The obvious extension to that principle is to include other encyclopedias and online dictionaries as well.

MediaWiki wikis not only display their contents in the /wiki/ root directory, but also provide a Query API via the /w/api.php URL. This API provides results in various formats, among them JSON, which is typically used by JavaScript clients.

JavaScript code can query this API to search for article titles in a given wiki. jQuery implements the getJSON() method to asynchronously retrieve results. If more than one request is to be executed, the ajax() method has to used with the parameter mode set to ‘queue’.

Out of all available Wikimedia projects, wikipedia.org and wiktionary.org languages have been selected that are most closely related to CJK characters (Chinese, Japanese, Korean) or for which most translations exist in the data (English, German, French, Russian).

So from now on, if you search on YuJisho (for example: 東京 (Tokyo), 北京 (Beijing)), every result page will automatically perform a JavaScript search in various wikis, and provide links to the relevant wiki pages.


YuJisho: a Unicode CJK Character web dictionary

April 8, 2010

Chinese (or Japanese) characters have been fascinating me since I first learned about them in the early 90′s, and I immediately started some small programming projects dealing with this topic, among them a Kanji flash card application, one of my first Windows (3.1) programs.

Every now and again, I visited the websites of Jim Breen and Unicode, downloaded fonts, built a vocabulary trainer, and so on. One of the latest activities was an analysis of the Unicode Han Database.

There are a number of CJK dictionaries on the web, and the main objection I find with most of these websites is that you not only need to specify what you are looking for, but also need to tell the site where to look (e.g English, Japanese, Romaji, transcription method, etc).

I wanted to have a single input line with nothing else, and there should always be some kind of result.

Of course, I had to deal with performance-tuning the search algorithm, and I think it performs pretty well now.

A couple of problems I came across dealing with Far East scripts and Latin in the same SQL Server table:

When you look for a CJK character in an NVARCHAR column using the Latin1_General_CI_AS collation, the character may match any other character in that column. Switching to a collation supporting CJK, such as Chinese_PRC_90_CI_AI, solved the problem.

SQL Server 2000 did not handle surrogate pairs well with the available collation Chinese_PRC_CI_AI. According to this blog by Qingsong Yao, the collation Chinese_PRC_90_CI_AI and related collations of SQL Server 2005 solve the surrogate pair problem.

That all said, here is my online character dictionary, YuJisho. The name is a combination of the U in Unicode and the Japanese word for “dictionary”.

Any feedback is welcome ;)


Stripping Accents from Strings in C#

January 26, 2010

Unicode defines a concept called normalization (Unicode, Wikipedia) to define the equivalence of composed and decomposed representations of characters.

In .Net, the string.Normalize() method can be used to convert strings between normalization forms. If a string is in normalization form NormalizationForm.FormKD (full compatibility decomposition), the combing and modified marks are stored as separate characters, and their Unicode category can be retrieved calling the GetUnicodeCategory() method.

Thus, stripping the characters of a string from their accents, one has to perform the following steps:

  • Normalize the string into full compatibility decomposition
  • Remove the characters belonging to a “Mark” category
  • Return the result

Here is the C# code implementing this function:

using System.Text;
using System.Globalization;

public string StripAccents(string s)
{
  StringBuilder sb = new StringBuilder();
  foreach (char c in s.Normalize(NormalizationForm.FormKD))
    switch (CharUnicodeInfo.GetUnicodeCategory(c))
    {
      case UnicodeCategory.NonSpacingMark:
      case UnicodeCategory.SpacingCombiningMark:
      case UnicodeCategory.EnclosingMark:
        break;

      default:
        sb.Append(c);
        break;
    }
  return sb.ToString();
}

Follow

Get every new post delivered to your Inbox.