This is a follow-up article on my previous post about string equality and collations.
We know, simply from looking at them, that the characters
'አ' are different, but the collation you use may treat them as equal.
Well, we can still compare their code point value by using the UNICODE() function, as in
select unicode(N'ܐ'), unicode(N'አ')
returning 1808 and 4768.
The reason I write this is because I discovered a fallacy in a comment on SO, resulting from mixing Unicode and non-Unicode literals and functions.
Take the statement
select ascii('ܐ'), ascii('አ')
Note that the Unicode characters are not given as Unicode strings (N” notation), but as non-Unicode strings.
Since both characters cannot be mapped onto Latin ASCII (or whatever your collation is), they are replaced by a Replacement Character, which is the question mark ‘?’ in ASCII.
Wikipedia tells us
The question mark character is also often used in place of missing or unknown data. In Unicode, it is encoded at U+003F ? question mark (HTML:
In many web browsers and other computer programs, “?” is used to show a character not found in the program’s character set. […] Some fonts will instead use the Unicode Replacement Glyph (U+FFFD, �), which is commonly rendered as a white question mark in a black diamond (see replacement character).
So we can see where the question mark comes from, and thus both functions return 63.
In a similar, but nonetheless different case
select ascii(N'ܐ'), ascii(N'አ')
the characters are defined as Unicode strings, but passed to a function that only accepts non-Unicode strings. In this case, the mapping according to the current collation is performed by the ASCII() function, again resulting in the value 63.
As for the Unicode Replacement Character, you’ll encounter them if you decode a byte array to Unicode, and the decoder encounters a byte sequence that cannot be converted to Unicode given the selected encoding.