I came across a (MSSQL) database that almost only contained VARCHAR columns, even though the target audience was expected to come from all over the European Union.
In my projects, I am used to define NVARCHAR columns for all values that can be written by the users, and use VARCHAR only for strings that are known to be ASCII, such as program identifiers, URLs, or email addresses.
I tried to show the disadvantage of using VARCHAR by inputing accented characters, expecting a default Western collation, only to fail insofar as that the characters were displayed correctly. I found later that the collation had been explicitly set to a character set support these accented characters.
I prepared more thoroughly for the next time, and wrote a little TSQL script to illustrate my point:
create table #t ( id int identity, sA varchar(100), -- insert varchar sN varchar(100), -- insert nvarchar (result same as varchar) sC varchar(100) collate Croatian_CI_AS, -- insert nvarchar n nvarchar(100) ) declare @i int declare @s varchar(100) -- implicit collation of database declare @n nvarchar(100) set @s = '' set @n = '' set @i = 32 while @i < 1000 begin set @s = @s + nchar(@i) set @n = @n + nchar(@i) set @i = @i + 1 if (len(@s)=32) begin insert into #t (sA, n, sN, sC) values (@s, @n, @n, @n) print @s print @n set @s = '' set @n = '' end end select id as Row, sA as DBCollated, sC as HRCollated, n as Unicoded from #t order by id drop table #t
Table #T contains a couple of text columns:
- sA varchar with database collation is set by a varchar variable
- sN varchar with database collation is set by an nvarchar variable
- sC varchar with explicit collation is also set by an nvarchar variable
- n nvarchar is set by an nvarchar variable
In the loop, I construct strings with 32 Unicode characters, ranging from code point 32 to code point 1000 (that’s just arbitrary codes), and insert the same value into the 4 columns.
The findings of this experiment are:
VARCHAR variables have the same collation as the database. You cannot declare a COLLATE clause in the variable definition. Thus sA and sN contain the same values, as would be expected by mapping Unicode characters onto a code page.
The resulting table looks like this (only 16 characters per row for better readability):
VARCHAR Latin1_General_CI_AS | VARCHAR Croatian_CI_AS | NVARCHAR | |
Database Collation | Explicit Collation | Unicode | |
1 | !”#$%&'()*+,-./ | !”#$%&'()*+,-./ | !”#$%&'()*+,-./ |
2 | 0123456789:;<=>? | 0123456789:;<=>? | 0123456789:;<=>? |
3 | @ABCDEFGHIJKLMNO | @ABCDEFGHIJKLMNO | @ABCDEFGHIJKLMNO |
4 | PQRSTUVWXYZ[\]^_ | PQRSTUVWXYZ[\]^_ | PQRSTUVWXYZ[\]^_ |
5 | `abcdefghijklmno | `abcdefghijklmno | `abcdefghijklmno |
6 | pqrstuvwxyz{|}~ | pqrstuvwxyz{|}~ | pqrstuvwxyz{|}~ |
7 | ????????????? | ????????????? | |
8 | ?????????????? | ?????????????? | |
9 | ¡¢£¤¥¦§¨©ª«¬®¯ | !cL¤Y¦§¨©a«¬®— | ¡¢£¤¥¦§¨©ª«¬®¯ |
10 | °±²³´µ¶·¸¹º»¼½¾¿ | °±23´µ¶·¸1o»113? | °±²³´µ¶·¸¹º»¼½¾¿ |
11 | ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ | AÁÂAÄAAÇEÉEËIÍÎI | ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ |
12 | ÐÑÒÓÔÕÖרÙÚÛÜÝÞß | ?NOÓÔOÖ×OUÚUÜÝ?ß | ÐÑÒÓÔÕÖרÙÚÛÜÝÞß |
13 | àáâãäåæçèéêëìíîï | aáâaäaaçeéeëiíîi | àáâãäåæçèéêëìíîï |
14 | ðñòóôõö÷øùúûüýþÿ | ?noóôoö÷ouúuüý?y | ðñòóôõö÷øùúûüýþÿ |
15 | AaAaAaCcCcCcCcDd | AaĂ㥹ĆćCcCcČčĎď | ĀāĂ㥹ĆćĈĉĊċČčĎď |
16 | ÐdEeEeEeEeEeGgGg | ĐđEeEeEeĘęĚěGgGg | ĐđĒēĔĕĖėĘęĚěĜĝĞğ |
17 | GgGgHhHhIiIiIiIi | GgGgHhHhIiIiIiIi | ĠġĢģĤĥĦħĨĩĪīĬĭĮį |
18 | Ii??JjKk?LlLlLl? | Ii??JjKk?ĹĺLlĽľ? | İıIJijĴĵĶķĸĹĺĻļĽľĿ |
19 | ?LlNnNnNn???OoOo | ?ŁłŃńNnŇň???OoOo | ŀŁłŃńŅņŇňʼnŊŋŌōŎŏ |
20 | OoŒœRrRrRrSsSsSs | ŐőOoŔŕRrŘřŚśSsŞş | ŐőŒœŔŕŖŗŘřŚśŜŝŞş |
21 | ŠšTtTtTtUuUuUuUu | ŠšŢţŤťTtUuUuUuŮů | ŠšŢţŤťŦŧŨũŪūŬŭŮů |
22 | UuUuWwYyŸZzZzŽž? | ŰűUuWwYyYŹźŻżŽž? | ŰűŲųŴŵŶŷŸŹźŻżŽžſ |
23 | b????????Ð?????? | b????????Đ?????? | ƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏ |
24 | ?ƒƒ????I??l????O | ?Ff????I??l????O | ƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟ |
25 | Oo?????????t??TU | Oo?????????t??TU | ƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯ |
26 | u?????z????????? | u?????z????????? | ưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿ |
27 | |??!?????????AaI | |??!?????????AaI | ǀǁǂǃDŽDždžLJLjljNJNjnjǍǎǏ |
28 | iOoUuUuUuUuUu?Aa | iOoUuUuUuUuUu?Aa | ǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟ |
29 | ????GgGgKkOoOo?? | ????GgGgKkOoOo?? | ǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯ |
30 | j??????????????? | j??????????????? | ǰDZDzdzǴǵǶǷǸǹǺǻǼǽǾǿ |
31 | ???????????????? | ???????????????? | ȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏ |
32 | ???????????????? | ???????????????? | ȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟ |
33 | ???????????????? | ???????????????? | ȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯ |
34 | ???????????????? | ???????????????? | ȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿ |
35 | ???????????????? | ???????????????? | ɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏ |
36 | ???????????????? | ???????????????? | ɐɑɒɓɔɕɖɗɘəɚɛɜɝɞɟ |
37 | ?g?????????????? | ?g?????????????? | ɠɡɢɣɤɥɦɧɨɩɪɫɬɭɮɯ |
38 | ???????????????? | ???????????????? | ɰɱɲɳɴɵɶɷɸɹɺɻɼɽɾɿ |
39 | ???????????????? | ???????????????? | ʀʁʂʃʄʅʆʇʈʉʊʋʌʍʎʏ |
40 | ???????????????? | ???????????????? | ʐʑʒʓʔʕʖʗʘʙʚʛʜʝʞʟ |
41 | ???????????????? | ???????????????? | ʠʡʢʣʤʥʦʧʨʩʪʫʬʭʮʯ |
42 | ?????????'”?’??? | ?????????'”‘’??? | ʰʱʲʳʴʵʶʷʸʹʺʻʼʽʾʿ |
43 | ????^?ˆ?’¯´`?_?? | ????^?^ˇ’Ż´`?_?? | ˀˁ˂˃˄˅ˆˇˈˉˊˋˌˍˎˏ |
44 | ??????????°?˜??? | ????????˘˙°˛~˝?? | ːˑ˒˓˔˕˖˗˘˙˚˛˜˝˞˟ |
45 | ???????????????? | ???????????????? | ˠˡˢˣˤ˥˦˧˨˩˪˫ˬ˭ˮ˯ |
46 | ???????????????? | ???????????????? | ˰˱˲˳˴˵˶˷˸˹˺˻˼˽˾˿ |
47 | `´^~¯¯??¨?°???”? | `´^~ŻŻ˘˙¨?°?ˇ?”? | ̀́̂̃̄̅̆̇̈̉̊̋̌̍̎̏ |
48 | ???????????????? | ???????????????? | ̛̖̗̘̙̜̝̞̟̐̑̒̓̔̕̚ |
49 | ???????¸???????? | ???????¸???????? | ̡̢̧̨̠̣̤̥̦̩̪̫̬̭̮̯ |
50 | ?__????????????? | ?__????????????? | ̴̵̶̷̸̰̱̲̳̹̺̻̼̽̾̿ |
51 | ???????????????? | ???????????????? | ͇͈͉͍͎̀́͂̓̈́͆͊͋͌ͅ͏ |
52 | ???????????????? | ???????????????? | ͓͔͕͖͙͚͐͑͒͗͛͘͜͟͝͞ |
53 | ???????????????? | ???????????????? | ͣͤͥͦͧͨͩͪͫͬͭͮͯ͢͠͡ |
54 | ??????????????;? | ??????????????;? | ͰͱͲͳʹ͵Ͷͷͺͻͼͽ;Ϳ |
55 | ???????????????? | ???????????????? | ΄΅Ά·ΈΉΊΌΎΏ |
56 | ???G????T??????? | ???????????????? | ΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟ |
57 | ???S??F??O?????? | ???????????????? | ΠΡΣΤΥΦΧΨΩΪΫάέήί |
58 | ?aß?de??????µ??? | ??ß?????????µ??? | ΰαβγδεζηθικλμνξο |
59 | p??st?f????????? | ???????????????? | πρςστυφχψωϊϋόύώϏ |
60 | ???????????????? | ???????????????? | ϐϑϒϓϔϕϖϗϘϙϚϛϜϝϞϟ |
One can easily see that the original Unicode values are translated into VARCHAR values according to the COLLATION setting. Some characters not contained in the collation are mapped onto characters without accents, others are simply replaced by question marks.
Use this code if you need to explain the differences between VARCHAR and NVARCHAR, and why using NVARCHAR is not a bad idea.
Pingback: Bad (MS SQL Server) habits to kick « devioblog
If your business rules dictate only ASCII characters, varchar has the advantage of taking less space and being more efficient. When we converted from nvarchar to varchar our data processing applications had a decrease in processing time of over 30%.
Pingback: Length of UTF-8 VARCHAR in SQL Server | devioblog