Different Strings considered Equal – Depending on your Collation

June 3, 2013

I recently came across this interesting question on SO:

Why does the statement

select case when  N'ܐܪܡܝܐ' = N'አማርኛ' then 1 else 0 end

return 1?

In case you are wondering, the first string seems to be Syriac/Aramaic for Aramaic, and the second string is Amharic for Amharic, as found on Wikipedia.

Obviously, the strings are not equal, but still the result of the comparison is true. But why?

The reason is, of course, that string comparison and sorting in SQL Server depend on the collation in use. And the collation used in the example was SQL_Latin1_General_CP1_CI_AS, which does not distinguish Aramaic and Amharic characters.

If you just want to test whether two strings are equal as in “equal Code Points” or “binary equal”, you need to select a specific collation which distinguishes the code points, such as Latin1_General_BIN:

select case 
    when N'ܐܪܡܝܐ' COLLATE Latin1_General_BIN = N'አማርኛ' COLLATE Latin1_General_BIN 
    then 1 else 0 end

So how can we analyze how collations affect comparison?

First, I created two tables: One stores records about character groups, the second stores the characters in each group that are considered equal by the collation:

CREATE TABLE [dbo].[CollationString](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Collation] [varchar](100) NOT NULL,
    [String] [nvarchar](50) NOT NULL,
    [StringInt] [int] NULL,
 CONSTRAINT [PK_CollationString] PRIMARY KEY CLUSTERED ([Id] ASC))
GO

CREATE TABLE [dbo].[CollationStringEquivalent](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [CollationStringId] [int] NOT NULL,
    [String] [nvarchar](50) NOT NULL,
    [StringInt] [int] NULL,
 CONSTRAINT [PK_CollationStringEquivalent] PRIMARY KEY CLUSTERED ([Id] ASC))
GO

I added a field Collation so that different collations can be compared later on.

Next, we iterate through all characters in the Unicode BMP, look up the character, and insert according to the result:

set nocount on 

declare @c int = 0
declare @coll nvarchar(50) = 'Latin1_General_CI_AS'
declare @cid int

while @c < 65536 begin

    print @c

    set @cid = null;
    select @cid = id 
        from collationstring 
        where collation collate Latin1_General_CI_AS 
            = @coll collate Latin1_General_CI_AS 
        and string = nchar(@c);
    
    if @cid is null begin
        insert into collationstring(collation, string, stringint) 
            values (@coll, NCHAR(@c), @c);
        select @cid = SCOPE_IDENTITY()
    end
        
    insert into collationstringequivalent(collationstringid, string, stringint) 
        values (@cid, NCHAR(@c), @c);

    set @c = @c + 1
end

After running this script, we can now query which characters are considered different

SELECT  * FROM [CollationString]

and which characters are equal

SELECT  * 
FROM [CollationStringEquivalent]
ORDER BY [CollationStringId], [StringInt]

To analyze the distribution of characters in one collation, we can start with a query like this

SELECT COUNT([Id]), [CollationStringId], MIN([String]), MIN([StringInt])
FROM [CollationStringEquivalent]
GROUP BY collationstringid
ORDER BY COUNT(id) DESC, MIN(stringint)

 


Taking a Database Offline in SQL Server Management Studio

June 2, 2013

To take a database offline in SSMS, simply right-click the database, select Tasks, Take Offline, and you’re done.

Except, if it does not work because there are still open connection to the database.

In this case, you need to find out the connections’ SPID

SELECT SPId FROM master..SysProcesses 
WHERE DBId = DB_ID('MyDatabase') AND SPId <> @@SPID

and kill each of the connections by executing

KILL [spid from query above]

Finally, you can take the database offline from its context menu, or, as you already work in a SSMS Query window, simply type

ALTER DATABASE [MyDatabase] SET OFFLINE WITH ROLLBACK IMMEDIATE

I just found this script today, so I’m afraid I cannot give you any sources of this magic wisdom.

I did however find an older post here dealing with SSMS Express Edition.


SMOscript and Change Tracking

May 7, 2013

SMOscript user Robert pointed out that SMOscript did not support the Change Tracking option of the ScriptingOptions parameter class.

Indeed, of the new Change Tracking features found in SQL Server 2008 and higher, SMO only supports scripting Change Tracking, but does not generate the commands to script Change Data Capture.

This seems to be a known omission of SMO, and it will not be implemented any time soon:

We took a look at this DCR (this is not defect but actually DCR to add suport of SMO to CDC) along with several others. Unfortunately based on current customers votes count we decided not to proceed with this DCR in the next release. However, we have taken note of this internally, and when we revisit this functionality in the future, we will try and get this implemented.

which is kind of strange considering you use SMO to re-create an existing database schema in a new schema, and maybe your T-SQL business logic relies on objects created by CDC.

Nonetheless, SMOscript 0.20 now also create the ENABLE CHANGE_TRACKING if used with the -ct switch.

As usual, the latest version of SMOscript is available for download here.


Change Tracking and Change Data Capture

May 7, 2013

SQL Server 2008 introduced 2 features that allow tracking changes to table data, namely Change Tracking and Change Data Capture.

Change Tracking is enabled on database level using the statement

ALTER DATABASE [mydb]
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON)

and on table level using

ALTER TABLE [myTable]
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON)

To query the server which databases are actually tracking-enabled, use the following query

SELECT * FROM [sys].[change_tracking_databases]

The list of tracking-enabled tables can be retrieved by this statement

SELECT * FROM [sys].[change_tracking_tables]

Change Data Capture is enabled using the stored procedure sys.sp_cdc_enable_on:

EXEC sys.sp_cdc_enable_db

If your database editition does not support CDC, you will receive the error:

Msg 22988, Level 16, State 1, Procedure sp_cdc_enable_db, Line 12
This instance of SQL Server is the [***] Edition. Change data capture is only available in the Enterprise, Developer, and Enterprise Evaluation editions.

CDC adds the cdc schema and other database objects

When a database is enabled for change data capture, the cdc schema, cdc user, metadata tables, and other system objects are created for the database.

so the easiest method to detect whether CDC is enabled is to check for the [cdc] schema.

A table can be enabled for CDC using the SP sys.sp_cdc_enabled_table:

EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name   = N'MyTable',
@role_name     = N'MyRole',
@filegroup_name = N'MyDB_CT',
@supports_net_changes = 1

The parameter @role_name must be provided, but can be NULL. The following parameters are optional.

To find out which tables are CDC-enabled, use the SQL statement

SELECT * FROM [cdc].[change_tables]

On Converting Data

May 2, 2013

I had to analyze SQL Server Database Projects (available from SQL Server Data Tools for Visual Studio), and these projects offer a menu item “Create Snapshot” which creates a snapshot file with the extension .dacpac.

It turns out that a .dacpac is a zipped XML file (plus some other files) containing a structured representation of the database objects defined in the project. However Visual Studio does not provide a way to display them (double-clicking the file will only display binary data).

So I thought about how to best display the contents of a .dacpac? Two methods came to my mind.

First, inspired by my work on wpxslgui, create an XSLT style sheet which transforms the contents of the XML file to some legible text, for example CREATE TABLE and similar TSQL statements.

Intuitively I called this approach Symbolic Transformation.

Symbolic Transformation (e.g. XSLT)
Representation 1 => Representation 2

On the other hand, a Logical Transformation contains one module parsing the information contained in “Representation 1″ into some kind of model, and another module creating the “Representation 2″ of that model. The two module can be  implementations of the Interpreter pattern and the Builder or Factory patterns, respectively.

Logical Transformation (simple)
Interpreter Builder
Representation 1 => Model => Representation 2

If we take Representation 1 and Representation 2 as two separate interfaces to the same business model, and want to support two-way operations, we can extend the last table like this:

Logical Transformation (extended)
Interpr.
Builder
Converter Converter Interpr.
Builder
Repr. 1 => Model 1 => Business Model => Model 2 => Repr. 2
<= <= <= <=

Why do we need to have Model 1 and Model 2, as they seem to make the whole thing even more complex?

Let’s have a look at a simple CREATE TABLE statement and some of their representations:

  • a SQL parser (think: ANTLR) uses the representation given by the SQL parser
  • the SQL Server catalog views sys.tables, sys.columns, etc. are a different representation
  • a .dacpac archive is another representation

To keep our code simple, our Model X class structure should be as close to the representation as 1) possible 2) necessary (thinking about proxy classes generated by xsd.exe, ANTLR, and ORMs).

Thus, a common data model (named Business Model in the table) is required, as well as 2-way conversion between the Business Model and each of the other models.


Copying Hierarchical Data using INSERT+MERGE+OUTPUT

April 19, 2013

Suppose you have a hierarchical data structure consisting of the tables

  • parent (Id)
  • children (Id, ParentId => parent)
  • grandchildren (Id, ParentId => children)

and you want to copy a parent record and all its children and grand-children.

You start out with

INSERT INTO parent (non-identity fields)
SELECT [non-identity fields]
FROM parent WHERE Id = @id
SELECT @newid = SCOPE_IDENTITY()

Next you want to copy the child records, and sketch something like this

INSERT INTO children (ParentId, [fields except identity])
SELECT @newid, [fields except identity]
FROM children WHERE ParentId = @id

That would work by itself, but not if you want to copy the grand-children, as you need not only the set of the newly generated identities, but also how they map to the original IDs.

So you try to add an OUTPUT clause

INSERT INTO children (ParentId, [fields except identity])
OUTPUT INSERTED.Id, [we need the original id here]
SELECT @newid, [fields except identity]
FROM children WHERE ParentId = @id

and you find that you can only have column names of the columns of the INSERT operation, but the original ChildId is not part of the Insert!

Fortunately I found help on this blog which suggests using MERGE OUTPUT for the child table:

  • We declare the data to be selected (and inserted) as sub-select statement of the MERGE USING clause.
  • The MERGE condition 0=1 means that records are inserted only.
  • The OUTPUT clause can refer to source AND target columns

The result looks like this:

MERGE Children
USING (
    SELECT @newid AS [NewId], [fields to be inserted], Id
    FROM Children
    WHERE ParentId = @id
) AS original
ON 0=1
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ParentId, [fields to be inserted])
    VALUES ([NewId], [fields to be inserted])
OUTPUT inserted.Id, original.Id INTO @map

@map is a table variable containing the mappings of old and new IDs:

DECLARE @map TABLE (
    Id INT,
    OriginalId INT
)

or whatever your Id datatype is.

The grand-children records can simply be copies using an INSERT statement, if no further detail tables exist:

INSERT INTO GrandChildren (ParentId, [fields to be inserted])
SELECT map.Id, [fields to be inserted]
FROM GrandChildren
INNER JOIN @map map ON GrandChildren.ParentId = map.OriginalId

If you have more than 3 levels to copy, you need one mapping table and one MERGE statement for each additional level.


T-SQL ‘GO’ and SqlCommand

April 9, 2013

Try to execute a T-SQL script which contains several SQL batches separated by the GO keyword using .Net’s SqlCommand, and you will receive an error message

Incorrect syntax near ‘GO’.

What’s going on here? Why does it work from SSMS or sqlcmd, but not in SqlCommand?

The reason for this behavior is that ‘GO’ is not a statement of T-SQL, but rather

a command recognized by the sqlcmd and osql utilities and SQL Server Management Studio Code editor

(MSDN). You think you might simply remove all occurrences of GO inside your statement, but that will not always work: Some statements require to be the first statement in a batch

‘CREATE/ALTER PROCEDURE’ must be the first statement in a query batch.

So how do you execute a set of SQL statements containing ‘GO’ in C#?

The “lazy” way is to simply split the SQL string at each occurrence of ‘GO’ using string.Split(), but this is really lazy.

Why? Because of the complexity of the language that the T-SQL shells are capable of processing:

  • string literals ” and N”
  • single line comments –
  • multi-line comments /* */
  • nested multi-line comments (!)

The characters ‘GO’ must not be recognized if they occur inside any of these constructs, and they need not be the first characters in a line – the line may also contain spaces or whitespace or comments (just give it a try in SSMS, it’s truly amazing).

The more practical way (mentioned here) to execute an arbitrary SQL string is to use SMO’s Server.ConnectionContext.ExecuteNonQuery().

Or, you write a parser using ANTLR, define the correct grammar to split at the relevant GO keywords, and execute the list of resulting SQL batches separately. ;)


OpenStreetMap for MS SQL Server

March 19, 2013

I previously wrote about importing OpenStreetMap data into MS SQL Server, and accidentally came across another project dealing with this topic.

osm2mssql not only imports data from OSM files, but also uses the SQL Server Geography types to encode the imported data. osm2mssql is hosted on CodePlex.


SQL Server Wishlist

March 4, 2013

Occasionally I write about my (and others’) wishes for future versions of C#.

Recently I came across the website of Erland Sommarskog, and besides great articles about various aspects of SQL Server, he also maintains a wishlist for SQL Server.

Enjoy ;)


Restoring a SQL Server Database Backup Created by a Newer Version of SQL Server

March 1, 2013

It is widely known that you can only restore a database backup using the same or a newer version of SQL Server.

There is however a workaround to this restriction. The workaround consists of scripting database and contents and importing the scripts using sqlcmd.

Here’s my scenario:

A database is being developed on SQL 2008 [A], and goes live on SQL 2008 R2 [B]. There is no problem copying the database from one system [A] to the other [B].

Then an error occurs, or something must be debugged or analysed, but developers only have SQL 2008 installed.

A newer version is available [C], but not on the develop machines.

So the SQL Server versions can be summarized as: [A] < [B] <= [C]

And this is the workaround:

  • Create a full database backup on [B]
  • Restore the backup in a new database on [C]
  • In SSMS, locate the new database, and right-click
  • Tasks/Generate Scripts

First, we create the DDL script

  • Save scripts to a specific location
  • Save to file: Single file, Save As: Unicode text

Advanced:

  • Generate Script for Dependent Objects: True
  • Schema qualify object names: True
  • Script Collations: if you need them
  • Script Defaults: True
  • Script CREATE
  • Script for Server Version: select version for [A]
  • Script Logins, Owner, Permissions, etc: as required
  • Types of data to script: Schema only
  • Script Check Constraints, Foreign Keys, Full-Text Indexes, Indexes, Primary Keys, Triggers, Unique Keys: True

Next, Next until DDL file is being generated

Next, we create the Data script. Same steps as above, but

  • Types of data to script: Data only

This may take a while…

In the meantime, we can edit the original DDL file. Open the file (I’ll refer to it a script.sql) in a Unicode-capable text editor.

Since we want to create the database on a dev machine, I tend to skip the DDL statements creating the database:

  • Starting from USE [master] up to the [USER <mydatabase>]. Depending on the contents of the database, you may also want to skip the CREATE USER and CREATE ROLE statements.
  • Eventually, you’ll find the first CREATE (PROCEDURE, TABLE, VIEW, etc.) statement. This is where is gets interesting for us.
  • This CREATE section lasts until you find the first CREATE INDEX or ALTER (TABLE) statement.
  • Copy this section to a new file, as save it (say, script1.sql).
  • In script.sql, select from the first CREATE INDEX or ALTER statement down to the last statement before USE [master]. Save this section as script2.sql.

We end up having these files:

  • script.sql .. the original Schema-only file
  • script1.sql .. the CREATE statements
  • script2.sql .. the indexes and constraints
  • data.sql .. the original Data-only file

The DDL scripts are typically not so large, and can be executed from inside SSMS. The Data file may be more problematic, as even a small database can generate a huge (several GB!) data SQL file.

If SSMS is capable of handling all the SQL files, the correct sequence is:

  • script1.sql .. the CREATE statements
  • data.sql .. the original Data-only file
  • script2.sql .. the indexes and constraints

Why do we split the original Schema-only file? Well, the schema defines constraint, such as foreign keys. Typically, the generation of the data script does not take care of such foreign key constraints. That’s why we apply the constraints only after adding the data.

What to do if the data file is too big for SSMS? We’ll run it from the command line:

Open the command prompt (Win+Run, cmd.exe) and try to connect to your SQL Server [A] using sqlcmd. Type sqlcmd -? to get the list of options.

If you have Windows authentication enabled and SQL Server runs on your local machine, the statement

sqlcmd -d [mydatabase] -Q "SELECT 1"

verifies you can connect to SQL Server. If successful, the expected result will be displayed:

-----------
          1

(1 rows affected)

Use the -i command line switch to declare the data.sql as input file:

sqlcmd [all other switches] -i c:\path\to\data.sql

Again, this command may be busy for quite some time. If you have no idea what is going on, and really really want to know, use this script to watch the inserts from SSMS.


Follow

Get every new post delivered to your Inbox.