Importing mysqldump into MS SQL Server

To import data from MySQL into a MS SQL database, we need to first analyze the output of mysqldump, the backup utility of MySQL. I have covered the most obvious differences to T-SQL in my previous post.

Fortunately, the output of mysqldump is line-oriented, i.e. every CREATE TABLE statement, table column or index, and INSERT statement is in a separate line.

In a brute-force approach, we can therefore parse each line using regular expressions, and do not require full parsing involving tokenizer and grammar.

I found a couple of interesting points that need to be considered when generating T-SQL statements:

  • As T-SQL did not support “DROP IF EXISTS” before SQL Server 2016, we need to provide the classical variant “IF OBJECT_ID() IS NOT NULL”
  • NVARCHAR(MAX) columns cannot be indexed
  • BINARY data needs to be 0x-encoded
  • UNIQUE INDEXes should not include NULLable columns
  • Multi-row INSERT INTO statements cannot contain more than 1000 rows
  • ‘0000-00-00’ is a valid date in MySQL, but not in MSSQL. The value is being replaced by ‘1800-01-01’ to avoid the error message

    The conversion of a varchar data type to a datetime data type resulted in an out-of-range value.

Some indexes may require manually editing column collations, e.g. for unique case-sensitive MySQL indexes.

When executing T-SQL statements in SSMS, a number of warnings or errors can occur:

  • A warning that indexed data is supported to a maximum length of 960 bytes
  • Inserting data may result in the error “String or binary data would be truncated.”
  • Some files are too big to execute in SSMS, and need to be executed using sqlcmd
  • A batch with too many too large INSERT INTO statements may raise the error

    There is insufficient system memory in resource pool ‘default’ to run this query

    so we break up the batch using “GO”

  • Multiple cascade paths

MS SQL Server does not support ON DELETE CASCADE clauses such that the grandchild record of a parent-child-grandchild relation cannot be uniquely identified. Creating such a FOREIGN KEY raises the error message

Msg 1785, Level 16, State 0, Line 1
Introducing FOREIGN KEY constraint ‘fk_xxx’ on table ‘xxx’ may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.

In such cases, either drop the ON DELETE CASCADE if possible, and replace the deletion cascade with an INSTEAD OF DELETE trigger.

I developed mysqldump2mssql, a prototype implementing these rules, which is available on my GitHub. The program has been developed and tested against MySQL dumps of MediaWiki, Joomla, Bugzilla and Nextcloud.

The tool generates up to 4 files for each table defined in the source SQL file:

  • create.sql: the CREATE TABLE statement, optional CREATE UNIQUE INDEX statements
  • data.sql: the INSERT INTO statements
  • fk.sql: the FOREIGN KEY relations to other tables
  • fulltext.sql: the CREATE FULLTEXT statement

The filenames of the generated files consist of the sequence number of execution, the table name, and one of the endings mentioned in above table.

Parsing mysqldump files

MySQL and MS SQLServer both can be accessed using SQL, but if you look closely at both SQL dialects, you’ll find a lot of differences.

Backups of MySQL databases are created using the utility mysqldump, which simply dumps all the SQL statements required to recreate the database into a SQL file, as opposed to SQLServer backup, which creates a binary file representing the database contents.

As we are interesting in parsing the SQL and converting it into valid T-SQL statements, we have to look at the differences between the SQL dialects. Fortunately, since the SQL files are output by a tool, the generated statements are fairly regular.

Here are some special features of the MySQL dialect, which need to be handled for conversion into T-SQL:

  • Identifiers are quoted with a backtick `, rather than square brackets
  • IDENTITY columns are marked by the auto_increment keyword
  • Binary columns may store text values
  • Text values may contain character codes with leading backslash \, as known from C# or C
  • Values for DATETIME, DATE and TIME columns need not be quoted as strings
  • ‘0000-00-00’ is a valid DATE
  • A CREATE TABLE statement may include the definition of indexes, unique indexes and fulltext indexes
  • Both SQL engines provide support for collations, but they are completely different

I have previously written about converting MySQL SQL into T-SQL manually, but having a tool perform the repetitive actions would be very nice 😉

SMOscript 0.50

I added a couple of new options to my scripting tool SMOscript:

  • The new -td command-line switch generates two files for tables: a [table].create.sql file containing the CREATE TABLE statement, and a [table].details.sql file containing the CREATE statements all table details (foreign keys, indexes, triggers), thus allowing recreating of a database schema without referential integrity problems
  • The -o (object name) switch semantics has been extended to other commands than script
  • The new -ol command-line switch implements a filter on object names similar to T-SQL LIKE. The filter is applied on the object name, and if not matching, on the combination “schema.objectname”.

The new command dd operates on data diagrams (see my previous post on SQL Server Data Diagrams):

  • Using dd without parameters will list all data diagrams stored in the database.
  • Using dd -o [diagram name] will list all tables contained in the data diagram, even tables that have been DROPped.
  • Using dd -o [diagram name] -dt lists the DROPped tables only.

The latest version of SMOscript is available for download on my download website.

Accessing MS Access Databases from SQL Server

I recently installed WinTV 8.5, and after scanning TV channels, I noticed that editing channel numbers was incredibly cumbersome.

According to the FAQ, it would involve a lot of scrolling, and 4 clicks – *per channel*. I also noticed a couple other peculiarities in the UI, such as the EPG viewer, which made me research where the program’s data is stored.

Actually it’s all in the FAQ: it’s an MS Access database called hcwChanDB_5.mdb residing in a directory set during installation.

To save your channel database go to:

C:\Users\Public\WinTV\Channel Database (Windows 7 and Vista)

C:\Documents and Settings\All Users\Shared Documents\WinTV\Channel Database (Windows XP)

Copy the  hcwChanDB_5 file to a place  for safe keeping. To replace saved Channels just copy HcwChanDB_5 back to Channel Database Dir.

When I double-click the .mdb file, a nice reminder shows up that I only have a Test edition of MS Access, and not the fully activated Professional edition. Essentially this means that I can view the data, but cannot change it.

Time to start up SSMS, link the Access database and browse the data from SSMS. … which turned out to be not so easy.

Linked Server to MS Access via ODBC System DSN

WinTV creates an ODBC System DSN called HCW_mdb, and you can easily create a linked server in SSMS referencing this DSN. Running

exec sp_catalogs 'hcw'

on the linked server returns the path names of the MS Access databases, but using them as catalog names in a SELECT * FROM statement only raises an error message

OLE DB provider “MSDASQL” for linked server “hcw” returned message “[Microsoft][ODBC Microsoft Access Driver] Syntaxfehler (fehlender Operator) in Abfrageausdruck ‘`Tbl1002`.`id` `Col1005`’.”.
Msg 7321, Level 16, State 2, Line 4
An error occurred while preparing the query “SELECT `Tbl1002`.`id` `Col1005`,`Tbl1002`.`service_id` `Col1006`,`Tbl1002`.`preferred_name` `Col1007`,`Tbl1002`.`preferred_number` `Col1008`,`Tbl1002`.`description` `Col1009`,`Tbl1002`.`audio_lang` `Col1010`,`Tbl1002`.`subtitle_lang` `Col1011`,`Tbl1002`.`audio_mode` `Col1012`,`Tbl1002`.`irblaster_data` `Col1013`,`Tbl1002`.`channel_type` `Col1014`,`Tbl1002`.`language` `Col1015`,`Tbl1002`.`genre` `Col1016`,`Tbl1002`.`duplicate_channel_id` `Col1017`,`Tbl1002`.`duplicate_priority` `Col1018`,`Tbl1002`.`blocked_flag` `Col1019`,`Tbl1002`.`last_used` `Col1020`,`Tbl1002`.`epg_source` `Col1021`,`Tbl1002`.`preferred_minor` `Col1003` FROM `D:\data\WinTV\Channel Database\hcwChanDB_5`.`hcwChannels` `Tbl1002`” for execution against OLE DB provider “MSDASQL” for linked server “hcw”.

However, you can query the database using OPENQUERY()

select * from openquery(hcw, 'select * from hcwChannels')

where hcwChannels is one of the tables in the MS Access database.

Linked Server to MS Access via OLEDB provider

I kept searching, and fortunately found a forum post on social.MSDN in which the last comment provided a hint to another workable solution:

Note: The same issue is present using Sql Server 2012 RC0, and still forcing the use of the deprecated MSDASQL provider (ms ole db provider for odbc drivers) as a workaround.

Update: besides using msdasql, it works, in some cases, to do the following before creating the linked server:
right-click and select “properties” for “Microsoft.ACE.OLEDB.12.0” under Server objects, Linked servers, Providers in SSMS.  Check “nested queries” and “allow inprocess.”   Then create the linked server for access accdb using the ACE.OLEDB.12.0 provider, with the access filename under data source.

So I checked the OLEDB provider settings and changed them accordingly. Then I created a linked server using Microsoft.ACE.OLEDB.12.0 referencing the hcwChanDB_5.mdb database.

While I could not enumerate the catalogs of the new linked server, the database could now be queried using standard SELECT statements such as

select * from hcwole...hcwChannels

I could now create a SYNONYM for each table inside the MS Access database.

See my WinTV repository on GitHub for the SQL files.

 

Anonymizing Personal Data

Databases used to collect all the data ever entered, and running out of disk space was solved by adding some new storage device.

Then came GDPR and companies storing personal data (or personally identifyable data) suddenly had to consider Data Deletion Policies and Data Retention Policies.

The easiest way is of course to delete the records storing obsolete personal data. But sometimes (read: almost always) it is not possible to delete the record, because the mere existence of a record may be important for historical or statistical reasons (e.g. the query “number of purchases per time interval and region” would almost certainly refer to a person and their location).

So it’s more realistic that certain data of a person is kept, some is erased, and the personal identifiers (first name, last name) are simply anonymized or pseudonymized.

What would be easier to replace the name fields with an arbitrary combination of letters?

To make the pseudonyms more beautiful, it would be nice if we considered a typical syllable structure, the most basic would be C-V-C (consonant vowel consonant), or any extension thereof: C-V-C-V-C, C-V-C-C-V, etc.

So we define the letters for consonants and vowels as

DECLARE @cons NVARCHAR(100) SET @cons = 'bcdfghjklmnpqrstvwxyz'
DECLARE @vow NVARCHAR(100) SET @vow = 'aeiouäöüaeiouaeiou'

and select a random consonant or vowel using the expressions

SUBSTRING(@cons, (ABS(CHECKSUM(NEWID())) % LEN(@cons))+1, 1)
SUBSTRING(@vow, (ABS(CHECKSUM(NEWID())) % LEN(@vow))+1, 1)

We use the function ABS(CHECKSUM(NEWID())) to generate random numbers rather than RAND() because this expression is evaluated for every record in a multi-record statement, rather than once. To illustrate the difference, compare the result of the query

SELECT RAND(), ABS(CHECKSUM(NEWID())) FROM sys.objects

Note that by repeating a letter inside either strings you can change the relative frequency of that letter in the resulting name.

By prepending ‘Zy’ to the generated string we can make sure that the generated names are easily recognize, and occur at the end of lists.

Replacing a persons name with a generated pseudonym may be as easy as this code:

DECLARE @cons NVARCHAR(100) SET @cons = 'bcdfghjklmnpqrstvwxyz'
DECLARE @vow NVARCHAR(100) SET @vow = 'aeiouäöüaeiouaeiou'
DECLARE @cons1 NVARCHAR(100) SET @cons1 = 'bcdfghjklmnpqrstvw'
DECLARE @vow1 NVARCHAR(100) SET @vow1 = 'aeiou'

UPDATE dbo.PersonTable
SET LastName = ' Zy' 
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1)
    + SUBSTRING(@vow, (ABS(CHECKSUM(NewId())) % LEN(@vow))+1, 1)
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1)
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1)
    + SUBSTRING(@vow, (ABS(CHECKSUM(NewId())) % LEN(@vow))+1, 1)
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1),
    FirstName = 'Zy'
    + SUBSTRING(@cons1, (ABS(CHECKSUM(NewId())) % LEN(@cons1))+1, 1)
    + SUBSTRING(@vow1, (ABS(CHECKSUM(NewId())) % LEN(@vow1))+1, 1)
    + SUBSTRING(@cons1, (ABS(CHECKSUM(NewId())) % LEN(@cons1))+1, 1),
WHERE [AnonymizeCondition] = 1

 

AnonymizeCondition refers to the condition you determined when a personal record needs to be anonymized, and has not yet been anonymized.

See the code in my GitHub repository with T-SQL snippets.

The great Database Diagram mystery

You may (or may not) know that the Database Diagrammer of SSMS stores the created diagrams in a table called dbo.sysdiagrams.

This table has always been a mystery to me, especially the column “definition”, as it contains binary data, and documentation cannot be found about its format.

So when I tried to figure out what all the bytes meant, and my standard editor plus hex viewer plug-in (read: Notepad++) could not help me any further, I searched and found FlexHEX which promised to provide the functionality to annotate and structure binary data. Somehow this did not work out as expected – I could not figure out how to create and edit “structures” – but as I clicked around, I hit the menu “Open OLE Compound File…” as a last resort. What could go wrong.

flexhex compound file.png

To my great surprise, this command actually opened the file and displayed a file structure, with each of the sections containing some data.

As it turned out, the section DSREF-SCHEMA-CONTENTS was the one I was looking for, as it contains, as I later found out, all the textual information in the diagram:

  •  (original) diagram name
  • connection string
  • list of all tables in the diagram (as of last save)

flexhex data diagram.png

This section is stored as a Microsoft Compound File or Compound File Binary Format or Compound Document File, and the easiest way to process it in C# is the nuget package OpenMcdf (source on GitHub).

Writing a small tool, I verified that the sections from the binary data really have those names, and are not artifacts of the hex editor.

Once I got hold of the data contained in section DSREF-SCHEMA-CONTENTS, making sense of the structure of this data was quite straight-forward.

SQL Server Sysobjects Types (Version 2019)

MS SQL Server stores the information about the objects (tables, views, functions, etc) stored in the database in a table, which is accessible by selecting from the sys.objects view. The type of object is given by the column “type”

I collected this cross-version information from Microsoft Docs online, but fixed a couple of historical mistakes there.

Object type: 2008 2012 2016 2019
sys.objects.type
AF Aggregate function (CLR) x x x x
C CHECK constraint x x x x
D DEFAULT (constraint or stand-alone) x x x x
EC Edge constraint x
ET External Table x x
F FOREIGN KEY constraint x x x x
FN SQL scalar function x x x x
FS Assembly (CLR) scalar function x x x x
FT Assembly (CLR) table-valued function x x x x
IF SQL inline table-valued function x x x x
IT Internal table x x x x
P SQL stored procedure x x x x
PC Assembly (CLR) stored procedure x x x x
PG Plan guide x x x x
PK PRIMARY KEY constraint x x x x
R Rule (old-style, stand-alone) x x x x
RF Replication-filter-procedure x x x x
S System base table x x x x
SN Synonym x x x x
SO Sequence object x x x
SQ Service queue x x x x
TA Assembly (CLR) DML trigger x x x x
TF SQL table-valued-function x x x x
TR SQL DML trigger x x x x
TT Table type x x x x
U Table (user-defined) x x x x
UQ UNIQUE constraint x x x x
V View x x x x
X Extended stored procedure x x x x

The sys.objects.type_desc column contains the description corresponding to the type:

type sys.objects.type_desc
AF AGGREGATE_FUNCTION
C CHECK_CONSTRAINT
D DEFAULT_CONSTRAINT
EC
ET
F FOREIGN_KEY_CONSTRAINT
FN SQL_SCALAR_FUNCTION
FS CLR_SCALAR_FUNCTION
FT CLR_TABLE_VALUED_FUNCTION
IF SQL_INLINE_TABLE_VALUED_FUNCTION
IT INTERNAL_TABLE
P SQL_STORED_PROCEDURE
PC CLR_STORED_PROCEDURE
PG PLAN_GUIDE
PK PRIMARY_KEY_CONSTRAINT
R RULE
RF REPLICATION_FILTER_PROCEDURE
S SYSTEM_TABLE
SN SYNONYM
SO SEQUENCE_OBJECT
SQ SERVICE_QUEUE
TA CLR_TRIGGER
TF SQL_TABLE_VALUED_FUNCTION
TR SQL_TRIGGER
TT TYPE_TABLE
U USER_TABLE
UQ UNIQUE_CONSTRAINT
V VIEW
X EXTENDED_STORED_PROCEDURE

My previous post on this topic only covered SQL Server up to version 2008.