Splitting long text columns into multiple NVARCHAR records

Management Studio truncates strings displayed in the results grid, which keeps you from viewing the whole string stored in a text column.

A question on Stack Overflow prompted me to write this SELECT statement, which splits long strings into several records:

SELECT texts.id, 
       SUBSTRING(texts.text_column, number*100+1, 100)
FROM
    (SELECT texts.id, texts.text_column, 
        (99+LEN(texts.text_column))/100 AS l
     FROM texts) AS texts
OUTER APPLY
    (SELECT TOP(l) number
     FROM master.dbo.spt_values val
     WHERE name IS NULL
     ORDER BY number) n

In this example, the resulting strings are 100 characters long. The built-in table master.dbo.spt_values contains integer values from 0 to 2047.
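
To try it out, here is a minimal sketch of the assumed texts table (the table and column names are simply taken from the query above):

CREATE TABLE texts (
    id INT IDENTITY(1,1) PRIMARY KEY,
    text_column NVARCHAR(MAX)
)

INSERT INTO texts (text_column)
VALUES (REPLICATE(CONVERT(NVARCHAR(MAX), N'0123456789'), 35))  -- 350 characters

For this record, the query should return four rows: three chunks of 100 characters and a final chunk of 50 characters.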

Transliterating Cyrillic into Latin alphabet with T-SQL

After sketching my CTE-based multi-REPLACE procedure, I wanted to test it on a real-life scenario. The first example is a translation table to transliterate the Cyrillic alphabet into Latin.

First, we define the data and translation tables using a case-sensitive Cyrillic collation (COLLATE Cyrillic_General_CS_AS):

CREATE TABLE #T (
    level int identity(1,1),
    old    NVARCHAR(50) COLLATE Cyrillic_General_CS_AS,
    new    NVARCHAR(50) COLLATE Cyrillic_General_CS_AS
)

CREATE TABLE #Data (
    data    NVARCHAR(50) COLLATE Cyrillic_General_CS_AS
)

Next, we define the transliterations. See here for a list of standardized transliterations.

INSERT INTO #T VALUES(N'ый', 'y')

INSERT INTO #T VALUES(N'ЫЙ', 'Y')

INSERT INTO #T VALUES(N'а', 'a')
INSERT INTO #T VALUES(N'б', 'b')
INSERT INTO #T VALUES(N'в', 'v')
INSERT INTO #T VALUES(N'г', 'g')
INSERT INTO #T VALUES(N'д', 'd')
INSERT INTO #T VALUES(N'е', 'e')
INSERT INTO #T VALUES(N'ё', 'yo')
INSERT INTO #T VALUES(N'ж', 'zh')
INSERT INTO #T VALUES(N'з', 'z')
INSERT INTO #T VALUES(N'и', 'i')
INSERT INTO #T VALUES(N'й', 'y')
INSERT INTO #T VALUES(N'к', 'k')
INSERT INTO #T VALUES(N'л', 'l')
INSERT INTO #T VALUES(N'м', 'm')
INSERT INTO #T VALUES(N'н', 'n')
INSERT INTO #T VALUES(N'о', 'o')
INSERT INTO #T VALUES(N'п', 'p')
INSERT INTO #T VALUES(N'р', 'r')
INSERT INTO #T VALUES(N'с', 's')
INSERT INTO #T VALUES(N'т', 't')
INSERT INTO #T VALUES(N'у', 'u')
INSERT INTO #T VALUES(N'ф', 'f')
INSERT INTO #T VALUES(N'х', 'kh')
INSERT INTO #T VALUES(N'ц', 'c')
INSERT INTO #T VALUES(N'ч', 'ch')
INSERT INTO #T VALUES(N'ш', 'sh')
INSERT INTO #T VALUES(N'щ', 'shch')
INSERT INTO #T VALUES(N'ъ', '')
INSERT INTO #T VALUES(N'ы', 'y')
INSERT INTO #T VALUES(N'ь', '''')
INSERT INTO #T VALUES(N'э', 'e')
INSERT INTO #T VALUES(N'ю', 'yu')
INSERT INTO #T VALUES(N'я', 'ya')

INSERT INTO #T VALUES(N'А', 'A')
INSERT INTO #T VALUES(N'Б', 'B')
INSERT INTO #T VALUES(N'В', 'V')
INSERT INTO #T VALUES(N'Г', 'G')
INSERT INTO #T VALUES(N'Д', 'D')
INSERT INTO #T VALUES(N'Е', 'E')
INSERT INTO #T VALUES(N'Ё', 'YO')
INSERT INTO #T VALUES(N'Ж', 'ZH')
INSERT INTO #T VALUES(N'З', 'Z')
INSERT INTO #T VALUES(N'И', 'I')
INSERT INTO #T VALUES(N'Й', 'Y')
INSERT INTO #T VALUES(N'К', 'K')
INSERT INTO #T VALUES(N'Л', 'L')
INSERT INTO #T VALUES(N'М', 'M')
INSERT INTO #T VALUES(N'Н', 'N')
INSERT INTO #T VALUES(N'О', 'O')
INSERT INTO #T VALUES(N'П', 'P')
INSERT INTO #T VALUES(N'Р', 'R')
INSERT INTO #T VALUES(N'С', 'S')
INSERT INTO #T VALUES(N'Т', 'T')
INSERT INTO #T VALUES(N'У', 'U')
INSERT INTO #T VALUES(N'Ф', 'F')
INSERT INTO #T VALUES(N'Х', 'KH')
INSERT INTO #T VALUES(N'Ц', 'C')
INSERT INTO #T VALUES(N'Ч', 'CH')
INSERT INTO #T VALUES(N'Ш', 'SH')
INSERT INTO #T VALUES(N'Щ', 'SHCH')
INSERT INTO #T VALUES(N'Ъ', '')
INSERT INTO #T VALUES(N'Ы', 'Y')
INSERT INTO #T VALUES(N'Ь', '''')
INSERT INTO #T VALUES(N'Э', 'E')
INSERT INTO #T VALUES(N'Ю', 'YU')
INSERT INTO #T VALUES(N'Я', 'YA')

Note that the order of the INSERT statements affects the transliterated result; that’s why the multi-character replacements are stated first.

I chose some random words from the Russian Wikipedia as test cases:

INSERT INTO #Data (data) VALUES (N'Добро пожаловать')
INSERT INTO #Data (data) VALUES (N'Википедия')

INSERT INTO #Data (data) VALUES (N'День Конституции Российской Федерации')
INSERT INTO #Data (data) VALUES (N'География')
INSERT INTO #Data (data) VALUES (N'Америка: Канада (Квебек) — Мексика')
INSERT INTO #Data (data) VALUES (N'Философия')

Using the transliteration CTE, we get the following results:

Америка: Канада (Квебек) — Мексика → Amerika: Kanada (Kvebek) — Meksika
Википедия → Vikipediya
География → Geografiya
День Конституции Российской Федерации → Den’ Konstitucii Rossiyskoy Federacii
Добро пожаловать → Dobro pozhalovat’
Философия → Filosofiya

Multiple String REPLACEs with T-SQL CTEs (2)

My algorithm for replacing multiple strings in an SQL Server table works in principle, but soon after publishing the article I noticed that it performs terribly once you define more than two or three text translations.

Originally I thought it would be a good idea to calculate all permutations of replacements, so that the “best” permutation of an original string would win. As it turned out, it is not.

The core of the problem is that the LIKE-based matching creates recursion on the order of n factorial. I guess that’s one of the least efficient solutions I’ve ever written.

New approach: the translation table contains a sequence identifier, and its translations are thus executed in the given order.

CREATE TABLE #T (
    level INT IDENTITY(1,1),
    old NVARCHAR(50),
    new NVARCHAR(50)
)

CREATE TABLE #Data (
	data NVARCHAR(50)
)

I named the column “level” because it is joined with the level counter of the CTE. The IDENTITY clause causes the translations to be performed in the order of the INSERT statements.

The new CTE with ordered translations:

WITH CTE (org, calc, data, level) AS
(
    SELECT data, data, data, 1
    FROM	#Data
	
    UNION ALL
	
    SELECT CTE.org, CTE.data, 
        CONVERT(NVARCHAR(50), REPLACE(CTE.data, #T.old, #T.new)), 
        CTE.level + 1
    FROM	CTE
    INNER JOIN #T ON #T.level = CTE.level
)
SELECT DISTINCT org, data
FROM CTE
WHERE level = 
    (SELECT MAX(LEVEL) FROM CTE c WHERE CTE.org = c.org)
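
A minimal example (with made-up translations and data) illustrates the effect of the ordering:

INSERT INTO #T (old, new) VALUES (N'ab', N'X')   -- level 1
INSERT INTO #T (old, new) VALUES (N'a', N'Y')    -- level 2

INSERT INTO #Data (data) VALUES (N'abca')

The CTE then returns 'abca' / 'XcY': 'ab' is replaced first (level 1), then the remaining 'a' (level 2). With the opposite order, the result would be 'YbcY', because replacing 'a' first destroys every occurrence of 'ab'.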

Protecting SQL Server Data from Accidental Deletion

In nearly every project that I have worked on in recent years, the database not only stores the data maintained by the application, but also describes (parts of) the application itself. If you have ever had to implement a permission system that grants users the right to view or edit tables, open forms, or execute functions, you already know that.

The resulting problem is that if the application relies on certain key data to be present and correct, accidental modification or deletion of that data usually causes the application to fail.

In this post, I'll show how to use triggers to prevent accidental data modification.

Prevent table data deletion

The simplest way to prevent data deletion is to have an INSTEAD OF DELETE trigger which does nothing, or simply raises an error:

CREATE TRIGGER [dbo].[Prevent_Foo_Delete]
    ON [dbo].[Foo]
    INSTEAD OF DELETE
AS
BEGIN
    SET NOCOUNT ON;
    RAISERROR('You cannot delete Foo',1,0);
END

If you really need to delete data, use the DISABLE TRIGGER and ENABLE TRIGGER commands (SQL Server 2005 and 2008).
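
For example:

DISABLE TRIGGER [dbo].[Prevent_Foo_Delete] ON [dbo].[Foo]

-- perform the intended deletion here

ENABLE TRIGGER [dbo].[Prevent_Foo_Delete] ON [dbo].[Foo]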

Conditional deletion prevention

In this case, deletion should only be allowed under certain conditions. For example, we could allow only single records to be deleted:

CREATE TRIGGER [dbo].[Prevent_Foo_Multi_Delete]
   ON  [dbo].[Foo]
   INSTEAD OF DELETE
AS
BEGIN
    SET NOCOUNT ON;

    IF (SELECT COUNT(ID) FROM deleted) > 1
        RAISERROR('You can only delete a single Foo',1,0);
    ELSE
        DELETE Foo
        FROM Foo INNER JOIN deleted ON Foo.ID = deleted.ID
END

Similarly, one could restrict a delete operation to detail records belonging to a single master record by writing

    IF (SELECT COUNT(DISTINCT BAR_ID) FROM deleted) > 1

Preventing Modifications

The same method can be used for UPDATE triggers. It may, however, be easier to define an AFTER UPDATE trigger (FOR UPDATE) than to rephrase the UPDATE statement inside an INSTEAD OF trigger. In case of failure, we roll back the current transaction:

CREATE TRIGGER [dbo].[Prevent_Foo_Update]
   ON  [dbo].[Foo]
   FOR UPDATE
AS
BEGIN
      SET NOCOUNT ON;

      IF (SELECT COUNT(ID) FROM inserted) > 1 BEGIN
            ROLLBACK TRANSACTION
            RAISERROR('You can only modify 1 Foo',1,0);
      END
END

Preventing Truncation

These mechanisms protect you from an accidental UPDATE or DELETE of all records (e.g. caused by a missing WHERE clause, or by a semicolon in front of the WHERE condition).

However, there is still the TRUNCATE TABLE command, which deletes all data in a table and cannot be stopped by a DELETE trigger:

Because TRUNCATE TABLE is not logged, it cannot activate a trigger.

The rescue can be found in the preceding sentence of the documentation:

You cannot use TRUNCATE TABLE on a table referenced by a FOREIGN KEY constraint

Simply have a table that references the tables to be protected:

CREATE TABLE Prevent_Truncate(
    Foo_ID INT REFERENCES Foo(ID),
    Bar_ID INT REFERENCES Bar(ID)
)
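
The Prevent_Truncate table can remain empty; the mere existence of the foreign key references is enough to block truncation:

TRUNCATE TABLE Foo
-- fails with an error along the lines of:
-- Cannot truncate table 'Foo' because it is being referenced by a FOREIGN KEY constraint.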

You only appreciate how valuable your data is once it’s lost 😉

Introducing oraddlscript

A recent question on StackOverflow inspired me to write a utility called oraddlscript:

How to generate scripts for Oracle schema objects, such as tables, procedures, etc.

The answers directed me to the DBMS_METADATA package (9i documentation, 10g documentation). The function DBMS_METADATA.GET_DDL is the functional equivalent of the MSSQL SMO library, which prompted me to adapt my command-line utility SMOscript to Oracle databases. Voilà, oraddlscript.
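
For reference, a raw DBMS_METADATA call looks like this (table and schema names are illustrative):

SELECT DBMS_METADATA.GET_DDL('TABLE', 'EMPLOYEES', 'HR') FROM DUAL;

oraddlscript wraps these calls behind a command-line interface modeled on SMOscript: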

oraddlscript 0.14.3584.16268 (c) by devio.at 2009

    list and script databases and database objects.

    usage: oraddlscript [options] [command]

    options: (leading '-' or '/')

    -s server       TNS name or host name or host:port
    -svc service    service name (if host name is provided)
    -o owner        owner name

    -u username     username (default: integrated authentication)
    -p password     password (if -u is missing, password for sa)

    -f filename     output to file
    -F directory    output to directory

    -A              current ANSI codepage
    -O              ASCII
    -T              Unicode
    -U              UTF8

    commands:

    l               list objects
    s               script object/s (using dbms_meta.get*ddl)
    -xml            use dbms_meta.get*xml

    list object owners on server (implied by -s)
    list objects in database (implied by -s -o)
    script all objects (implied by -s -o -F/-f)

The command-line arguments are consistent with SMOscript, except for -d (database) which has been replaced by -o (owner name).
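
A typical invocation to script all objects of one schema into a directory might look like this (server, service, credentials, and path are illustrative):

oraddlscript -s dbserver:1521 -svc ORCL -u scott -p tiger -o HR -F C:\scripts s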

The list of objects is retrieved by querying DBA_OBJECTS, ALL_OBJECTS or USER_OBJECTS, depending on which of these catalog views is accessible to the user specified by -u.
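
For a non-DBA user this boils down to a query roughly like the following (a sketch, not the tool's literal statement):

SELECT object_type, object_name
FROM all_objects
WHERE owner = 'HR'
ORDER BY object_type, object_name;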

The package also contains a function GET_XML which is used to retrieve the XML representation of a database object.
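
Again with illustrative names:

SELECT DBMS_METADATA.GET_XML('TABLE', 'EMPLOYEES', 'HR') FROM DUAL;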

The functions of oraddlscript are:

  • list usernames of object owners
  • list objects of specific owner
  • generate CREATE scripts of objects owned by specific user
  • generate XML files representing objects owned by specific user
  • generate one file per object or single file for all objects

Of course, logging and batch operations work just as previously described for SMOscript.

oraddlscript is available for download here.

Finding cycles in directed graphs using TSQL

I recently noticed that the Create Data Diagram function of dbscript creates rather distorted diagrams if the table relations contain cycles of more than 2 tables. However, the algorithm handles pig's ears (self-referencing tables) and reciprocal relations correctly.

So the question was, how do I detect cycles in table relations?

Viewed as a directed graph, tables correspond to nodes and foreign key relations to directed edges, so the question can be rephrased as:

How do I detect cycles in a directed graph, given that the graph is stored in two tables with this structure:

CREATE TABLE #Node (
	ID INT PRIMARY KEY,
	Name NVARCHAR(128)
)

CREATE TABLE #Edge (
	ID INT IDENTITY(1,1) PRIMARY KEY,
	NodeIDFrom INT,
	NodeIDTo INT
)

Let’s populate the tables with the following values:

INSERT INTO #Node (ID, Name) VALUES (1, 'A')
INSERT INTO #Node (ID, Name) VALUES (2, 'B1')
INSERT INTO #Node (ID, Name) VALUES (3, 'B2')
INSERT INTO #Node (ID, Name) VALUES (4, 'C1')
INSERT INTO #Node (ID, Name) VALUES (5, 'C2')
INSERT INTO #Node (ID, Name) VALUES (6, 'C3')
INSERT INTO #Node (ID, Name) VALUES (7, 'D')
INSERT INTO #Node (ID, Name) VALUES (8, 'E')

-- Bx loop
INSERT INTO #Edge(NodeIDFrom, NodeIDTo) VALUES (2, 3)
INSERT INTO #Edge(NodeIDFrom, NodeIDTo) VALUES (3, 2)
-- Cx loop
INSERT INTO #Edge(NodeIDFrom, NodeIDTo) VALUES (4, 5)
INSERT INTO #Edge(NodeIDFrom, NodeIDTo) VALUES (5, 6)
INSERT INTO #Edge(NodeIDFrom, NodeIDTo) VALUES (6, 4)
-- D, E, C1, B1
INSERT INTO #Edge(NodeIDFrom, NodeIDTo) VALUES (7, 8)
INSERT INTO #Edge(NodeIDFrom, NodeIDTo) VALUES (8, 4)
INSERT INTO #Edge(NodeIDFrom, NodeIDTo) VALUES (4, 2)

That is: node A is not connected to any other node, the Bx nodes form a cycle (length 2), the Cx nodes form a cycle (length 3), and there is a path D-E-C1-B1.

The algorithm performs the following operations:

In table #Graph, create a list of all paths starting from every node, extending each path until its most recently reached node matches its starting node. Path building ends when no new edges can be added to the list of paths (@@ROWCOUNT = 0):

CREATE TABLE #Graph(
	ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
	Seq INT,
	NodeID INT,
	EdgeID INT,
	NodeIDNext INT,
	NodeIDRoot int,
	IsCycle BIT DEFAULT (0)
)

INSERT INTO #Graph (Seq, NodeID, NodeIDRoot, EdgeID, NodeIDNext)
SELECT DISTINCT 1, #Node.ID, #Node.ID, #Edge.ID, #Edge.NodeIDTo
FROM #Node
INNER JOIN #Edge ON #Node.ID = #Edge.NodeIDFrom

PRINT @@ROWCOUNT

DECLARE @Seq INT, @RC INT
SET @Seq = 1
SET @RC = 1

WHILE @RC > 0 BEGIN

	UPDATE 	#Graph
	SET	IsCycle = 1
	WHERE	NodeIDRoot = NodeIDNext
	AND	Seq = @Seq

	INSERT INTO #Graph (Seq, NodeID, NodeIDRoot, EdgeID, NodeIDNext)
	SELECT DISTINCT @Seq + 1, #Edge.NodeIDFrom, #Graph.NodeIDRoot, 
		#Edge.ID, #Edge.NodeIDTo
	FROM 	#Graph
	INNER JOIN #Edge ON #Graph.NodeIDNext = #Edge.NodeIDFrom
	LEFT OUTER JOIN #Graph ex ON ex.NodeIDRoot = #Graph.NodeIDRoot 
		AND ex.EdgeID = #Edge.ID
	WHERE 	#Graph.IsCycle = 0
	AND 	#Graph.Seq = @Seq
	AND 	ex.ID IS NULL

	SET @RC = @@ROWCOUNT

	PRINT CONVERT(VARCHAR, @Seq) + ': ' + CONVERT(VARCHAR, @RC)
	SET @Seq = @Seq + 1
END

SELECT 	#Graph.NodeIDRoot, #Graph.Seq, #Graph.NodeID, 
		#Graph.NodeIDNext, #Graph.EdgeID, #Node.Name, Node2.Name
FROM 	#Graph
INNER JOIN #Node ON #Graph.NodeID = #Node.ID
INNER JOIN #Node Node2 ON #Graph.NodeIDNext = Node2.ID
WHERE 	IsCycle = 1
ORDER BY #Graph.NodeIDRoot, #Graph.Seq

The IsCycle attribute marks paths that have reached their starting node (NodeIDRoot = NodeIDNext).

Next, we collect all final steps that led to cycle detection:

CREATE TABLE #Cycles(
	ID INT IDENTITY(1,1) PRIMARY KEY,
	CycleID INT,
	NodeIDRoot INT,
	Seq INT,
	NodeID INT,
	NodeIDNext INT,
	NodeName NVARCHAR(128),
	EdgeID INT
)

INSERT INTO #Cycles (NodeIDRoot, Seq, NodeID, NodeIDNext, NodeName, EdgeID)
SELECT NodeIDRoot, Seq, NodeID, NodeIDNext, #Node.Name, EdgeID
FROM 	#Graph
INNER JOIN #Node ON #Graph.NodeIDNext = #Node.ID
WHERE 	IsCycle = 1
ORDER BY #Graph.NodeIDRoot, #Graph.Seq

UPDATE 	#Cycles SET CycleID = ID

SELECT * FROM #Cycles
ORDER BY CycleID, Seq

SET @RC = @@ROWCOUNT

From this information we reconstruct the path taken from the starting nodes to the end nodes:

WHILE @RC > 0 BEGIN
	INSERT INTO #Cycles (CycleID, NodeIDRoot, Seq, NodeID, NodeIDNext, NodeName, EdgeID)
	SELECT DISTINCT #Cycles.CycleID, #Cycles.NodeIDRoot, #Graph.Seq, 
		#Graph.NodeID, #Graph.NodeIDNext, #Node.Name, #Graph.EdgeID
	FROM #Cycles
	INNER JOIN #Graph ON #Cycles.NodeIDRoot = #Graph.NodeIDRoot
		AND #Cycles.Seq -1 = #Graph.Seq
		AND #Cycles.NodeID = #Graph.NodeIDNext
	INNER JOIN #Node ON #Graph.NodeIDNext = #Node.ID
	LEFT OUTER JOIN #Cycles ex ON #Cycles.CycleID = ex.CycleID
		AND ex.NodeIDRoot = #Cycles.NodeIDRoot
		AND ex.Seq = #Graph.Seq
		AND ex.NodeID = #Graph.NodeID
		AND ex.NodeIDNext = #Graph.NodeIDNext
		AND ex.EdgeID = #Graph.EdgeID
	WHERE ex.ID IS NULL
	ORDER BY #Cycles.NodeIDRoot, #Graph.Seq

	SET @RC = @@ROWCOUNT
END

SELECT * FROM #Cycles
ORDER BY CycleID, Seq

The last query lists the cycles:
Cycle 1: B2 – B1
Cycle 2: B1 – B2
Cycle 3: C2 – C3 – C1
Cycle 4: C3 – C1 – C2
Cycle 5: C1 – C2 – C3

Finally, we clean up the temporary tables:

DROP TABLE #Edge
DROP TABLE #Node
DROP TABLE #Graph
DROP TABLE #Cycles

Parsing SQL into XML and C# Objects

I recently wrote about my attempts to parse SQL statements and extract schema information from them. (dbscript already has a SQL parser, but it needs to be replaced to support different SQL dialects.)

As it turns out, code generation is the way to go!

I start out with an object-oriented representation of the grammar of the target language, in this case T-SQL. A (custom-written) lexer turns the input stream into a sequence of lexemes or tokens, which the parser engine feeds into the grammar’s rule definitions.

The result of this action is an XML document whose structure is equivalent to the grammatical definition of the respective SQL statement.

SQL Parser Data Model

This is where the code generation magic comes in: the grammar definition is also used to generate a C# object model, mapping each rule onto a C# class. Each class implements an interface requiring a SetXml() method, which reads the contents of the instantiated object from an XmlNode structure and thus populates its child objects recursively.

The application logic (which can be seen as sort of a “parser client”) only needs to instantiate the grammar object and the parser object, feed the parser the file, and retrieve a typed object containing an OO representation of the parsed file.

The next version of dbscript will be built on these mechanisms.

Parsing SQL into XML (2)

I recently started re-writing my SQL parser for dbscript and reported on the first results.

Things always turn out more complicated than they initially appear, and one thing I realized was that a simple .sql file is not merely SQL, but also contains some “meta-magic” such as TSQL’s GO statement, which is not a valid SQL command, but really a batch separator for Query Analyzer or Management Studio.

Anyway, there is some kind of progress: I described all SQL statements that the AdventureWorks .sql scripts use, and parsing is pretty quick.

The sample statement of my previous post

CREATE TABLE [Production].[TransactionHistoryArchive](
    [TransactionID] [int] NOT NULL,
    [ProductID] [int] NOT NULL,
    [ReferenceOrderID] [int] NOT NULL,
    [ReferenceOrderLineID] [int] NOT NULL ,
    [TransactionDate] [datetime] NOT NULL ,
    [TransactionType] [nchar](1) COLLATE Latin1_General_CI_AS NOT NULL,
    [Quantity] [int] NOT NULL,
    [ActualCost] [money] NOT NULL,
    [ModifiedDate] [datetime] NOT NULL ,
    CONSTRAINT [PK_TransactionHistoryArchive_TransactionID] PRIMARY KEY CLUSTERED
    (
        [TransactionID] ASC
    ) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
        IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON)
    ON [PRIMARY]
) ON [PRIMARY]

is now parsed into an XML representation which looks like this:

<?xml version="1.0" encoding="Windows-1252"?>
<TSQL>
  <cmdCreateTable>
    <CREATE>CREATE</CREATE>
    <TABLE>TABLE</TABLE>
    <id>
      <id2s>
        <id2>[Production]</id2>
        <x2E>.</x2E>
        <id1>[TransactionHistoryArchive]</id1>
      </id2s>
    </id>
    <x28>(</x28>
    <columns>
      <column>
        <id>[TransactionID]</id>
        <datatype>
          <id>
            <id1s>
              <id1>[int]</id1>
            </id1s>
          </id>
        </datatype>
        <attr>
          <notnull>
            <NOT>NOT</NOT>
            <NULL>NULL</NULL>
          </notnull>
        </attr>
      </column>
      <x2C>,</x2C>
....
      <x2C>,</x2C>
      <column>
        <id>[ModifiedDate]</id>
        <datatype>
          <id>
            <id1s>
              <id1>[datetime]</id1>
            </id1s>
          </id>
        </datatype>
        <attr>
          <notnull>
            <NOT>NOT</NOT>
            <NULL>NULL</NULL>
          </notnull>
        </attr>
      </column>
      <x2C>,</x2C>
      <constraint>
        <CONSTRAINT>CONSTRAINT</CONSTRAINT>
        <constraintid>[PK_TransactionHistoryArchive_TransactionID]</constraintid>
        <type>
          <pk>
            <PRIMARY>PRIMARY</PRIMARY>
            <KEY>KEY</KEY>
            <clustered>
              <CLUSTERED>CLUSTERED</CLUSTERED>
            </clustered>
            <indexcolumns>
              <x28>(</x28>
              <indexcolumns>
                <indexcolumn>
                  <colasc>
                    <columnid>[TransactionID]</columnid>
                    <ASC>ASC</ASC>
                  </colasc>
                </indexcolumn>
              </indexcolumns>
              <x29>)</x29>
            </indexcolumns>
            <with>
              <WITH>WITH</WITH>
              <x28>(</x28>
              <params>
                <param>
                  <key>PAD_INDEX</key>
                  <x3D>=</x3D>
                  <value>OFF</value>
                </param>
                <x2C>,</x2C>
....
                <x2C>,</x2C>
                <param>
                  <key>ALLOW_PAGE_LOCKS</key>
                  <x3D>=</x3D>
                  <value>ON</value>
                </param>
              </params>
              <x29>)</x29>
            </with>
            <onfilegroup>
              <ON>ON</ON>
              <filegroupid>[PRIMARY]</filegroupid>
            </onfilegroup>
          </pk>
        </type>
      </constraint>
    </columns>
    <x29>)</x29>
    <onfilegroup>
      <ON>ON</ON>
      <filegroupid>[PRIMARY]</filegroupid>
    </onfilegroup>
  </cmdCreateTable>
</TSQL>

The complete result file is here.

So now that I manage to completely parse and represent the contents of a SQL statement, the question remains how to evaluate the result:

  • Analyze internal representation (messy)
  • Analyze XML representation (complicated)
  • Trigger events for each parsed entity and their attributes (complicated)

The problem is always that the evaluating routine seems to be de-coupled from the parser.

The first solution depends too much on the internal representation. Changing a small aspect of the parsing mechanism or the grammar would break existing code.

The other two solutions are less dependent, but a change in the grammar still needs to be reflected in the evaluator, with no immediate error reporting if the evaluator no longer matches the parser.

I read about Xtext the other day, a DSL parser for Eclipse, which seems to include a code generator that generates Java classes matching the grammar. An interesting solution, since a change in the grammar and subsequent code generation would immediately break compilation of the evaluator.

To be continued… 😉

Parsing SQL into XML

dbscript has a built-in SQL script parser so that users can upload SQL scripts and analyze and document their database schemas.

While the parser is good enough to both extract all the relevant information from a SQL script and also skip over unsupported commands or simply plain text, I felt that the current approach would not scale well if I wanted to support more commands or different SQL dialects.

I started to play with a couple of C# classes, re-using the original dbscript T-SQL tokenizer, and soon had a small class library to define terminals, rules, and a parser class that kept everything together.

Additionally, the parser result can be output as XML.

Some parts of the parsing process turned out to be rather tough, such as the Repeater (or Loop) construct (as I said, it was playing rather than planning ahead), but I guess I now have a good framework to start implementing a T-SQL parser.

A table definition taken from the MS AdventureWorks sample database

CREATE TABLE [Production].[TransactionHistoryArchive](
 [TransactionID] [int] NOT NULL,
 [ProductID] [int] NOT NULL,
 [ReferenceOrderID] [int] NOT NULL,
 [ReferenceOrderLineID] [int] NOT NULL ,
 [TransactionDate] [datetime] NOT NULL ,
 [TransactionType] [nchar](1) COLLATE Latin1_General_CI_AS NOT NULL,
 [Quantity] [int] NOT NULL,
 [ActualCost] [money] NOT NULL,
 [ModifiedDate] [datetime] NOT NULL ,
 CONSTRAINT [PK_TransactionHistoryArchive_TransactionID] PRIMARY KEY CLUSTERED
(
 [TransactionID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF,
  ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

is parsed and translated into XML like this:

<TSQL><cmdCREATETABLE><CREATE>CREATE</CREATE><TABLE>TABLE</TABLE>
<id13a><id2s><id2>[Production]</id2><x2E>.</x2E>
<id1>[TransactionHistoryArchive]</id1></id2s></id13a>
<columnsr><column><colid><id1>[TransactionID]</id1></colid>
<datatype><id1s><id1>[int]</id1></id1s></datatype>
<attr><notnull><NOT>NOT</NOT><NULL>NULL</NULL></notnull><rep-end>,</rep-end></attr>
</column>
...
</columnsr><x29>)</x29>
<onfg><ON>ON</ON><filegroup><id1>[PRIMARY]</id1></filegroup></onfg>
</cmdCREATETABLE></TSQL>

See the whole generated XML file here.

A “GO” command is transformed into this XML document:

<TSQL><cmdGO><GO>GO</GO></cmdGO></TSQL>

This is a work in progress. The naming of the symbols is not final, and the parser code needs to be cleaned up and separated into generic parsing and T-SQL-specific functionality.