Anonymizing Personal Data

Databases used to collect all the data ever entered; running out of disk space was simply solved by adding another storage device.

Then came the GDPR, and companies storing personal data (or personally identifiable data) suddenly had to consider Data Deletion Policies and Data Retention Policies.

The easiest way is, of course, to delete the records storing obsolete personal data. But sometimes (read: almost always) it is not possible to delete a record, because its mere existence may be important for historical or statistical reasons (e.g. the query “number of purchases per time interval and region” almost certainly refers to a person and their location).

So it is more realistic that some of a person’s data is kept, some is erased, and the personal identifiers (first name, last name) are anonymized or pseudonymized.

What could be easier than replacing the name fields with an arbitrary combination of letters?

To make the pseudonyms more pleasant, it helps to follow a typical syllable structure; the most basic is C-V-C (consonant-vowel-consonant), or any extension thereof: C-V-C-V-C, C-V-C-C-V, etc.

So we define the letters for consonants and vowels as

DECLARE @cons NVARCHAR(100) SET @cons = 'bcdfghjklmnpqrstvwxyz'
DECLARE @vow NVARCHAR(100) SET @vow = 'aeiouäöüaeiouaeiou'

and select a random consonant or vowel using the expressions

SUBSTRING(@cons, (ABS(CHECKSUM(NEWID())) % LEN(@cons))+1, 1)
SUBSTRING(@vow, (ABS(CHECKSUM(NEWID())) % LEN(@vow))+1, 1)

We use the expression ABS(CHECKSUM(NEWID())) to generate random numbers rather than RAND(), because NEWID() is evaluated for every record in a multi-record statement, whereas RAND() is evaluated only once per statement. To illustrate the difference, compare the two columns in the result of the query

SELECT RAND(), ABS(CHECKSUM(NEWID())) FROM sys.objects

Note that by repeating a letter inside either string you can change the relative frequency of that letter in the resulting names.

By prepending “Zy” to the generated string, we make sure that the generated names are easily recognizable and sort to the end of lists.

Replacing a person’s name with a generated pseudonym may be as easy as this code:

DECLARE @cons NVARCHAR(100) SET @cons = 'bcdfghjklmnpqrstvwxyz'
DECLARE @vow NVARCHAR(100) SET @vow = 'aeiouäöüaeiouaeiou'
DECLARE @cons1 NVARCHAR(100) SET @cons1 = 'bcdfghjklmnpqrstvw'
DECLARE @vow1 NVARCHAR(100) SET @vow1 = 'aeiou'

UPDATE dbo.PersonTable
SET LastName = 'Zy'
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1)
    + SUBSTRING(@vow, (ABS(CHECKSUM(NewId())) % LEN(@vow))+1, 1)
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1)
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1)
    + SUBSTRING(@vow, (ABS(CHECKSUM(NewId())) % LEN(@vow))+1, 1)
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1),
    FirstName = 'Zy'
    + SUBSTRING(@cons1, (ABS(CHECKSUM(NewId())) % LEN(@cons1))+1, 1)
    + SUBSTRING(@vow1, (ABS(CHECKSUM(NewId())) % LEN(@vow1))+1, 1)
    + SUBSTRING(@cons1, (ABS(CHECKSUM(NewId())) % LEN(@cons1))+1, 1)
WHERE [AnonymizeCondition] = 1


AnonymizeCondition stands for the condition you use to determine that a personal record needs to be anonymized and has not yet been anonymized.
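The same scheme (weighted letter pools, a consonant/vowel pattern, and the “Zy” prefix) can also be prototyped outside the database. Here is a small Python sketch; the function name and patterns are my own illustration, not part of the T-SQL above:

```python
import random
import re

# letter pools; repeating a letter raises its relative frequency
CONS = "bcdfghjklmnpqrstvwxyz"
VOWS = "aeiouäöüaeiouaeiou"

def pseudonym(pattern="CVCCVC", prefix="Zy"):
    """Build a pronounceable pseudonym from a C/V syllable pattern."""
    pools = {"C": CONS, "V": VOWS}
    return prefix + "".join(random.choice(pools[p]) for p in pattern)

print(pseudonym("CVCCVC"))  # a 'Zy...' last-name-style pseudonym
print(pseudonym("CVC"))     # a shorter first-name-style pseudonym
```

The pattern string plays the same role as the fixed sequence of SUBSTRING calls in the UPDATE statement.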

See the code in my GitHub repository with T-SQL snippets.

Tracking down ScrewTurn Wiki

Back in 2012, the world was shocked by the news that ScrewTurn Wiki would not be developed any further.

The universe of .Net-based open-source wikis was small then, and it is even smaller today.

Time to find out what happened to the original STW source code. I tracked down a couple of versions of the original STW3, as well as its successors.

Compilation and initial startup showed the following results:


Version    .Net Version  VS Version  License  compiled                run test
3.0.5.610  3.5           12 (2013)   GPL V2   VS 2013                 ok
3.0.5.629  3.5           11 (2010)   GPL V2   VS 2013                 ok
3.0.6.630  4.5           12 (2013)   GPL V2   VS 2013                 fails to load sql page provider
3.1.0.0    4.7.2         15 (2017)   GPL V2   VS 2017                 fails to load sql provider
4.0.5.145  4.0           11 (2010)   GPL V2   VS 2017 with Azure SDK  ok
6.0.0.0    4.6           14 (2015)   GPL V3   VS 2017 with Azure SDK  ok (run as site)

Comments:

  • “VS Version” is the version of Visual Studio the original code was developed with, while “compiled” shows the Visual Studio version I used to compile the code.
  • Versions 3.0.6.630 and 3.1.0.0 compiled and ran, but did not load the SQL Providers.
  • Versions 4 and above reference Azure libraries. Since the Azure SDK from Visual Studio 2010 is not available anymore, I used Visual Studio 2017 with its Azure SDK.
  • Version 6 is an ASP.Net MVC application. All previous versions are ASP.Net.
  • Version 6 requires ScrewTurn Wiki to run as an IIS site due to lack of application-relative paths in code and markup. All previous versions also run as IIS applications inside an IIS site.

The great Database Diagram mystery

You may (or may not) know that the Database Diagrammer of SSMS stores the created diagrams in a table called dbo.sysdiagrams.

This table has always been a mystery to me, especially the column “definition”: it contains binary data, and no documentation of its format can be found.

So when I tried to figure out what all the bytes meant, and my standard editor with hex viewer plug-in (read: Notepad++) could not help me any further, I searched and found FlexHEX, which promised the functionality to annotate and structure binary data. Somehow this did not work out as expected (I could not figure out how to create and edit “structures”), but as I clicked around, I hit the menu “Open OLE Compound File…” as a last resort. What could go wrong.

flexhex compound file.png

To my great surprise, this command actually opened the file and displayed a file structure, with each of the sections containing some data.

As it turned out, the section DSREF-SCHEMA-CONTENTS was the one I was looking for, as it contains, as I later found out, all the textual information in the diagram:

  •  (original) diagram name
  • connection string
  • list of all tables in the diagram (as of last save)

flexhex data diagram.png

This section is stored as a Microsoft Compound File (also called Compound File Binary Format or Compound Document File), and the easiest way to process it in C# is the NuGet package OpenMcdf (source on GitHub).

Writing a small tool, I verified that the sections in the binary data really have those names and are not artifacts of the hex editor.

Once I got hold of the data contained in section DSREF-SCHEMA-CONTENTS, making sense of its structure was quite straightforward.

SQL Server Sysobjects Types (Version 2019)

MS SQL Server stores the information about the objects (tables, views, functions, etc.) contained in a database in a table which is accessible through the sys.objects view. The type of an object is given by the column “type”.

I collected this cross-version information from Microsoft Docs online, but fixed a couple of historical mistakes there.

type  Object type (sys.objects.type)           2008  2012  2016  2019
AF    Aggregate function (CLR)                  x     x     x     x
C     CHECK constraint                          x     x     x     x
D     DEFAULT (constraint or stand-alone)       x     x     x     x
EC    Edge constraint                           -     -     -     x
ET    External Table                            -     -     x     x
F     FOREIGN KEY constraint                    x     x     x     x
FN    SQL scalar function                       x     x     x     x
FS    Assembly (CLR) scalar function            x     x     x     x
FT    Assembly (CLR) table-valued function      x     x     x     x
IF    SQL inline table-valued function          x     x     x     x
IT    Internal table                            x     x     x     x
P     SQL stored procedure                      x     x     x     x
PC    Assembly (CLR) stored procedure           x     x     x     x
PG    Plan guide                                x     x     x     x
PK    PRIMARY KEY constraint                    x     x     x     x
R     Rule (old-style, stand-alone)             x     x     x     x
RF    Replication-filter-procedure              x     x     x     x
S     System base table                         x     x     x     x
SN    Synonym                                   x     x     x     x
SO    Sequence object                           -     x     x     x
SQ    Service queue                             x     x     x     x
TA    Assembly (CLR) DML trigger                x     x     x     x
TF    SQL table-valued-function                 x     x     x     x
TR    SQL DML trigger                           x     x     x     x
TT    Table type                                x     x     x     x
U     Table (user-defined)                      x     x     x     x
UQ    UNIQUE constraint                         x     x     x     x
V     View                                      x     x     x     x
X     Extended stored procedure                 x     x     x     x

(“-” means the type does not exist in that version.)

The sys.objects.type_desc column contains the description corresponding to the type:

type sys.objects.type_desc
AF AGGREGATE_FUNCTION
C CHECK_CONSTRAINT
D DEFAULT_CONSTRAINT
EC
ET
F FOREIGN_KEY_CONSTRAINT
FN SQL_SCALAR_FUNCTION
FS CLR_SCALAR_FUNCTION
FT CLR_TABLE_VALUED_FUNCTION
IF SQL_INLINE_TABLE_VALUED_FUNCTION
IT INTERNAL_TABLE
P SQL_STORED_PROCEDURE
PC CLR_STORED_PROCEDURE
PG PLAN_GUIDE
PK PRIMARY_KEY_CONSTRAINT
R RULE
RF REPLICATION_FILTER_PROCEDURE
S SYSTEM_TABLE
SN SYNONYM
SO SEQUENCE_OBJECT
SQ SERVICE_QUEUE
TA CLR_TRIGGER
TF SQL_TABLE_VALUED_FUNCTION
TR SQL_TRIGGER
TT TYPE_TABLE
U USER_TABLE
UQ UNIQUE_CONSTRAINT
V VIEW
X EXTENDED_STORED_PROCEDURE
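When post-processing catalog queries in scripts, this mapping can be kept as a simple lookup table. The following Python sketch just transcribes the table above (the helper name `describe` is my own; EC and ET are left out, as their descriptions are missing above):

```python
# sys.objects.type -> sys.objects.type_desc, transcribed from the table above
TYPE_DESC = {
    "AF": "AGGREGATE_FUNCTION",
    "C":  "CHECK_CONSTRAINT",
    "D":  "DEFAULT_CONSTRAINT",
    "F":  "FOREIGN_KEY_CONSTRAINT",
    "FN": "SQL_SCALAR_FUNCTION",
    "FS": "CLR_SCALAR_FUNCTION",
    "FT": "CLR_TABLE_VALUED_FUNCTION",
    "IF": "SQL_INLINE_TABLE_VALUED_FUNCTION",
    "IT": "INTERNAL_TABLE",
    "P":  "SQL_STORED_PROCEDURE",
    "PC": "CLR_STORED_PROCEDURE",
    "PG": "PLAN_GUIDE",
    "PK": "PRIMARY_KEY_CONSTRAINT",
    "R":  "RULE",
    "RF": "REPLICATION_FILTER_PROCEDURE",
    "S":  "SYSTEM_TABLE",
    "SN": "SYNONYM",
    "SO": "SEQUENCE_OBJECT",
    "SQ": "SERVICE_QUEUE",
    "TA": "CLR_TRIGGER",
    "TF": "SQL_TABLE_VALUED_FUNCTION",
    "TR": "SQL_TRIGGER",
    "TT": "TYPE_TABLE",
    "U":  "USER_TABLE",
    "UQ": "UNIQUE_CONSTRAINT",
    "V":  "VIEW",
    "X":  "EXTENDED_STORED_PROCEDURE",
}

def describe(type_code: str) -> str:
    """Translate a sys.objects.type code; note the column is CHAR(2),
    so trailing blanks are stripped. Unknown codes fall back to the code."""
    return TYPE_DESC.get(type_code.strip(), type_code.strip())
```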

My previous post on this topic only covered SQL Server up to version 2008.

Length of UTF-8 VARCHAR in SQL Server

Foreword

For as long as I can remember, a VARCHAR (or CHAR) was always defined as “1 character equals 1 byte”. Different character sets (code pages) were implemented as COLLATIONs, so that you had basic database support for internationalization.

Then came Unicode, and we got NVARCHAR strings (or NCHAR), where the rule was “1 character equals 2 bytes”, and we could store any text from around the world without bothering with code pages, encodings, etc. The .Net framework brought us the string class with similar features and the world was beautiful.

Then, in 2001, came Unicode 3.1, which needed more space:

For the first time, characters are encoded beyond the original 16-bit codespace or Basic Multilingual Plane (BMP or Plane 0). These new characters, encoded at code positions of U+10000 or higher, are synchronized with the forthcoming standard ISO/IEC 10646-2. For further information, see Article IX, Relation to 10646. Unicode 3.1 and 10646-2 define three new supplementary planes.

These additional planes are supported in SQL Server since version 2012: using an *_SC collation, an NVARCHAR character can occupy 2 or 4 bytes.

In C#, the StringInfo class handles supplementary planes, but it seems they are still a bit behind:

Starting with the .NET Framework 4.6.2, character classification is based on The Unicode Standard, Version 8.0.0. For the .NET Framework 4 through the .NET Framework 4.6.1, it is based on The Unicode Standard, Version 6.3.0. In .NET Core, it is based on The Unicode Standard, Version 8.0.0.

(For the record, the current Unicode version is 12.1, and 13.0 is going to be released soon.)

UTF-8 Collations

So now SQL Server 2019 supports UTF-8-enabled collations.

A question on SO quoted the documentation as

A common misconception is to think that CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n) the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored

(emphasis mine) which confused me a little bit, and the quote continues

The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n.

(emphasis mine).

This got me investigating, and I had a closer look at the issue. I created a UTF-8-enabled database with a table containing all kinds of N/VARCHAR columns:

CREATE DATABASE [test-sc] COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8

CREATE TABLE [dbo].[UTF8Test](
  [Id] [int] IDENTITY(1,1) NOT NULL,
  [VarcharText] [varchar](50) COLLATE Latin1_General_100_CI_AI NULL,
  [VarcharTextSC] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
  [VarcharUTF8] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL,
  [NVarcharText] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS NULL,
  [NVarcharTextSC] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
  [NVarcharUTF8] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL
)

I inserted test data from various Unicode ranges:

INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('a','a','a','a','a','a')
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES ('ö','ö','ö',N'ö',N'ö',N'ö')
-- U+56D7
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'囗',N'囗',N'囗',N'囗',N'囗',N'囗')
-- U+2000B
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
    VALUES (N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋')

then selected the lengths and data lengths of each text field:

SELECT TOP (1000) [Id]
    ,[VarcharText],[VarcharTextSC],[VarcharUTF8]
    ,[NVarcharText],[NVarcharTextSC],[NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
    ,LEN([VarcharText]) VT,LEN([VarcharTextSC]) VTSC
    ,LEN([VarcharUTF8]) VU
    ,LEN([NVarcharText]) NVT,LEN([NVarcharTextSC]) NVTSC
    ,LEN([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
    ,DATALENGTH([VarcharText]) VT,DATALENGTH([VarcharTextSC]) VTSC
    ,DATALENGTH([VarcharUTF8]) VU
    ,DATALENGTH([NVarcharText]) NVT,DATALENGTH([NVarcharTextSC]) NVTSC
    ,DATALENGTH([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]

Select Lengths.png

I was surprised to find that the old mantra “a VARCHAR only stores single-byte characters” needs to be revised when using UTF-8 collations.
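The observed storage sizes can be reproduced outside SQL Server. This small Python sketch encodes the four test characters from the INSERTs above in UTF-8 (what a VARCHAR with a UTF-8 collation stores) and UTF-16 (what an NVARCHAR stores):

```python
# the characters used in the INSERTs above:
# 'a' (ASCII), 'ö' (Latin-1 Supplement), '囗' (U+56D7, BMP),
# '𠀋' (U+2000B, Supplementary Plane, a surrogate pair in UTF-16)
for ch in ("a", "ö", "囗", "𠀋"):
    utf8 = len(ch.encode("utf-8"))       # bytes in a UTF-8 VARCHAR
    utf16 = len(ch.encode("utf-16-le"))  # bytes in an NVARCHAR
    print(f"U+{ord(ch):05X}: {utf8} byte(s) UTF-8, {utf16} byte(s) UTF-16")
```

So a single character can cost anywhere from 1 to 4 bytes in a UTF-8 VARCHAR, and 2 or 4 bytes in an NVARCHAR, which matches the DATALENGTH results above.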

Table data only

Note that only table columns are associated with collations, not T-SQL variables, as you cannot declare a collation on a variable:

DECLARE @VarcharText VARCHAR(50), @NVarcharText NVARCHAR(50)

SELECT @VarcharText = [VarcharText], @NVarcharText = [NVarcharText]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharTextSC], @NVarcharText = [NVarcharTextSC]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)

SELECT @VarcharText = [VarcharUTF8], @NVarcharText = [NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText)
    , @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)


Select Variable Lengths.png


Retrieving Active GitHub Forks

The problem with GitHub repositories is that, whenever forks exist for a given repository, it is not immediately obvious which forks are active, containing newer commits, and which are just-for-fun forks that should have been starred rather than forked.

Of course, you can always navigate to that forked repository, and check for “commits ahead” and “commits behind”, but that’s really painful.

Apparently there are others who think such a feature would be useful,

but the functionality is still not built-in. Both tickets link to helper applications or scripts that might perform the task, though.

So I came across Active GitHub Forks, which lists stars, forks, and last-push date for each forked repository of a given base repository. (There are also other solutions, such as this Active GitHub Forks page.)

So I forked this repository and had a little look at the GitHub API. I learned that you can query the forks of a given repository, sorted by stars, newest, or oldest, but the list of forked repositories does not contain the relevant attributes.

Instead, you have to query each of the forked repositories separately to retrieve the attributes “total_commits”, “ahead_by”, and “behind_by”. And each separate query counts against GitHub’s rate limit. (No wonder, then, that some of the tools screen-scrape rather than query the API.)
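A minimal sketch of that per-fork querying, assuming Python with only the standard library: the endpoints are the GitHub REST API’s “list forks” and “compare two commits” calls, the default branch name and the helper names are my assumptions, and authentication, paging, and error handling are omitted:

```python
import json
import urllib.request

API = "https://api.github.com"

def get_json(url):
    # each call is one HTTP request and counts against the rate limit
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def fork_status(owner, repo, branch="master"):
    """Yield (fork_full_name, ahead_by, behind_by) for each fork."""
    for fork in get_json(f"{API}/repos/{owner}/{repo}/forks"):
        head = f"{fork['owner']['login']}:{branch}"
        cmp = get_json(f"{API}/repos/{owner}/{repo}/compare/{branch}...{head}")
        yield fork["full_name"], cmp["ahead_by"], cmp["behind_by"]

def summarize(full_name, ahead_by, behind_by):
    """Compact ahead/behind display for one fork."""
    return f"{full_name}: +{ahead_by}/-{behind_by}"
```

With n forks this costs n+1 requests, which illustrates why the rate limit becomes a problem for popular repositories.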

Visit my fork of Active GitHub Forks, which queries the forked repositories individually.

Active GitHub Forks screenshot

Updating YuJisho: a Unicode CJK Character web dictionary

My online CJK dictionary YuJisho got a facelift again – this time, from ASP.Net MVC3 to the current MVC 5, and from Bootstrap 2 to Bootstrap 4.

I hope this gets rid of these messages in the Google Search Console 😉:

  •  Text too small to read
  • Clickable elements too close together
  • Viewport not set

There is a little change though: in times of GDPR & co., queries to Wikipedias and Wiktionaries need to be invoked by clicking the “Query Wikipedia” button, rather than being run automatically.


click the button “Query Wikipedia”

results in links to various Wikipedias containing an article

If your browser / operating system fails to display certain Chinese characters, there is now a button “Load Glyphs” which tries to load the unsupported characters’ images as .svg from GlyphWiki.

after load glyphs

after load glyphs u

Please check the About page for more information.

Essential NuGet packages for ASP.Net MVC

Here are some useful packages I came across in recent years:

Topic                Package                      NuGet  Source
Logging              log4net                      nuget  apache
Error Logger         elmah.corelibrary            nuget  github
                     Elmah.Mvc                    nuget  github
                     ElmahAppender.log4net        nuget  github
Profiling            Glimpse                      nuget  github
                     Glimpse.AspNet               nuget
                     Glimpse.Mvc5                 nuget
                     NHibernate.Glimpse           nuget  github
Minifier, Bundler    BundleTransformer.Core       nuget  github
                     BundleTransformer.Yui        nuget  github
Data Access          NHibernate                   nuget  github
Identity Management  NHibernate.AspNet.Identity   nuget  github
Templating           Handlebars.Net               nuget  github
                     RazorEngine                  nuget  github
Documents            EPPlus                       nuget  github
                     CsvHelper                    nuget  github
                     PdfSharp                     nuget  pdfsharp
                     HtmlRenderer.PdfSharp        nuget  github
Other                DocuVieware Lite, an HTML5 Document Viewer For ASP.NET (homepage)

SMOscript for Windows 10

I recently got notified that my tool SMOscript raises an error message when run on Windows 10.

I had a look at the problem, and indeed, as soon as the enumeration of database objects (e.g. using the commands l (list) or s (script)) comes to process a table, the program stops. In case of the script command, an error message is issued:

Error: iterating objects: Attempt to retrieve data for object failed for Server ”.

The most recent version of SMOscript still targeted .Net 2.0 and used the SMO 100 assemblies. Time to upgrade! 😉

So I changed the .Net version to 4.5 and added the NuGet SMO package, then changed the .Net version to 4.7 to make the compiler happy.

Tests were performed successfully against SQL Server 2014 and 2019, but now SQL Server 2008 caused an error that I could work around.

The tool now also supports options to sort database objects by schema and name, or by type and schema and name, rather than relying on the order resulting from the SMO method Database.EnumObjects().

Additionally, the -st (Sort Types) parameter allows you to specify a custom sort order for database object types (rather than alphabetical), so that the command

smoscript -st Table,View

will process objects of types Table and View first, and everything else afterwards. The parameter -st without a list of types displays the available types.

The latest version of SMOscript is available for download here.

Note that there is a regular build “V 0.40”, but also a build using the original SMO 100 assemblies.

smoscript 0.40 command-line parameters

smoscript 0.40 sort SMO Types