Anonymizing Personal Data

Databases used to collect all the data ever entered, and running out of disk space was solved by adding some new storage device.

Then came GDPR and companies storing personal data (or personally identifyable data) suddenly had to consider Data Deletion Policies and Data Retention Policies.

The easiest way is of course to delete the records storing obsolete personal data. But sometimes (read: almost always) it is not possible to delete the record, because the mere existence of a record may be important for historical or statistical reasons (e.g. the query “number of purchases per time interval and region” would almost certainly refer to a person and their location).

So it’s more realistic that certain data of a person is kept, some is erased, and the personal identifiers (first name, last name) are simply anonymized or pseudonymized.

What would be easier to replace the name fields with an arbitrary combination of letters?

To make the pseudonyms more beautiful, it would be nice if we considered a typical syllable structure, the most basic would be C-V-C (consonant vowel consonant), or any extension thereof: C-V-C-V-C, C-V-C-C-V, etc.

So we define the letters for consonants and vowels as

DECLARE @cons NVARCHAR(100) SET @cons = 'bcdfghjklmnpqrstvwxyz'
DECLARE @vow NVARCHAR(100) SET @vow = 'aeiouäöüaeiouaeiou'

and select a random consonant or vowel using the expressions

SUBSTRING(@cons, (ABS(CHECKSUM(NEWID())) % LEN(@cons))+1, 1)
SUBSTRING(@vow, (ABS(CHECKSUM(NEWID())) % LEN(@vow))+1, 1)

We use the function ABS(CHECKSUM(NEWID())) to generate random numbers rather than RAND() because this expression is evaluated for every record in a multi-record statement, rather than once. To illustrate the difference, compare the result of the query

SELECT RAND(), ABS(CHECKSUM(NEWID())) FROM sys.objects

Note that by repeating a letter inside either strings you can change the relative frequency of that letter in the resulting name.

By prepending ‘Zy’ to the generated string we can make sure that the generated names are easily recognize, and occur at the end of lists.

Replacing a persons name with a generated pseudonym may be as easy as this code:

DECLARE @cons NVARCHAR(100) SET @cons = 'bcdfghjklmnpqrstvwxyz'
DECLARE @vow NVARCHAR(100) SET @vow = 'aeiouäöüaeiouaeiou'
DECLARE @cons1 NVARCHAR(100) SET @cons1 = 'bcdfghjklmnpqrstvw'
DECLARE @vow1 NVARCHAR(100) SET @vow1 = 'aeiou'

UPDATE dbo.PersonTable
SET LastName = ' Zy' 
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1)
    + SUBSTRING(@vow, (ABS(CHECKSUM(NewId())) % LEN(@vow))+1, 1)
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1)
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1)
    + SUBSTRING(@vow, (ABS(CHECKSUM(NewId())) % LEN(@vow))+1, 1)
    + SUBSTRING(@cons, (ABS(CHECKSUM(NewId())) % LEN(@cons))+1, 1),
    FirstName = 'Zy'
    + SUBSTRING(@cons1, (ABS(CHECKSUM(NewId())) % LEN(@cons1))+1, 1)
    + SUBSTRING(@vow1, (ABS(CHECKSUM(NewId())) % LEN(@vow1))+1, 1)
    + SUBSTRING(@cons1, (ABS(CHECKSUM(NewId())) % LEN(@cons1))+1, 1),
WHERE [AnonymizeCondition] = 1

 

AnonymizeCondition refers to the condition you determined when a personal record needs to be anonymized, and has not yet been anonymized.

See the code in my GitHub repository with T-SQL snippets.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.