PostgreSQL Anonymizer 0.5: Generalization and k-anonymity

Posted on 2019-11-06 by Dalibo

Eymoutiers, France, November 6, 2019

Postgresql Anonymizer is an extension that hides or replaces personally identifiable information (PII) or commercially sensitive data from a PostgreSQL database.

The extension supports 3 different anonymization strategies: Dynamic Masking, In-Place Anonymization and Anonymous Dumps. It also offers a large choice of Masking Functions: Substitution, Randomization, Faking, Partial Scrambling, Shuffling, Noise Addition and Generalization.

Generalization

The idea of generalization is to replace data with a broader, less accurate value. For instance, instead of saying "Bob is 28 years old", you can say "Bob is between 20 and 30 years old". This is interesting for analytics because the data remains true while avoiding the risk of re-identification.

PostgreSQL can handle generalization very easily with the RANGE data types, a very powerful way to store and manipulate a set of values contained between a lower and an upper bound.

Here's a basic table containing medical data:

SELECT * FROM patient;
     ssn     | firstname | zipcode |   birth    |    disease    
-------------+-----------+---------+------------+---------------
 253-51-6170 | Alice     |   47012 | 1989-12-29 | Heart Disease
 091-20-0543 | Bob       |   42678 | 1979-03-22 | Allergy
 565-94-1926 | Caroline  |   42678 | 1971-07-22 | Heart Disease
 510-56-7882 | Eleanor   |   47909 | 1989-12-15 | Acne

We want the anonymized data to remain true because it will be used for statistics. We can build a view upon this table to remove useless columns and generalize the indirect identifiers (zipcode and birthday):

CREATE MATERIALIZED VIEW generalized_patient AS
SELECT
  'REDACTED'::TEXT AS firstname,
  anon.generalize_int4range(zipcode,1000) AS zipcode,
  anon.generalize_daterange(birth,'decade') AS birth,
  disease
FROM patient;

This will give us a less accurate view of the data:

SELECT * FROM generalized_patient;
 firstname |    zipcode    |          birth          |    disease    
-----------+---------------+-------------------------+---------------
 REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease
 REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Allergy
 REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Heart Disease
 REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Acne

k-anonymity

k-anonymity is an industry-standard term used to describe a property of an anonymized dataset. The k-anonymity principle states that within a given dataset, any anonymized individual cannot be distinguished from at least k-1 other individuals. In other words, k-anonymity might be described as a "hiding in the crowd" guarantee. A low value of k indicates there's a risk of re-identification using linkage with other data sources.

You can evaluate the k-anonymity factor of a table in 2 steps :

1/ First defined the columns that are [indirect idenfiers] ( also known as "quasi identifers") like this:

SECURITY LABEL FOR anon ON COLUMN generalized_patient.zipcode 
IS 'INDIRECT IDENTIFIER';

SECURITY LABEL FOR anon ON COLUMN generalized_patient.birth 
IS 'INDIRECT IDENTIFIER';

2/ Once the indirect identifiers are declared :

SELECT anon.k_anonymity('generalized_patient')

In the example above, the k-anonymity factor of the generalized_patient materialized view is 2.

Lorem Ipsum

For TEXT and VARCHAR columns, you can now use the classic Lorem Ipsum generator:

  • anon.lorem_ipsum() returns 5 paragraphs
  • anon.lorem_ipsum(2) returns 2 paragraphs
  • anon.lorem_ipsum( paragraphs := 4 ) returns 4 paragraphs
  • anon.lorem_ipsum( words := 20 ) returns 20 words
  • anon.lorem_ipsum( characters := 7 ) returns 7 characters

How to Install

This extension is officially supported on PostgreSQL 9.6 and later.

On Red Hat / CentOS systems, you can install it from the

$ yum install postgresql_anonymizer12

Then add 'anon' in the shared_preload_libraries parameter of your postgresql.conf file. And restart your instance.

For other system, check out the install documentation :

https://postgresql-anonymizer.readthedocs.io/en/latest/INSTALL/

WARNING: The project is at an early stage of development and should be used carefully.

Thanks

This release includes code and ideas from Travis Miller, Jan Birk and Olleg Samoylov. Many thanks to them !

How to contribute

PostgreSQL Anonymizer is part of the Dalibo Labs initiative. It is mainly developed by Damien Clochard.

This is an open project, contributions are welcome. We need your feedback and ideas ! Let us know what you think of this tool, how it fits your needs and what features are missing.

If you want to help, you can find a list of Junior Jobs here:

https://gitlab.com/dalibo/postgresql_anonymizer/issues?label_name%5B%5D=Junior+Jobs

Third Party Open Source