Eymoutiers, France, Mars 5, 2020
PostgreSQL Anonymizer is an extension that hides or replaces personally
identifiable information (PII) or commercially sensitive data from a PostgreSQL
The extension supports 3 different anonymization strategies: Dynamic Masking, In-Place Anonymization and Anonymous Dumps. It also offers a large choice of Masking Functions: Substitution, Randomization, Faking, Pseudonymization, Partial Scrambling, Shuffling, Noise Addition and Generalization.
Pseudonymization functions are similar to the faking functions in the sense that they generate realistic values. The main difference is that pseudonymization is deterministic : the functions will always return the same fake value based on a seed and an optional salt.
# SELECT anon.pseudo_email('Alice','salt123'); pseudo_email ------------------- email@example.com # SELECT anon.pseudo_email('Alice','salt123'); pseudo_email ------------------- firstname.lastname@example.org
WARNING : Pseudonymization is often confused with anonymization but in fact
they serve 2 different purposes. With pseudonymization, the real data can be
rebuilt using the pseudo data and the masking rule. If an attacker
gets access to these elements, he or she can easily re-identify some people
brute force or
dictionary attacks. Therefore, you should protect any
pseudonymized data with the same level of security that the original dataset.
The GDPR makes it very clear that personal data which have undergone
pseudonymization are still considered to be personal information.
anon.dump() function was slow and unpractical. It is now deprecated and
replace by a tool named
pg_dump_anon that you can use like the regular
$ pg_dump_anon -h localhost -U bob mydb > anonymous_dump.sql
It uses the same connections parameters that
pg_dump. The PostgreSQL
environment variables ($PGHOST, PGUSER, etc.) and
.pgpass are supported.
plain format is the only supported format. The other formats
tar) are not supported.
This extension makes it very easy to declare masking rules. But of course when you're creating an anonymization strategy, the hard part is to scan the database model to find which columns contains direct and indirect identifiers and then decide how these identifiers should be masked.
We now provide a
detect() function that will search for common identifiers
names based on a dictionary. For now, 2 dictionaries are available: English
('en_US') and French ('fr_FR'). By default the English dictionary is used:
# SELECT anon.detect('en_US'); table_name | column_name | identifiers_category | direct ------------+----------------+----------------------+-------- customer | CreditCard | creditcard | t customer | id | account_id | t vendor | Firstname | firstname | t
The identifier categories are based on the HIPAA classification.
This extension is officially supported on PostgreSQL 9.6 and later.
On Red Hat / CentOS systems, you can install it from the official PostgreSQL RPM repository:
$ yum install postgresql_anonymizer12
Then add 'anon' in the
shared_preload_libraries parameter of your
postgresql.conf file. And restart your instance.
For other systems, check out the install documentation :
WARNING: The project is at an early stage of development and should be used carefully.
This release includes code, bugfixes and ideas from Sebastien Delobel, Sam Buckingham, Thomas Clark, Joe Auty, Pierre-Henri Dubois Amy and Olleg Samoylov.
Many thanks to them !
This is an open project, contributions are welcome. We need your feedback and ideas ! Let us know what you think of this tool, how it fits your needs and what features are missing.
If you want to help, you can find a list of Junior Jobs.