pg_anonymize, a new extension for simple and transparent data anonymization

Posted on 2023-03-09 by Julien Rouhaud
Related Open Source

Taipei, Taiwan - March 9, 2023

pg_anonymize, a new extension for simple and transparent data anonymization

I'm pleased to announce the beta version of pg_anonymize.

pg_anonymize is a PostgreSQL extension that provides simple, robust and transparent infrastructure for data anonymization. Its goal is to ensure that anyone connected with an anonymized role will only ever see the anonymized version of the data without any restriction on the client used (could be psql, pg_dump, your own application or even other tools like pg_sample...) or the number of schemas and relations.

Anonymization is done using a declarative approach, relying on SECURITY LABELs: you only have to declare an expression for each column that should be anonymized that will dynamically perform the anonymization. Any valid expression yielding the correct type can be used, so you can write them in plain sql, plpgsql or any other procedural language of your choice.

This extension is compatible with PostgreSQL 10 and above. For more details on its usage please look at the usage section below or the documentation.

Feel free to open issue if you encounter any problem or want to ask for new features.

Configuration

pg_anonymize provides the following configuration options:

  • pg_anonymize.enabled (bool): allows to globally enable or disable pg_anonymize. The default value is on.

  • pg_anonymize.check_labels (bool): perform sanity checks (expression validity, read-only, returned type and lack of SQL injection) on the defined expression when declaring security labels. The default value is on.

NOTE: even if pg_anonymize.check_labels is disabled, pg_anonymize will still check that the defined expression doesn't contain any SQL injection.

Usage

pg_anonymize must be loaded before being able to use it. There are multiple ways to do it. Usually, only a few roles should require data anonymization, so the recommended way is to only load the extension for such roles. For instance, assuming the role alice should be used:

ALTER ROLE alice SET session_preload_libraries = 'pg_anonymize';

NOTE: only sessions opened by alice after this command has been successfully run will load pg_anonymize.

You can alternatively load it explicitly, for instance:

LOAD 'pg_anonymize';

NOTE: LOAD requires superuser privileges.

You then need to declare the wanted role(s) as needing anonymized data. This is done adding a SECURITY LABEL anonymize on the target role(s). For instance:

-- pg_anonymize need to be loaded before declaring SECURITY LABEL LOAD 'pg_anonymize'; SECURITY LABEL FOR pg_anonymize ON ROLE alice IS 'anonymize';

NOTE: declaring a SECURITY LABEL on a role requires CREATEROLE privilege.

You can then declare how to anonymize each column with SECURITY LABELS, defining an expression to replace the actual content.

For instance, assuming a simplistic customer table:

``` CREATE TABLE public.customer(id integer, first_name text, last_name text, birthday date, phone_number text);

INSERT INTO public.customer VALUES (1, 'Nice', 'Customer', '1970-03-04', '+886 1234 5678');

GRANT SELECT ON TABLE public.customer TO alice; ```

Let's anonymize the last name, birthday and phone number:

SECURITY LABEL FOR pg_anonymize ON COLUMN public.customer.last_name IS $$substr(last_name, 1, 1) || '*****'$$; SECURITY LABEL FOR pg_anonymize ON COLUMN public.customer.birthday IS $$date_trunc('year', birthday)::date$$; SECURITY LABEL FOR pg_anonymize ON COLUMN public.customer.phone_number IS $$regexp_replace(phone_number, '\d', 'X', 'g')$$;

NOTE: declaring a SECURITY LABEL on a column requires to be owner of the underying relation.

The alice role will now automatically see anonymized data. For instance:

``` -- current role sees the normal data =# SELECT * FROM public.customer; id | first_name | last_name | birthday | phone_number ----+------------+-----------+------------+---------------- 1 | Nice | Customer | 1970-03-04 | +886 1234 5678 (1 row)

-- but alice will see anonymized data =# \c - alice You are now connected to database "rjuju" as user "alice".

=> SELECT * FROM public.customer; id | first_name | last_name | birthday | phone_number ----+------------+-----------+------------+---------------- 1 | Nice | C* | 1970-01-01 | +XXX XXXX XXXX (1 row)

-- pg_dump will also see anonymized data $ pg_dump -U alice -t public.customer -a rjuju | grep "COPY" -A2 COPY public.customer (id, first_name, last_name, birthday, phone_number) FROM stdin; 1 Nice C* 1970-01-01 +XXX XXXX XXXX . ```