Possible solution for masking chosen columns when using pg_dump

From: Олег Целебровский <oleg_tselebrovskiy(at)mail(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Possible solution for masking chosen columns when using pg_dump
Date: 2022-10-03 15:30:17
Message-ID: 1664811017.111625872@f404.i.mail.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello, here's my take on masking data when using pg_dump
 
The main idea is using PostgreSQL functions to replace data during a SELECT.
When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
 
So about the interface: I added 2 more command-line options: 
 
--mask-columns, which specifies what columns from what tables will be masked 
    usage example:
            --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
            or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
 
--mask-function, which specifies what functions will mask data
    usage example:
            --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
            or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
            or --mask-function somedir/filename - the function is "defined" here - more on the structure below
 
Structure of the file with function description:
 
First row - function name (with or without schema name)
Second row - type of in and out value (the design is to only work with same input/output type so no int-to-text shenanigans)
Third row - language of function
Forth and later rows - body of a function
 
Example of such file:
 
mask_text
text
plpgsql
res := '***';
 
First iteration of using file-described functions used just plain SQL query, but since it executed during read-write connection, some things such as writing "DROP TABLE t1;" after the CREATE FUNCTION ...; were possible.
Now even if something harmful is written in function body, it will be executed during dump-read-only connection, where it will just throw an error
 
About "corresponding columns and functions" - masking functions and columns are paired with eachother based on the input order, but --masking-columns and --masking-functions don't have to be subsequent.
Example: pg_dump -t table_name --mask-columns name --mask-colums count --mask-function mask_text --mask-function mask_int - here 'name' will be paired with function 'mask_text' and 'count' with 'mask_int' 
 
Patch includes regression tests
 
I'm open to discussion of this patch
 
Best regards,
 
Oleg Tselebrovskiy

Attachment Content-Type Size
masking_for_pg_dump_v1.patch text/x-diff 24.8 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2022-10-03 15:44:53 Re: Possible solution for masking chosen columns when using pg_dump
Previous Message vignesh C 2022-10-03 15:11:07 Miscellaneous tab completion issue fixes