Re: Possible solution for masking chosen columns when using pg_dump

From: Виктория Шепард <we(dot)viktory(at)gmail(dot)com>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Олег Целебровский <oleg_tselebrovskiy(at)mail(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Possible solution for masking chosen columns when using pg_dump
Date: 2022-10-07 01:01:24
Message-ID: CAAB=fdMV6rQ=Hc19Jqfo_i-Si68mv6cS6Pxcy2jFbi4wy9UNUw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,
I took a look, here are several suggestions for improvement:

- Masking is not a main functionality of pg_dump and it is better to write
most of the connected things in a separate file like parallel.c or
dumputils.c. This will help slow down the growth of an already huge pg_dump
file.

- Also it can be hard to use a lot of different functions for different
fields, maybe it would be better to set up functions in a file.

- How will it work for the same field and tables in the different schemas?
Can we set up the exact schema for the field?

- misspelling in a word
>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/

- Why did you use 256 here?
> char* table = (char*) pg_malloc(256 * sizeof(char));
Also for malloc you need malloc on 1 symbol more because you have to store
'\0' symbol.

- Instead of addFuncToDatabase you can run your query using something
already defined from fe_utils/query_utils.c. And It will be better to set
up a connection only once and create all functions. Establishing a
connection is a resource-intensive procedure. There are a lot of magic
numbers, better to leave some comments explaining why there are 64 or 512.

- It seems that you are not using temp_string
> char *temp_string = (char*)malloc(256 * sizeof(char));

- Grammar issues
>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns
with this name),
>* name of masking function and name of schema containing this function
(public if not specified)
>*/
the name of its table

пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123(at)gmail(dot)com>:

> Hi,
>
> On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
> >
> > Hello, here's my take on masking data when using pg_dump
> >
> > The main idea is using PostgreSQL functions to replace data during a
> SELECT.
> > When table data is dumped SELECT a,b,c,d ... from ... query is
> generated, the columns that are marked for masking are replaced with result
> of functions on those columns
> > Example: columns name, count are to be masked, so the query will look as
> such: SELECT id, mask_text(name), mask_int(count), date from ...
> >
> > So about the interface: I added 2 more command-line options:
> >
> > --mask-columns, which specifies what columns from what tables will be
> masked
> > usage example:
> > --mask-columns "t1.name, t2.description" - both columns
> will be masked with the same corresponding function
> > or --mask-columns name - ALL columns with name "name" from
> all dumped tables will be masked with correspoding function
> >
> > --mask-function, which specifies what functions will mask data
> > usage example:
> > --mask-function mask_int - corresponding columns will be
> masked with function named "mask_int" from default schema (public)
> > or --mask-function my_schema.mask_varchar - same as above
> but with specified schema where the function is stored
> > or --mask-function somedir/filename - the function is
> "defined" here - more on the structure below
>
> FTR I wrote an extension POC [1] last weekend that does that but on the
> backend
> side. The main advantage is that it's working with any existing versions
> of
> pg_dump (or any client relying on COPY or even plain interactive SQL
> statements), and that the DBA can force a dedicated role to only get a
> masked
> dump, even if they forgot to ask for it.
>
> I only had a quick look at your patch but it seems that you left some todo
> in
> russian, which isn't helpful at least to me.
>
> [1] https://github.com/rjuju/pg_anonymize
>
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2022-10-07 01:04:26 Re: ExecRTCheckPerms() and many prunable partitions
Previous Message Tom Lane 2022-10-07 01:00:14 Re: Reducing the chunk header sizes on all memory context types