Re: Re[2]: Possible solution for masking chosen columns when using pg_dump

From: Виктория Шепард <we(dot)viktory(at)gmail(dot)com>
To: Олег Целебровский <oleg_tselebrovskiy(at)mail(dot)ru>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Julien Rouhaud <rjuju123(at)gmail(dot)com>
Subject: Re: Re[2]: Possible solution for masking chosen columns when using pg_dump
Date: 2022-10-23 23:24:38
Message-ID: CAAB=fdNdtsDQV-rKeKtDQatbzAaDeN-DDaMwHPAs2BJBwdjrLA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thank you, Oleg Tselebrovskiy, for your valuable review, here are the fixes

Best regards,
Viktoria Shepard

ср, 12 окт. 2022 г. в 12:19, Виктория Шепард <we(dot)viktory(at)gmail(dot)com>:

> Hi,
>
> Here is an idea of how to read masking options from a file. Please, take a
> look.
>
> пн, 10 окт. 2022 г. в 14:54, Олег Целебровский <oleg_tselebrovskiy(at)mail(dot)ru
> >:
>
>> Hi,
>>
>> I applied most of suggestions: used separate files for most of added
>> code, fixed typos/mistakes, got rid of that pesky TODO that was already
>> implemented, just not deleted.
>>
>> Added tests (and functionality) for cases when you need to mask columns
>> in tables with the same name in different schemas. If schema is not
>> specified, then columns in all tables with specified name are masked
>> (Example - pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will
>> mask all ids in all tables with names ‘t0’ in all existing schemas).
>>
>> Wrote comments for all ‘magic numbers’
>>
>> About that
>>
>> >- Also it can be hard to use a lot of different functions for different
>> fields, maybe it would be better to set up functions in a file.
>>
>> I agree with that, but I know about at least 2 other patches (both are
>> WIP, but still) that are interacting with reading command-line options from
>> file. And if everyone will write their own version of reading command-line
>> options from file, it will quickly get confusing.
>>
>> A solution to that problem is another patch that will put all options
>> from file (one file for any possible options, from existing to future ones)
>> into **argv in main, so that pg_dump can process them as if they came form
>> command line.
>>
>>
>> Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <
>> we(dot)viktory(at)gmail(dot)com>:
>>
>> Hi,
>> I took a look, here are several suggestions for improvement:
>>
>> - Masking is not a main functionality of pg_dump and it is better to
>> write most of the connected things in a separate file like parallel.c or
>> dumputils.c. This will help slow down the growth of an already huge pg_dump
>> file.
>>
>> - Also it can be hard to use a lot of different functions for different
>> fields, maybe it would be better to set up functions in a file.
>>
>> - How will it work for the same field and tables in the different
>> schemas? Can we set up the exact schema for the field?
>>
>> - misspelling in a word
>> >/*
>> >* Add all columns and funcions to list of MaskColumnInfo structures,
>> >*/
>>
>> - Why did you use 256 here?
>> > char* table = (char*) pg_malloc(256 * sizeof(char));
>> Also for malloc you need malloc on 1 symbol more because you have to
>> store '\0' symbol.
>>
>> - Instead of addFuncToDatabase you can run your query using something
>> already defined from fe_utils/query_utils.c. And It will be better to set
>> up a connection only once and create all functions. Establishing a
>> connection is a resource-intensive procedure. There are a lot of magic
>> numbers, better to leave some comments explaining why there are 64 or 512.
>>
>> - It seems that you are not using temp_string
>> > char *temp_string = (char*)malloc(256 * sizeof(char));
>>
>> - Grammar issues
>> >/*
>> >* mask_column_info_list contains info about every to-be-masked column:
>> >* its name, a name its table (if nothing is specified - mask all columns
>> with this name),
>> >* name of masking function and name of schema containing this function
>> (public if not specified)
>> >*/
>> the name of its table
>>
>>
>> пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123(at)gmail(dot)com
>> <//e.mail.ru/compose/?mailto=mailto%3arjuju123(at)gmail(dot)com>>:
>>
>> Hi,
>>
>> On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>> >
>> > Hello, here's my take on masking data when using pg_dump
>> >
>> > The main idea is using PostgreSQL functions to replace data during a
>> SELECT.
>> > When table data is dumped SELECT a,b,c,d ... from ... query is
>> generated, the columns that are marked for masking are replaced with result
>> of functions on those columns
>> > Example: columns name, count are to be masked, so the query will look
>> as such: SELECT id, mask_text(name), mask_int(count), date from ...
>> >
>> > So about the interface: I added 2 more command-line options:
>> >
>> > --mask-columns, which specifies what columns from what tables will be
>> masked
>> > usage example:
>> > --mask-columns "t1.name, t2.description" - both columns
>> will be masked with the same corresponding function
>> > or --mask-columns name - ALL columns with name "name" from
>> all dumped tables will be masked with correspoding function
>> >
>> > --mask-function, which specifies what functions will mask data
>> > usage example:
>> > --mask-function mask_int - corresponding columns will be
>> masked with function named "mask_int" from default schema (public)
>> > or --mask-function my_schema.mask_varchar - same as above
>> but with specified schema where the function is stored
>> > or --mask-function somedir/filename - the function is
>> "defined" here - more on the structure below
>>
>> FTR I wrote an extension POC [1] last weekend that does that but on the
>> backend
>> side. The main advantage is that it's working with any existing versions
>> of
>> pg_dump (or any client relying on COPY or even plain interactive SQL
>> statements), and that the DBA can force a dedicated role to only get a
>> masked
>> dump, even if they forgot to ask for it.
>>
>> I only had a quick look at your patch but it seems that you left some
>> todo in
>> russian, which isn't helpful at least to me.
>>
>> [1] https://github.com/rjuju/pg_anonymize
>>
>>
>>
>>
>>
>

Attachment Content-Type Size
0001-C4a-pg-dump-masking-option.patch text/x-patch 64.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-10-23 23:49:16 Re: Multiple grouping set specs referencing duplicate alias
Previous Message Maciek Sakrejda 2022-10-23 22:48:09 Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)