Re: Anonymized database dumps

From: Janning Vygen <vygen(at)kicktipp(dot)de>
To: Bill Moran <wmoran(at)potentialtech(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Anonymized database dumps
Date: 2012-03-19 14:45:28
Message-ID: 3EC70397-81C3-42E3-B29E-69F8A9A2C084@kicktipp.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Am 19.03.2012 um 13:22 schrieb Bill Moran <wmoran(at)potentialtech(dot)com>:

> In response to Janning Vygen <vygen(at)kicktipp(dot)de>:
>>
>> I am working on postgresql 9.1 and loving it!
>>
>> Sometimes we need a full database dump to test some performance issues
>> with real data.
>>
>> Of course we don't like to have sensible data like bunches of e-mail
>> addresses on our development machines as they are of no interest for
>> developers and should be kept secure.
>>
>> So we need an anonymized database dump. I thought about a few ways to
>> achieve this.
>>
>> 1. Best solution would be a special db user and some rules which fire on
>> reading some tables and replace privacy data with some random data. Now
>> doing a dump as this special user doesn't even copy the sensible data at
>> all. The user just has a different view on this database even when he
>> calls pg_dump.
>>
>> But as rules are not fired on COPY it can't work, right?
>>
>> 2. The other solution I can think of is something like
>>
>> pg_dump | sed > pgdump_anon
>>
>> where 'sed' does a lot of magical replace operations on the content of
>> the dump. I don't think this is going to work reliable.
>>
>> 3. More reliable would be to dump the database, restore it on a
>> different server, run some sql script which randomize some data, and
>> dump it again. hmm, seems to be the only reliable way so far. But it is
>> no fun when dumping and restoring takes an hour.
>>
>> Does anybody has a better idea how to achieve an anonymized database dump?
>
> I highly recommend #3. It's how we do it where I work.
>
> At first it seems like a big, slow, complicated monster, but once you've
> built the tools and have it running reliably it's very nice. Our system
> does the dumps overnight via cron (we have over 100 production databases)
> then changes the sensitive data, as well changing all the passwords to
> "password" so developers can easily log in as any account. During the
> day, the developers have access to all the sanitized dump files and can
> use them to make as many testing databases as they need. Yes, the data
> gets up to 24 hours out of date, but it's never been a problem for us.

Thanks for your response and your insights to your process. Sounds reasonable.

Regards
Janning

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Florent THOMAS 2012-03-19 15:28:10 Conditionnal validation for transaction
Previous Message Tom Lane 2012-03-19 14:28:38 Re: WHERE IN (subselect) versus WHERE IN (1,2,3,)