From: Kirk Wolak <wolakk(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Nikolay Samokhvalov <nik(at)postgres(dot)ai>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Pavlo Golub <Pavlo(dot)Golub(at)cybertec(dot)at>
Subject: PoC: pg_dump --filter-data <file> (like Oracle Where Clause on RMAN for specific tables)
Date: 2025-08-15 16:10:58
Message-ID: CACLU5mS07WGPpq6=m8aC5tUMusNpL8FhBPnYem86iSawEgcavw@mail.gmail.com
Lists: pgsql-hackers
Hackers,
I am asking for feedback on this. We just migrated a client to PG and
all is well, except that the developers have requested a copy of recent
production data for testing.
We have a handful of tables that hold 90% of the data, going back 30 years.
We NEVER restore this data to Dev or Staging. On Oracle we used a special
RMAN backup in which these tables had a "WHERE clause" applied to them
during the backup/dump process.
It would be awesome if pg_dump offered something similar. I am willing
to code this up, but there are questions. And I want to do it in a way
that makes the most sense to the PG Community.
My initial thought is a simple filtering file, of the format:
schema.table = WHERE ...
"schema"."Table" = WHERE ...
"schema"."t1" = LIMIT 50000
If --filter-data <filename> is not specified, nothing changes. If it is
specified, pg_dump would check whether each table appears in the file with
a filter; if so, it would read that filter and apply it to the resulting
COPY command.
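To make that concrete, here is a rough sketch (table, column, and predicate are made up; the COPY that pg_dump actually emits also lists the columns explicitly). A filter line such as

    public.orders = WHERE order_date >= now() - interval '2 years'

would change the data step for that one table from roughly

    COPY public.orders TO stdout;

to

    COPY (SELECT * FROM public.orders WHERE order_date >= now() - interval '2 years') TO stdout;

Everything else in the dump would be untouched.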
I don't believe this impacts pg_restore. But if we wanted the restore
process to carry some kind of message that "pg_dump --filter-data was
used, this is NOT a complete dump!", I would appreciate that and would
include it in pg_restore as part of this work.
Just to make the point: the full pg_dump takes 60 minutes (during which
we must turn off certain features to avoid throwing errors / locking
issues). Excluding these tables brings it down to 2-3 minutes, and
separately dumping limited versions of them with \COPY takes another
3 minutes. Frankly, we have enough backups of the many years of old data;
we don't need daily snapshots of it, or to carry it around.
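For reference, the side dump we do today is essentially a filtered \copy from psql, something like (names illustrative):

    \copy (SELECT * FROM public.orders WHERE order_date >= now() - interval '2 years') TO 'orders_recent.copy'

with a matching \copy public.orders FROM 'orders_recent.copy' on the target once the (data-less) table has been restored.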
Thanks in advance. I am hoping that I am not the only one who would
benefit from a filtered dump (as opposed to all or nothing).
Finally, I considered allowing an entire query, which would permit data
masking and more complex queries where FKs are involved. But that seemed
like a much bigger ask (and a potential foot-gun).
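For illustration only, a full-query filter could look like this (made-up table and columns), masking a column on the way out:

    COPY (SELECT id,
                 md5(email) AS email,   -- mask PII during the dump
                 created_at
            FROM public.customers
           WHERE created_at >= now() - interval '1 year') TO stdout;

I am not proposing that here; the WHERE/LIMIT form above covers our actual need.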
PS: A quick-hack feature we could leverage would be a flag so that
--exclude-table-data still generates an EMPTY .dat file, plus the internal
toc.dat reference to load that file... THEN we would simply overwrite that
file with the output of our manual \COPY command. This would be an almost
trivial change and would work for what we do/need. BUT it feels a bit "off".
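Spelled out, and assuming that hacked behavior existed, the workflow would be roughly as follows (names are placeholders; whether plain \copy text output matches the .dat layout byte-for-byte would need checking):

    # directory format, uncompressed so the .dat files are plain COPY text
    pg_dump -Fd -Z 0 -f dump_dir --exclude-table-data='public.orders' mydb

    # then overwrite the empty data file that toc.dat points at for that table
    psql -d mydb -c "\copy (SELECT * FROM public.orders WHERE order_date >= now() - interval '2 years') TO 'dump_dir/NNNN.dat'"

where NNNN.dat stands for the placeholder data file the hacked flag would have created.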