Re: Suggestion for --truncate-tables to pg_restore

From: "Karl O(dot) Pinc" <kop(at)meme(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Suggestion for --truncate-tables to pg_restore
Date: 2012-09-21 15:54:05
Message-ID: 1348242845.21480.0@mofo
Lists: pgsql-hackers

On 09/20/2012 12:24:49 PM, Karl O. Pinc wrote:

> I've had problems using pg_restore --data-only when
> restoring individual schemas (which contain data which
> has had bad things done to it). --clean does not work
> well because of dependent objects in other schemas.

Before doing any more work I want to report on the
discussions that took place at the code sprint at
Postgres Open in Chicago. Because I'm going to add
additional thoughts of my own, and to avoid misrepresenting
anybody's opinion, I'll not mention who said what.
Feel free to step forward and claim Ingenious Ideas
as your own. Likewise I apologize if lack of attribution
makes it more difficult to discern (my) uninformed drivel
from intelligent insight.

----

First, the problem:

Begin with the following structure:

CREATE TABLE schemaA.foo (id INT PRIMARY KEY, data INT);

CREATE VIEW schemaB.bar AS SELECT * FROM schemaA.foo;

Then, by accident, somebody does:

UPDATE schemaA.foo SET data = data + (RANDOM() * 1000)::INT;

So, you want to restore the data into schemaA.foo.
But schemaA.foo has (bad) data in it that must first
be removed. It would seem that using

pg_restore --clean -n schemaA -t foo my_pg_dump_backup

would solve the problem: it would drop schemaA.foo,
recreate it, and then restore the data. But this does
not work; schemaA.foo will not drop because it has
a dependent database object, schemaB.bar.

Of course there are manual work-arounds. One of these
is truncating schemaA.foo and then doing a pg_restore
with --data-only. The manual work-arounds become
increasingly burdensome as you need to restore more
tables. The case that motivated me was an attempt
to restore the data in an entire schema, one which
contained a significant number of tables.
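
As a concrete sketch of that workaround, using the example above
(the target database name "mydb" is only illustrative):

psql -d mydb -c 'TRUNCATE schemaA.foo'
pg_restore --data-only -n schemaA -t foo -d mydb my_pg_dump_backup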

So, the idea here is to be able to do a data-only
restore, first truncating the data in the tables
being restored to remove the existing corrupted data.

The proposal is to add a --truncate-tables option
to pg_restore.
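
With that option the same restore becomes a single step, something
like (the exact spelling is, of course, what is up for discussion):

pg_restore --data-only --truncate-tables -n schemaA -t foo -d mydb my_pg_dump_backup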

----

There were some comments on syntax.

I proposed to use -u as a short option. This was
thought confusing, given its use in other
Unix command-line programs (e.g. mysql). Since there's
no obvious short option, forget it. Just have
a long option.

Another choice is to avoid introducing yet another
option and instead overload --clean so that, when
doing a --data-only restore, --clean truncates tables,
while otherwise --clean retains its existing behavior of
dropping and re-creating the restored objects.
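
Under that overloading the example restore would be spelled
something like:

pg_restore --clean --data-only -n schemaA -t foo -d mydb my_pg_dump_backup

and --clean would truncate schemaA.foo rather than try to drop it.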

(I tested pg_restore with 9.1: when --data-only is
used --clean is ignored; it does not even produce a warning.
This is arguably a bug.)

----

More serious objections were raised regarding semantics.

What if, instead, the initial structure looked like:

CREATE TABLE schemaA.foo
(id INT PRIMARY KEY, data INT);

CREATE TABLE schemaB.bar
(id INT CONSTRAINT "bar_on_foo" REFERENCES schemaA.foo
, moredata INT);

With a case like this, in most real-world situations, you'd
have to use pg_restore with --disable-triggers if you wanted
to use --data-only and --truncate-tables. The possibility of
foreign key referential integrity corruption is obvious.
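
Concretely, with the proposed option the restore would look
something like:

pg_restore --data-only --truncate-tables --disable-triggers -n schemaA -t foo -d mydb my_pg_dump_backup

and any row of schemaB.bar whose id no longer appears in the
restored schemaA.foo silently violates bar_on_foo.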

Aside: Unless you're restoring databases in their entirety
the pg_restore --disable-triggers option makes it easy to
introduce foreign key referential integrity corruption.
In fact, since pg_restore does not, by default, wrap its
operations in one big transaction, it's easy to attempt
restoration of a portion of a database, have part of the process
succeed and part of it fail (due to either schema
or data dependencies), and be left worse off
than before you started. The pg_restore docs might
benefit from a big fat warning regarding
attempts to restore less than an entire database.
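
For what it's worth, orphaned rows left behind by such a restore
can be found after the fact with an ordinary anti-join; for the
example above, something like:

SELECT b.id
  FROM schemaB.bar AS b
  LEFT JOIN schemaA.foo AS f ON f.id = b.id
 WHERE f.id IS NULL;

But nothing forces anyone to run such a check.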

So, the discussion went, pg_restore is just another
application and introducing more options
which could lead to corruption of referential integrity is
a bad idea.

But pg_restore should not be thought of as just another
front-end. It should be thought of as a data recovery
tool. Recovering some data and being left with referential
integrity problems is better than having no data. This
is true even if, due to different users owning different
schemas and so forth, nobody knows exactly what
might be broken.

Yes, but we can do better. (The unstated sub-text being that
we don't want to introduce an inferior feature which
will then need to be supported forever.)

How could we do better:

Here I will record only the ideas related to restore,
although there was some mention of dump as well.

There has apparently been some discussion of writing
a foreign data wrapper which would operate on a database
dump. This might (in ways that are not immediately
obvious to me) address this issue.

The restore process could, based on what table data needs
restoration, look at foreign key dependencies and produce a
list of the tables which must all be restored in order to
ensure foreign key referential integrity. In the case of
restoration into an empty database the foreign key
dependencies must be calculated from the dump. (An
"easy" way to do this would be to create
all the database objects in some temporary place and query
the system catalogs to produce the dependency graph.)
In the case of restoration into an
existing database the foreign key dependencies should
come from the database into which the data is to be restored.
(This is necessary to capture dependencies which may have
been introduced after the dump was taken.)
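
In the existing-database case the walk over the foreign key graph
could be little more than a recursive query over pg_constraint.
A rough sketch, starting from the restored table in the earlier
example (the query and its starting point are illustrative only):

WITH RECURSIVE fk_closure(relid) AS (
    -- start with the table(s) whose data is being restored
    SELECT 'schemaA.foo'::regclass::oid
  UNION
    -- add any table holding a foreign key that points at a
    -- table already in the set
    SELECT c.conrelid
      FROM pg_constraint AS c
      JOIN fk_closure AS f ON c.confrelid = f.relid
     WHERE c.contype = 'f'
)
SELECT relid::regclass AS must_restore FROM fk_closure;

This walks only one direction (tables referencing the restored
tables); a complete answer presumably needs the tables that the
restored tables themselves reference as well.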

The above applies to data-only restoration. When restoring the
database schema meta-information (object definitions) a similar
graph of database object dependencies must be produced and used
to determine what needs to be restored.

But when doing a partial data-only restore there is more
to data integrity than just foreign key referential integrity.
Other constraints and triggers enforce other sorts of
data integrity rules. It is not enough to leave
triggers turned on when restoring data. Data that is not
restored may have been validated, by triggers that fire only
when the un-restored tables are modified, against table content
that the restore then changes; nothing re-checks it afterwards.
The only solution I can see is to also include in the
computed set of tables which require restoration those
tables having triggers which reference any of the restored
data.

Just how far should pg_restore go in attempting to
preserve data integrity?

----

Two things are clear:

The current table and schema oriented options for backing
up and restoring portions of databases are flawed with
respect to data integrity.

Life is complicated.

Where should I go from here? I am not now in a position to
pursue anything more complicated than completing the code to
add a --truncate-tables option to pg_restore. Should I
finish this and send in a patch?

Karl <kop(at)meme(dot)com>
Free Software: "You don't pay back, you pay forward."
-- Robert A. Heinlein
