Re: [SQL] Duplicate tuples with unique index

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Palle Girgensohn <girgen(at)partitur(dot)se>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Patrik Kudo <kudo(at)partitur(dot)se>, "pgsql-sql(at)postgreSQL(dot)org" <pgsql-sql(at)postgreSQL(dot)org>
Subject: Re: [SQL] Duplicate tuples with unique index
Date: 2000-01-26 05:38:43
Message-ID: 16091.948865123@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Palle Girgensohn <girgen(at)partitur(dot)se> writes:
>>>> Nope. pg_upgrade was "disabled in this release because the
>>>> internal blahblahblah disk layout changed from previous versions".
>>
>> Sorry, you have to edit the script to reenable it.

> That's OK. I didn't bother to check. the dump/restore sequence was pretty quick.
> The layout hasn't really changed since 6.5, right?

No; the disk layout is the same. There is room for trouble nonetheless.
There is some doubt about whether pg_upgrade will work right under MVCC
semantics: in MVCC, whether a tuple stored in a table is considered
valid or not will depend on whether its creator transaction is marked
committed in pg_log (and, if it is marked with a deletor transaction,
whether the deletor is NOT committed).

pg_upgrade tries to deal with this by copying the old database
installation's pg_log into the new. OK, that ensures that all the
user-table tuples preserve their commit state; but what about tuples
in the system tables? The idea behind pg_upgrade is to paste together
a set of user tables with a set of system tables --- but unless the
transaction number history of the user tables is exactly the same as
the transaction number history of the system tables, there is clearly
a risk that committed tuples will suddenly be considered not-committed
or vice versa. And ordinarily those histories will *not* be identical.

The only reason pg_upgrade has any chance at all of working is that
for efficiency reasons we don't want to go back and consult pg_log
for every single tuple we read. So, there are really six states of
a tuple on disk, which may be described as:
1. I was created by transaction N, but I dunno if it committed
2. I was created by a transaction that definitely committed,
so I'm good; no need to look at pg_log
3. I was created by a transaction that definitely aborted,
so I'm dead; no need to look at pg_log
4. I was deleted by transaction N, but I dunno if it committed
5. I was deleted by a transaction that definitely committed,
so I'm dead; no need to look at pg_log
6. I was deleted by a transaction that definitely aborted,
so I'm still good; no need to look at pg_log
(I'm fuzzing over some fine points that arise when multiple transactions
try to delete the same tuple, but this level of detail will do for now.)
Whenever any backend examines a tuple in state 1 or 4, it will consult
pg_log to discover the state of the source transaction. If the source
transaction has now committed or aborted, the tuple will be rewritten
with the new state (2,3,5,6 as appropriate) so that future readers of
the tuple don't have to look at pg_log again.

Now states 2,3,5,6 do not depend on what pg_log says. Therefore,
pg_upgrade's copy of an old pg_log file into a new database will
work *if and only if* all the tuples in the system tables are in
one of these states, and none of them are in states 1 or 4. We
can survive with user-table tuples that are in the uncertain states,
because the associated pg_log info will be copied over with them.
We cannot afford to have any uncertainly-committed system-table
tuples, because we will overwrite the pg_log data about their status.

Bruce thinks that the pg_upgrade script will ensure that the system-
table tuples are all in frozen states (by VACUUMing them). I don't
trust it worth a dime, myself. Maybe it will work, but it hasn't been
proven in the field. So, if you'd like to try it, by all means do so
--- but make a pg_dump backup first! And let us know whether you have
problems or not!

regards, tom lane

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Bruce Momjian 2000-01-26 05:43:22 Re: [SQL] Duplicate tuples with unique index
Previous Message Palle Girgensohn 2000-01-26 00:37:01 Re: [SQL] Duplicate tuples with unique index