Re: using an end-of-recovery record in all cases

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: using an end-of-recovery record in all cases
Date: 2022-04-20 13:26:07
Message-ID: CA+TgmoZZDL_2E_zuahqpJ-WmkuxmUi8+g7=dLEny=18r-+c-iQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 19, 2022 at 4:38 PM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
> Shouldn't latestCompletedXid be set to MaxTransactionId in this case? Or
> is this related to the logic in FullTransactionIdRetreat() that avoids
> skipping over the "actual" special transaction IDs?

The problem here is this code:

/* also initialize latestCompletedXid, to nextXid - 1 */
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
LWLockRelease(ProcArrayLock);

If nextXid is 3, then latestCompletedXid ends up as 2. But in
GetRunningTransactionData:

Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid));

2 is FrozenTransactionId, which is not a normal XID, so that assertion
fails.
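
To make the arithmetic concrete, here is a minimal standalone sketch of the
retreat logic. It is not the PostgreSQL source: every Sketch*/sketch_* name
is hypothetical, and the retreat function only approximates what
FullTransactionIdRetreat() does, assuming that the first normal XID is 3 and
that the 64-bit retreat does not step over the special values at the very
bottom of the range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for PostgreSQL's types; simplified for illustration. */
typedef uint32_t SketchXid;
typedef struct { uint64_t value; } SketchFullXid;

#define SKETCH_FIRST_NORMAL_XID ((SketchXid) 3)  /* 0, 1 and 2 are special */

/* Same test as TransactionIdIsNormal(): normal XIDs start at 3. */
static bool
sketch_xid_is_normal(SketchXid xid)
{
    return xid >= SKETCH_FIRST_NORMAL_XID;
}

/* Approximation of FullTransactionIdRetreat(). */
static void
sketch_full_xid_retreat(SketchFullXid *dest)
{
    dest->value--;

    /* Below the first normal 64-bit XID: stop, special values are kept. */
    if (dest->value < SKETCH_FIRST_NORMAL_XID)
        return;

    /* Otherwise skip values whose low 32 bits look like a special XID. */
    while ((SketchXid) dest->value < SKETCH_FIRST_NORMAL_XID)
        dest->value--;
}

/* What latestCompletedXid would be initialized to for a given nextXid. */
static SketchXid
sketch_initial_latest_completed(uint64_t next_xid)
{
    SketchFullXid latest = { next_xid };

    sketch_full_xid_retreat(&latest);
    return (SketchXid) latest.value;
}
```

Under those assumptions, sketch_initial_latest_completed(3) returns 2, and
sketch_xid_is_normal(2) is false, which is the condition the Assert rejects.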

> Your reasoning seems sound to me.

I was talking with Thomas Munro yesterday and he thinks there is a
problem with relfilenode reuse here. In normal running, when a
relation is dropped, we leave behind a 0-length file until the next
checkpoint; this keeps that relfilenode from being reused even if the
OID counter wraps around. Imagine that we didn't do that, and that
while running with wal_level=minimal we drop an existing relation,
create a new relation with the same OID, load some data into it, and
crash, all within the same checkpoint cycle. Then we will be able to
replay the drop, but we will not be able to restore the relation
contents afterward, because at wal_level=minimal they are not logged.
Apparently, we don't create tombstone files during recovery because we
know that there will be a checkpoint at the end.
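
The role of the tombstone can be shown with a toy model. This is only an
in-memory sketch, not how the storage manager is written: the four-slot
relfilenode space, the SketchStorage struct, and all sketch_* functions are
hypothetical, and the allocator is assumed to skip any relfilenode whose
file still exists.

```c
#include <stdbool.h>

#define SKETCH_NUM_RELFILENODES 4   /* tiny space, so wraparound is easy */

typedef enum { SKETCH_ABSENT, SKETCH_LIVE, SKETCH_TOMBSTONE } SketchFileState;

typedef struct
{
    SketchFileState files[SKETCH_NUM_RELFILENODES];
    unsigned        next_candidate;  /* stands in for the OID counter */
} SketchStorage;

/* Pick the next relfilenode with no file on disk; returns -1 if none. */
static int
sketch_allocate(SketchStorage *s)
{
    for (int tries = 0; tries < SKETCH_NUM_RELFILENODES; tries++)
    {
        unsigned candidate = s->next_candidate;

        s->next_candidate = (s->next_candidate + 1) % SKETCH_NUM_RELFILENODES;
        if (s->files[candidate] == SKETCH_ABSENT)
        {
            s->files[candidate] = SKETCH_LIVE;
            return (int) candidate;
        }
    }
    return -1;
}

/* Drop: either leave a 0-length placeholder or remove the file at once. */
static void
sketch_drop(SketchStorage *s, int node, bool leave_tombstone)
{
    s->files[node] = leave_tombstone ? SKETCH_TOMBSTONE : SKETCH_ABSENT;
}

/* Checkpoint: the placeholders are finally removed. */
static void
sketch_checkpoint(SketchStorage *s)
{
    for (int i = 0; i < SKETCH_NUM_RELFILENODES; i++)
        if (s->files[i] == SKETCH_TOMBSTONE)
            s->files[i] = SKETCH_ABSENT;
}
```

In this model, a node dropped with leave_tombstone = true cannot be handed
out again until sketch_checkpoint() has run, even if next_candidate wraps
back to it; with leave_tombstone = false it can be reused immediately, which
is the dangerous case described above.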

With the existing use of the end-of-recovery record, we always know
that wal_level>minimal, because we're only using it on standbys. But
with the use proposed here, that would no longer be true. So I guess we need to
start creating tombstone files even during recovery, or else do
something like what Dilip coded up in
http://postgr.es/m/CAFiTN-u=r8UTCSzu6_pnihYAtwR1=esq5sRegTEZ2tLa92fovA@mail.gmail.com
which I think would be a better solution at least in the long term.

--
Robert Haas
EDB: http://www.enterprisedb.com
