Re: using an end-of-recovery record in all cases

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: using an end-of-recovery record in all cases
Date: 2022-04-20 17:02:24
Message-ID: 20220420170224.GA2579385@nathanxps13
Lists: pgsql-hackers

On Wed, Apr 20, 2022 at 09:26:07AM -0400, Robert Haas wrote:
> I was talking with Thomas Munro yesterday and he thinks there is a
> problem with relfilenode reuse here. In normal running, when a
> relation is dropped, we leave behind a 0-length file until the next
> checkpoint; this keeps that relfilenode from being used even if the
> OID counter wraps around. If we didn't do that, then imagine that
> while running with wal_level=minimal, we drop an existing relation,
> create a new relation with the same OID, load some data into it, and
> crash, all within the same checkpoint cycle, then we will be able to
> replay the drop, but we will not be able to restore the relation
> contents afterward because at wal_level=minimal they are not logged.
> Apparently, we don't create tombstone files during recovery because we
> know that there will be a checkpoint at the end.

In the example you provided, won't the tombstone file already be present
before the crash? During recovery, the tombstone file will be removed, and
the new relation wouldn't use the same relfilenode anyway. I'm probably
missing something obvious here.

I do see the problem if we drop an existing relation, crash, reuse the
relfilenode, and then crash again (all within the same checkpoint cycle). The
first recovery would remove the tombstone file, and the second recovery
would replay the drop and wipe out the new relation's files.
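
To make the double-crash sequence concrete, here is a toy Python model of it.
This is only a sketch, not PostgreSQL code: the Cluster class and its method
names are hypothetical, a relation is reduced to a list of rows keyed by
relfilenode, and it assumes recovery ends without a checkpoint (as proposed in
this thread), so a second recovery replays from the same redo point. Data
loaded into a new relation is deliberately not logged, as with
wal_level=minimal.

```python
class Cluster:
    """Toy model of relation files and WAL; not PostgreSQL code."""

    def __init__(self):
        self.files = {}          # relfilenode -> rows on disk ([] = empty file)
        self.tombstones = set()  # zero-length files kept until next checkpoint
        self.wal = []            # records written since the last checkpoint

    def checkpoint(self):
        # Tombstones are unlinked and the redo point moves forward.
        for rfn in self.tombstones:
            del self.files[rfn]
        self.tombstones.clear()
        self.wal.clear()

    def create(self, rfn, rows):
        # Any existing file, including a tombstone, blocks relfilenode reuse.
        if rfn in self.files:
            raise FileExistsError(rfn)
        self.wal.append(("CREATE", rfn))
        self.files[rfn] = list(rows)  # contents are NOT logged (minimal)

    def drop(self, rfn):
        # Normal running: log the drop and leave a zero-length tombstone.
        self.wal.append(("DROP", rfn))
        self.files[rfn] = []
        self.tombstones.add(rfn)

    def recover(self, tombstones_in_recovery):
        # Crash recovery: replay everything since the last checkpoint.
        # Assumption: no checkpoint at the end, so self.wal is kept.
        for kind, rfn in list(self.wal):
            if kind == "DROP":
                if tombstones_in_recovery:
                    self.files[rfn] = []
                    self.tombstones.add(rfn)
                else:
                    self.files.pop(rfn, None)
                    self.tombstones.discard(rfn)
            elif kind == "CREATE":
                self.files.setdefault(rfn, [])


def double_crash(tombstones_in_recovery):
    c = Cluster()
    c.create(1000, ["old row"])
    c.checkpoint()
    c.drop(1000)
    c.recover(tombstones_in_recovery)      # first crash
    try:
        c.create(1000, ["new row"])        # attempt to reuse the relfilenode
    except FileExistsError:
        return "reuse blocked"
    c.recover(tombstones_in_recovery)      # second crash
    return c.files.get(1000)
```

In this model, double_crash(False) leaves relfilenode 1000 as an empty file,
because the second recovery replays the old drop over the new relation. With
double_crash(True) the tombstone survives the first recovery and the reuse
never happens.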

> With the existing use of the end-of-recovery record, we always know
> that wal_level>minimal, because we're only using it on standbys. But
> with this use that wouldn't be true any more. So I guess we need to
> start creating tombstone files even during recovery, or else do
> something like what Dilip coded up in
> http://postgr.es/m/CAFiTN-u=r8UTCSzu6_pnihYAtwR1=esq5sRegTEZ2tLa92fovA@mail.gmail.com
> which I think would be a better solution at least in the long term.

IMO this would be good just to reduce the branching a bit. I suppose
removing the files immediately during recovery might be an optimization in
some cases, but I am skeptical that it really makes that much of a
difference in practice.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
