From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Make relfile tombstone files conditional on WAL level
Date: 2021-08-02 20:03:31
Message-ID: CA+Tgmoa0dBdYhm2EWWLmu9Kh9qhibPun1zLa-aCg+p6q+AQAHA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 10, 2021 at 6:47 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> It would indeed be nice to have some other mechanism to prevent the
> issue with wal_level=minimal, the tombstone files feel hacky and
> complicated. Maybe a new shared memory hash table to track the
> relfilenodes of dropped tables.

Just to summarize the issue here as I understand it, if a relfilenode
is used for two unrelated relations during the same checkpoint cycle
with wal_level=minimal, and if the WAL-skipping optimization is
applied to the second of those but not to the first, then crash
recovery will lose our only copy of the new relation's data, because
we'll replay the removal of the old relfilenode but will not have
logged the new data. Furthermore, we've wondered about writing an
end-of-recovery record in all cases rather than sometimes writing an
end-of-recovery record and sometimes a checkpoint record. That would
allow another version of this same problem, since a single checkpoint
cycle could now span multiple server lifetimes. At present, we dodge
all this by keeping the first segment of the main fork around as a
zero-length file for the rest of the checkpoint cycle, which I think
prevents the problem in both cases. Now, apparently that caused some
problems with the AIO patch set, so Thomas is curious about getting
rid of it, and Heikki concurs that it's a hack.
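
For anyone who wants to refresh their memory, the hack lives in
mdunlinkfork() in md.c; a simplified sketch of the logic (from
memory, not the exact code) looks something like this:

    /* Simplified sketch of the tombstone logic in mdunlinkfork(). */
    if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
    {
        /* No tombstone needed; just remove the file. */
        unlink(path);
    }
    else
    {
        /*
         * Truncate the first segment of the main fork to zero length
         * instead of unlinking it. The surviving empty file prevents
         * GetNewRelFileNode() from handing out the same relfilenode
         * again, and the real unlink is queued to happen after the
         * next checkpoint.
         */
        int     fd = OpenTransientFile(path, O_RDWR | PG_BINARY);

        if (fd >= 0)
        {
            ftruncate(fd, 0);
            CloseTransientFile(fd);
        }
        register_unlink_segment(rnode, forkNum, 0 /* first segment */);
    }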

I guess my concern about this patch is that it just seems to be
reducing the number of cases where that hack is used without actually
getting rid of it. Rarely-taken code paths are more likely to have
undiscovered bugs, and that seems particularly likely in this case,
because this is a low-probability scenario to begin with. A lot of
clusters probably never experience an OID counter wraparound at all,
and even in those that do, getting an OID collision with just the
right timing followed by a crash before a checkpoint can intervene
has got to be super-unlikely. Even as things are today, if this
mechanism has subtle
bugs, it seems entirely possible that they could have escaped notice
up until now.

So I spent some time thinking about the question of getting rid of
tombstone files altogether. I don't think that Heikki's idea of a
shared memory hash table to track dropped relfilenodes can work. The
hash table will have to be of some fixed size N, and whatever the
value of N, the approach will break down if N+1 relfilenodes are
dropped in the same checkpoint cycle.
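
Just to spell that out, a hypothetical implementation (all of these
names are invented) would have to do something like this, and it's
the error path that kills the idea:

    /* Hypothetical sketch only; none of this exists in the tree. */
    typedef struct DroppedRelEntry
    {
        RelFileNode rnode;          /* hash key: the dropped relfilenode */
    } DroppedRelEntry;

    static HTAB *DroppedRelHash;    /* fixed-size shared-memory hash */

    static void
    RememberDroppedRelFileNode(RelFileNode rnode)
    {
        bool    found;

        if (hash_search(DroppedRelHash, &rnode,
                        HASH_ENTER_NULL, &found) == NULL)
            elog(ERROR, "too many relfilenodes dropped in one checkpoint cycle");
    }

Failing a DROP because some fixed-size table filled up isn't
acceptable, and the only fallback I can think of, forcing an
immediate checkpoint, has its own problems.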

The two most principled solutions to this problem that I can see are
(1) remove wal_level=minimal and (2) use 64-bit relfilenodes. I have
been reluctant to support #1 because it's hard for me to believe
that there aren't cases where being able to skip a whole lot of
WAL-logging works out to a nice performance win, but I realize opinions on
that topic vary. And I'm pretty sure that Andres, at least, will hate
#2 because he's unhappy with the width of buffer tags already. So I
don't really have a good idea. I agree this tombstone system is a bit
of a wart, but I'm not sure that this patch really makes anything any
better, and I'm not really seeing another idea that seems better
either.
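
For reference, the structures at issue look like this today (see
relfilenode.h and buf_internals.h):

    typedef struct RelFileNode
    {
        Oid         spcNode;        /* tablespace */
        Oid         dbNode;         /* database */
        Oid         relNode;        /* relation */
    } RelFileNode;                  /* 12 bytes */

    typedef struct buftag
    {
        RelFileNode rnode;          /* physical relation identifier */
        ForkNumber  forkNum;
        BlockNumber blockNum;       /* blknum relative to start of reln */
    } BufferTag;                    /* 20 bytes */

Widening relNode to 64 bits means, with alignment, a 16-byte
RelFileNode and a 24-byte BufferTag, and buffer tags get hashed and
compared in some of the hottest paths in the server.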

Maybe I am missing something...

--
Robert Haas
EDB: http://www.enterprisedb.com
