Re: global / super barriers (for checksums)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Daniel Gustafsson <daniel(at)yesql(dot)se>
Subject: Re: global / super barriers (for checksums)
Date: 2020-01-03 20:30:47
Message-ID: CA+TgmobORrsgSUydZ3MsSw9L5MBUGz7jRK+973uPZgiyCQ81ag@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Dec 9, 2019 at 10:42 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > Hm. In my mental model it would be useful for barrier "processors" to
> > not acknowledge the state change at certain points. Imagine e.g. needing
> > to efficiently wait till all backends have processed a config file
> > reload - since we don't reload while a query is being processed, we
> > should be able to not ack the barrier at that point.
>
> Yeah, you mentioned this before, but I think we could leave it for
> later. I think it's going to be complex. I suppose the way to do it
> would be to add an argument to ProcessProcSignalBarrier() that lets
> you specify which barrier events you're OK with processing from the
> current point in the code. However, that will mean that whenever
> somebody adds a new barrier type, they have got to go through all of
> those callers and think about whether they should add their new
> barrier type into the flags or not. If we try to do the specific thing
> you're talking about with config-file processing, it will also mean
> that we could be waiting MUCH longer for acknowledgements, which I
> think might have pretty far-reaching ramifications. You might get
> stuck waiting for that barrier to be absorbed for hours, and that
> would also impact later barriers, since you won't see the generation
> bump to 42 until it goes through 40 and 41. That sounds fairly awful.
> You could try to work around it by looking at which barrier flags are
> set, but I think that has ABA problems.

I might have rejected this idea too hastily. At any rate, I have a new
view on it having studied the issue a bit more.

In the case of ALTER SYSTEM READ ONLY, the big problem is that we
can't afford to find out that we're unable to emit WAL after we've
already entered a critical section, because "ERROR: system is read
only" or similar will get promoted to PANIC due to the critical
section and that will be sad. (Before someone asks, not promoting the
ERROR to PANIC would be unsafe, even for this specific case.) Some WAL
records don't use a critical section, though it's recently been
proposed[1] that we add a critical section in at least some of the
cases that currently lack one. For those, I think we can just upgrade
the existing test-and-elog cases in XLogInsert() and XLogBeginInsert()
to ereport() and call it good. But the cases that do have a critical
section are harder.

Currently, I see only the following two options:

(1) Check just before entering a critical section whether or not the
system is read only, and if so, error out. Once we've entered a
critical section, ProcessInterrupts() will not do anything, so at that
point we're safe. This could be done either by making
START_CRIT_SECTION() itself do it or by finding every critical section
that is used for inserting WAL and adding a call just beforehand. The
latter is probably better, as some critical sections appear to be used
for other purposes; see PGSTAT_BEGIN_WRITE_ACTIVITY in particular. But
it means changing things in a bunch of places, adding an if-test. That
wouldn't add *much* overhead, but it's not quite zero.

(2) Accept barrier events for read-only/read-write changes only at
command boundaries. In that case, we only need to check for a read
only state around the places where we currently do
PreventCommandIfReadOnly and similar, and the additional overhead
should be basically nil. However, this seems pretty unappealing to me,
because it means that if you try to make the system READ ONLY, it
might run for hours before actually going read only. If you're trying
to make the system read only because the master has become isolated
from the rest of your network, that's not a fast enough response.

In general, I think the problem here is to avoid TOCTTOU problems. I
think the checksums patch has this problem too. For instance, the
changes to basebackup.c in that function check
DataChecksumsNeedVerify() only once per file, but what is to say that
a call to ProcessInterrupts() couldn't happen in the middle of sending
a file, changing prevailing value? If someone sets checksums off, and
everyone acknowledges that they're off, then backends may begin
writing blocks without setting checksums, and then this code will
potentially try to verify checksums in a block written without them.
That patch also modifies the definition of XLogHintBitIsNeeded(), so
e.g. log_heap_visible is wrong if a CHECK_FOR_INTERRUPTS() can happen
in XLogRegisterBuffer(). I don't think it can currently, but it seems
like a heck of a fragile assumption, and I don't see how we can be
sure that there's no test for XLogHintBitIsNeeded() that does
CHECK_FOR_INTERRUPTS() between where it's tested and where it does
something critical with the result of that test.

At the moment, the ALTER SYSTEM READ ONLY problem appears to me to be
the lesser of the two (although I may be wrong). Assuming people are
OK with inserting an additional check before each xlog-writing
critical section, we can just go do that and then I think the problem
is pretty much solved. The only way there would be a residual problem,
I think, is if something could change between the time XLogInsert()
checks whether xlog writing is allowed and the time when it does the
insert. Even if that's an issue, it can be fixed with a
HOLD_INTERRUPTS/RESUME_INTERRUPTS in just that function. The meaning
of READ ONLY is just that xlog insertion is prohibited, so the only
use of the value is for allowing XLogInsert() to go ahead or not. But
for checksums, there's not such a clear definition of what it means to
use the value, nor does it get used in only one place, so the
situation seems less clear.

In short, I am not still not sure that we need the facility Andres was
asking for here, but I do now understand better why Andres was worried
about having this capability, and I do think that patches using it are
going to need to be very carefully studied for TOCTTOU problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] http://postgr.es/m/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-01-03 20:39:39 Re: Assigning ROW variable having NULL value to RECORD type variable doesn't give any structure to the RECORD variable.
Previous Message Jeff Janes 2020-01-03 20:25:47 Re: color by default