LISTEN/NOTIFY testing woes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Mark Dilger <hornschnorter(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)gmail(dot)com>
Subject: LISTEN/NOTIFY testing woes
Date: 2019-11-24 01:01:42
Message-ID: 13881.1574557302@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I ran into a couple of issues while trying to devise a regression test
illustrating the LISTEN-in-serializable-transaction issue Mark Dilger
reported. The first one is that an isolation test in which we expect
to see a cross-process NOTIFY immediately after a COMMIT turns out to
be not very stable: on my machine, it works as long as you're just
running the isolation tests by themselves, but it usually falls over
if I'm running check-world with any amount of parallelism. The reason
for this seems to be that incoming notifies are only checked for when
we're about to wait for client input. At that point we've already
sent the ReadyForQuery ('Z') protocol message, which will cause libpq
to stand down from looking for more input and return a null from
PQgetResult(). Depending on timing, the following Notify protocol
messages might arrive quickly enough that isolationtester.c sees them
before it goes off to do something else, but that's not very reliable.

In the case of self-notifies, postgres.c ensures that those get
transmitted to the frontend *before* ReadyForQuery, and this is what
makes self-notify cases stable enough to survive buildfarm testing.

I'm a bit surprised, now that I've seen this effect, that the existing
cross-session notify tests in async-notify.spec haven't given us
problems in the buildfarm. (Maybe, now that I just pushed those into
the back branches, we'll start to see some failures?) Anyway, what
I propose to do about this is patch 0001 attached, which tweaks
postgres.c to ensure that any cross-session notifies that arrived
during the just-finished transaction are also guaranteed to be sent
to the client before, not after, ReadyForQuery.

Another thing that I discovered while testing this is that as of HEAD,
you can't run "make installcheck" for the isolation tests more than
once without restarting the server. If you do, you get a test result
mismatch because the async-notify test's first invocation of
pg_notification_queue_usage() returns a positive value. Which is
entirely unsurprising, because the previous iteration ensured that
it would, and we've done nothing to make the queue tail advance since
then.

This seems both undesirable for our own testing, and rather bogus
from users' standpoints as well. However, I think a simple fix is
available: just make the SQL pg_notification_queue_usage() function
advance the queue tail before measuring, as in 0002 below. This will
restore the behavior of that function to what it was before 51004c717,
and it doesn't seem like it'd cost any performance in any plausible
use-cases.

0002 is only needed in HEAD, but I'll have to back-patch 0001 as
far as 9.6, to support a test case for the problem Mark discovered
and to ensure that back-patching b10f40bf0 doesn't cause any issues.

BTW, the fix and test case for Mark's issue look like 0003. Without
the 0001 patch, it's unstable exactly when the "listener2: NOTIFY "c1"
with payload "" from notifier" message comes out. But modulo that
issue, this test case reliably shows the assertion failure in the
back branches.

regards, tom lane

Attachment Content-Type Size
0001-stabilize-notify-report-timing.patch text/x-diff 748 bytes
0002-stabilize-queue-usage-results.patch text/x-diff 467 bytes
0003-fix-serializable-listen.patch text/x-diff 6.2 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Steve Singer 2019-11-24 01:10:51 Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Previous Message Andrew Dunstan 2019-11-23 21:34:05 Re: backup manifests