Subscription tests fail under CLOBBER_CACHE_ALWAYS

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Subscription tests fail under CLOBBER_CACHE_ALWAYS
Date: 2021-05-18 23:42:08
Message-ID: 3382681.1621381328@sss.pgh.pa.us
Lists: pgsql-hackers

I discovered $SUBJECT after wondering why hyrax hadn't reported
in recently, and trying to run check-world under CCA to see if
anything got stuck. Indeed it did --- although this doesn't
explain the radio silence from hyrax, because that animal doesn't
run any TAP tests. (Neither does avocet, which I think is the
only other active CCA critter. So this could have been broken
for a very long time.)

I count three distinct bugs that were exposed by this attempt:

1. In the part of 013_partition.pl that tests firing AFTER
triggers on partitioned tables, we have a case of continuing
to access a relcache entry that's already been closed.
(I'm not quite sure why prion's -DRELCACHE_FORCE_RELEASE
hasn't exposed this.) It looks to me like we instead had
a relcache reference leak before f3b141c48, but now the
only relcache reference count on a partition child table
is dropped by ExecCleanupTupleRouting, which logical/worker.c
invokes before it fires triggers on that table. Kaboom.
This might go away if worker.c weren't so creatively different
from the other code paths concerned with executor shutdown.

2. Said bug causes a segfault in the apply worker process.
This causes the parent postmaster to give up and die.
I don't understand why we don't treat that like a crash
in a regular backend, considering that an apply worker
is running largely user-defined code.

3. Once the subscriber1 postmaster has exited, the TAP
test will eventually time out, and then this happens:

timed out waiting for catchup at t/013_partition.pl line 219.
### Stopping node "publisher" using mode immediate
# Running: pg_ctl -D /Users/tgl/pgsql/src/test/subscription/tmp_check/t_013_partition_publisher_data/pgdata -m immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "publisher"
### Stopping node "subscriber1" using mode immediate
# Running: pg_ctl -D /Users/tgl/pgsql/src/test/subscription/tmp_check/t_013_partition_subscriber1_data/pgdata -m immediate stop
pg_ctl: PID file "/Users/tgl/pgsql/src/test/subscription/tmp_check/t_013_partition_subscriber1_data/pgdata/postmaster.pid" does not exist
Is server running?
Bail out! system pg_ctl failed

That is, because we failed to shut down subscriber1, the
test script neglects to shut down subscriber2, and now
things just sit indefinitely. So that's a robustness
problem in the TAP infrastructure, rather than a bug in
PG proper; but I still say it's a bug that needs fixing.
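
To illustrate the robustness issue in point 3, here is a minimal sketch
of a teardown loop that keeps going after one node fails to stop. It is
not the actual TAP infrastructure (which is Perl); the Node class and
teardown_all function are hypothetical names invented for illustration,
and the failure is modeled simply as an exception raised when a node is
already down.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a test cluster node (not the real TAP API)."""
    name: str
    running: bool = True

    def stop(self) -> None:
        # Models pg_ctl failing because the postmaster has already exited.
        if not self.running:
            raise RuntimeError(f'no PID file for node "{self.name}"')
        self.running = False


def teardown_all(nodes):
    """Try to stop every node; collect failures instead of bailing out early."""
    failures = []
    for node in nodes:
        try:
            node.stop()
        except RuntimeError as exc:
            # Record the problem, but continue so later nodes still get stopped.
            failures.append((node.name, str(exc)))
    return failures


# subscriber1 has already died, as in the scenario described above.
nodes = [Node("publisher"), Node("subscriber1", running=False), Node("subscriber2")]
failures = teardown_all(nodes)
```

With this shape, the failure to stop subscriber1 is still reported in
failures, but subscriber2 is shut down rather than left running.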

regards, tom lane
