Re: Timeout failure in 019_replslot_limit.pl

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Timeout failure in 019_replslot_limit.pl
Date: 2021-09-22 07:27:39
Message-ID: YUra61XH96/jnqf3@paquier.xyz

On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:
> On 2021-Sep-20, Michael Paquier wrote:
>> The test gets the right PIDs, as the logs showed:
>> ok 17 - have walsender pid 12663
>> ok 18 - have walreceiver pid 12662
>
> As I understood, Horiguchi-san's point isn't that the PIDs might be
> wrong -- the point is to make sure that the process is in state T before
> moving on to the next step in the test.

I have spent more time on this issue, as it looks like I am the only
one with an environment able to reproduce it (Big Sur 11.6).

As far as I can see, the states of the WAL sender and receiver are
fine: after adding some extra debugging with ps called from the test
itself, I have checked that they are correctly SIGSTOP'd or SIGCONT'd
even when a failure shows up.
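
For what it's worth, the ps debugging boils down to a small helper
like the following (hypothetical, not in the committed test; "T" is
the state of a process stopped by SIGSTOP, though ps output
formatting is platform-dependent, so take this as a sketch):

sub process_state
{
	my ($pid) = @_;
	# "state" is a one-letter symbolic state on both macOS and
	# Linux; "T" while the process is stopped by SIGSTOP.
	my $state = `ps -o state= -p $pid`;
	$state =~ s/\s//g;
	return substr($state, 0, 1);
}

Polling that until it returns "T" for the WAL sender would also
address Horiguchi-san's ordering point quoted above.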

A run that passes goes through the following steps (a rough TAP-level
sketch follows the list):
- Start a checkpoint.
- SIGSTOP sent to the WAL sender and receiver.
- Advance WAL (CREATE TABLE, DROP TABLE and pg_switch_wal()).
- Check that the WAL sender is stopped.
- SIGCONT on the WAL sender.
- Invalidate the slot, checkpoint completes.
- Check the state of pg_replication_slots.
- Check that the slot invalidation happened in the logs.
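
For reference, the middle of that sequence looks roughly like this at
the TAP level ($node_primary3, $senderpid and $receiverpid being the
objects the test already has; a sketch, not the test verbatim):

kill 'STOP', $senderpid, $receiverpid;
# Advance WAL while both processes are stopped.
$node_primary3->safe_psql('postgres',
	"CREATE TABLE tt (); DROP TABLE tt; SELECT pg_switch_wal();");
# The WAL sender should still be in state T here; once that is
# verified, let it go so the checkpoint can invalidate the slot.
kill 'CONT', $senderpid;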

A failed run starts a checkpoint but never completes it: by the time
we reach the check on pg_replication_slots the checkpoint is still in
progress, so the slot is never invalidated and the test remains
stuck. I have also played a bit with the test and moved around the
"slot invalidation logged" check that watches the logs, and it fails
to find the slot invalidation, again as a result of the checkpoint
not finishing.

To keep the instance around for debugging, I have just launched an
extra checkpoint after the SIGCONT sent to the WAL sender; it remains
stuck behind the first one:
kill 'CONT', $senderpid;
+$node_primary3->safe_psql('postgres', 'checkpoint;');
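
To find the process to attach to, something like this works, since
pg_stat_activity lists the auxiliary processes (a sketch of what I
did before attaching the debugger):

my $checkpointer_pid = $node_primary3->safe_psql('postgres',
	"SELECT pid FROM pg_stat_activity"
	. " WHERE backend_type = 'checkpointer';");
# Then attach with e.g. "lldb -p $checkpointer_pid".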

With that, I am able to grab the checkpointer of primary3 to see where
it is waiting:
* frame #0: 0x00007fff204f8c4a libsystem_kernel.dylib`kevent + 10
  frame #1: 0x0000000105a81a43 postgres`WaitEventSetWaitBlock(set=0x00007fb87f008748, cur_timeout=-1, occurred_events=0x00007ffeea765400, nevents=1) at latch.c:1601:7
  frame #2: 0x0000000105a80fd0 postgres`WaitEventSetWait(set=0x00007fb87f008748, timeout=-1, occurred_events=0x00007ffeea765400, nevents=1, wait_event_info=134217769) at latch.c:1396:8
  frame #3: 0x0000000105a80b46 postgres`WaitLatch(latch=0x00000001069ae7a4, wakeEvents=33, timeout=-1, wait_event_info=134217769) at latch.c:473:6
  frame #4: 0x0000000105a97011 postgres`ConditionVariableTimedSleep(cv=0x00000001069d8860, timeout=-1, wait_event_info=134217769) at condition_variable.c:163:10
  frame #5: 0x0000000105a96f32 postgres`ConditionVariableSleep(cv=0x00000001069d8860, wait_event_info=134217769) at condition_variable.c:100:9
  frame #6: 0x0000000105a299cf postgres`InvalidatePossiblyObsoleteSlot(s=0x00000001069d8780, oldestLSN=8388608, invalidated=0x00007ffeea76559f) at slot.c:1264:4
  frame #7: 0x0000000105a296bd postgres`InvalidateObsoleteReplicationSlots(oldestSegno=8) at slot.c:1333:7
  frame #8: 0x00000001055edbe6 postgres`CreateCheckPoint(flags=192) at xlog.c:9275:6
  frame #9: 0x00000001059b753d postgres`CheckpointerMain at checkpointer.c:448:5
  frame #10: 0x00000001059b470d postgres`AuxiliaryProcessMain(auxtype=CheckpointerProcess) at auxprocess.c:153:4
  frame #11: 0x00000001059c8912 postgres`StartChildProcess(type=CheckpointerProcess) at postmaster.c:5498:3
  frame #12: 0x00000001059c68fe postgres`PostmasterMain(argc=4, argv=0x00007fb87e505400) at postmaster.c:1458:21
  frame #13: 0x000000010589e1bf postgres`main(argc=4, argv=0x00007fb87e505400) at main.c:198:3
  frame #14: 0x00007fff20544f3d libdyld.dylib`start + 1

So there is really something fishy here IMO, something other than
just a test mis-design; it looks like a race condition, perhaps
around InvalidateObsoleteReplicationSlots().
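
A cheap way to double-check that state without a debugger would be to
look at the checkpointer's wait event (again a sketch, not part of
the test):

my $wait = $node_primary3->safe_psql('postgres',
	"SELECT wait_event_type, wait_event FROM pg_stat_activity"
	. " WHERE backend_type = 'checkpointer';");
# When stuck in InvalidatePossiblyObsoleteSlot(), this reports an
# IPC-class wait on the condition variable sleep.
note "checkpointer is waiting on: $wait";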

Thoughts?
--
Michael
