Re: Race conditions in 019_replslot_limit.pl

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-02-18 03:41:40
Message-ID: 20220218034140.t7sprel3qpghzu4t@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-02-17 16:34:34 -0800, Andres Freund wrote:
> I've done this for a couple hundred iterations, under load, subsequently with
> additional assertions, without being able to reproduce.

Playing around with this further I did get into one interesting state:

I started psql with replication=1, created a slot
SELECT pg_drop_replication_slot('t');SELECT pg_create_physical_replication_slot('t', true);
which was at 0/200002C8 at this point
START_REPLICATION SLOT "t" 0/200002C8 TIMELINE 1
then did
SELECT pg_switch_wal();checkpoint;
in another session, which tried to invalidate that slot because of
max_slot_wal_keep_size=1MB

At this point the psql "walsender" is *not* terminated, because the fatal
message is waiting behind all the WAL data sent. Which also means that the
slot isn't yet released.

At this point checkpointer is stuck, because it is waiting for the connection
to end.

Quitting the "psql walsender" or terminating it again resolves the issue.

Not sure how this could be related, but it seems interesting.

"psql walsender":

(gdb) bt
#0 0x00007faf4a570ec6 in epoll_wait (epfd=4, events=0x7faf4c201458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007faf4b8ced5c in WaitEventSetWaitBlock (set=0x7faf4c2013e0, cur_timeout=-1, occurred_events=0x7ffe47df2320, nevents=1)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
#2 0x00007faf4b8cebe3 in WaitEventSetWait (set=0x7faf4c2013e0, timeout=-1, occurred_events=0x7ffe47df2320, nevents=1, wait_event_info=100663297)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
#3 0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50, ptr=0x7faf4c2f1210, len=21470) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:298
#4 0x00007faf4b71aecb in internal_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1380
#5 0x00007faf4b71ada1 in internal_putbytes (s=0x7ffe47df23dc "E\177", len=1) at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1326
#6 0x00007faf4b71b0cf in socket_putmessage (msgtype=69 'E', s=0x7faf4c201700 "SFATAL", len=112)
at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1507
#7 0x00007faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at /home/andres/src/postgresql/src/backend/libpq/pqformat.c:301
#8 0x00007faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253
#9 0x00007faf4bab8aa0 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541
#10 0x00007faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts")
at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
#11 0x00007faf4b907e73 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3192
#12 0x00007faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 <XLogSendPhysical>) at /home/andres/src/postgresql/src/backend/replication/walsender.c:2404
#13 0x00007faf4b88f82e in StartReplication (cmd=0x7faf4c293fc0) at /home/andres/src/postgresql/src/backend/replication/walsender.c:834
#14 0x00007faf4b89136f in exec_replication_command (cmd_string=0x7faf4c2073c0 "START_REPLICATION SLOT \"t\" 0/1B000000 TIMELINE 1\n;")
at /home/andres/src/postgresql/src/backend/replication/walsender.c:1771
#15 0x00007faf4b909842 in PostgresMain (dbname=0x7faf4c22fce8 "", username=0x7faf4c22fcc8 "andres")
at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4494
#16 0x00007faf4b830e7e in BackendRun (port=0x7faf4c22da50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4593
#17 0x00007faf4b83075f in BackendStartup (port=0x7faf4c22da50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4321
#18 0x00007faf4b82c55b in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1801
#19 0x00007faf4b82bd71 in PostmasterMain (argc=49, argv=0x7faf4c1ff1e0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1473

checkpointer:
(gdb) bt
#0 0x00007f0d4ee1fec6 in epoll_wait (epfd=10, events=0x7f0d51489b78, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007f0d5017dd5c in WaitEventSetWaitBlock (set=0x7f0d51489b18, cur_timeout=-1, occurred_events=0x7fffe7c5f410, nevents=1)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
#2 0x00007f0d5017dbe3 in WaitEventSetWait (set=0x7f0d51489b18, timeout=-1, occurred_events=0x7fffe7c5f410, nevents=1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
#3 0x00007f0d5017cec4 in WaitLatch (latch=0x7f0d4c27db04, wakeEvents=33, timeout=-1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:472
#4 0x00007f0d5018fe34 in ConditionVariableTimedSleep (cv=0x7f0d4c342268, timeout=-1, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:163
#5 0x00007f0d5018fd6c in ConditionVariableSleep (cv=0x7f0d4c342268, wait_event_info=134217772)
at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:100
#6 0x00007f0d50132525 in InvalidatePossiblyObsoleteSlot (s=0x7f0d4c342188, oldestLSN=553648128, invalidated=0x7fffe7c5f577)
at /home/andres/src/postgresql/src/backend/replication/slot.c:1290
#7 0x00007f0d501326af in InvalidateObsoleteReplicationSlots (oldestSegno=33) at /home/andres/src/postgresql/src/backend/replication/slot.c:1359
#8 0x00007f0d4fd84072 in CreateCheckPoint (flags=108) at /home/andres/src/postgresql/src/backend/access/transam/xlog.c:6620
#9 0x00007f0d500cd278 in CheckpointerMain () at /home/andres/src/postgresql/src/backend/postmaster/checkpointer.c:445
#10 0x00007f0d500cad7a in AuxiliaryProcessMain (auxtype=CheckpointerProcess) at /home/andres/src/postgresql/src/backend/postmaster/auxprocess.c:153
#11 0x00007f0d500e034c in StartChildProcess (type=CheckpointerProcess) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5531
#12 0x00007f0d500dacf1 in PostmasterMain (argc=49, argv=0x7f0d514891e0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1458
#13 0x00007f0d4ffcfc19 in main (argc=49, argv=0x7f0d514891e0) at /home/andres/src/postgresql/src/backend/main/main.c:202

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-02-18 04:11:53 Re: Race conditions in 019_replslot_limit.pl
Previous Message Fujii Masao 2022-02-18 03:22:32 Re: Fix CheckIndexCompatible comment