Re: failure in 019_replslot_limit

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: failure in 019_replslot_limit
Date: 2024-02-09 15:00:01
Message-ID: 8e038d1e-8126-c93b-f2a8-964bc8782d81@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Andres,

05.04.2023 21:48, Andres Freund wrote:
> I just saw the following failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-04-05%2017%3A47%3A03
> after a commit of mine. The symptoms look unrelated though.
>
> [17:54:42.188](258.345s) # poll_query_until timed out executing this query:
> # SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
> # expecting this output:
> # lost
> # last actual query output:
> # unreserved
> # with stderr:
> timed out waiting for slot to be lost at /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 400.
>
> We're expecting "lost" but are getting "unreserved".
>
> ...
>
> ISTM that this indicates that checkpointer got stuck after signalling
> 344783.
>
> Do you see any other explanation?
>

I've managed to reproduce this issue (which still persists:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-02-04%2001%3A53%3A44
) and saw that it's not checkpointer, but walsender is hanging:
[12:15:03.753](34.771s) ok 17 - have walsender pid 317885
[12:15:03.875](0.122s) ok 18 - have walreceiver pid 317884
[12:15:04.808](0.933s) ok 19 - walsender termination logged
...

Last essential messages in _primary3.log are:
2024-02-09 12:15:04.823 UTC [318036][not initialized][:0] LOG: connection received: host=[local]
2024-02-09 12:15:04.823 UTC [317885][walsender][3/0:0] FATAL: terminating connection due to administrator command
2024-02-09 12:15:04.823 UTC [317885][walsender][3/0:0] STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
(then the test just queries the slot state, there are no other messages
related to walsender)

And I see the walsender process still running (I've increased the timeout
to keep the test running and to connect to the process in question), with
the following stack trace:
#0  0x00007fe4feac3d16 in epoll_wait (epfd=5, events=0x55b279b70f38, maxevents=1, timeout=timeout(at)entry=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055b278b9ab32 in WaitEventSetWaitBlock (set=set(at)entry=0x55b279b70eb8, cur_timeout=cur_timeout(at)entry=-1,
occurred_events=occurred_events(at)entry=0x7ffda5ffac90, nevents=nevents(at)entry=1) at latch.c:1571
#2  0x000055b278b9b6b6 in WaitEventSetWait (set=0x55b279b70eb8, timeout=timeout(at)entry=-1,
occurred_events=occurred_events(at)entry=0x7ffda5ffac90, nevents=nevents(at)entry=1,
wait_event_info=wait_event_info(at)entry=100663297) at latch.c:1517
#3  0x000055b278a3f11f in secure_write (port=0x55b279b65aa0, ptr=ptr(at)entry=0x55b279bfbd08, len=len(at)entry=21470) at
be-secure.c:296
#4  0x000055b278a460dc in internal_flush () at pqcomm.c:1356
#5  0x000055b278a461d4 in internal_putbytes (s=s(at)entry=0x7ffda5ffad3c "E\177", len=len(at)entry=1) at pqcomm.c:1302
#6  0x000055b278a46299 in socket_putmessage (msgtype=<optimized out>, s=0x55b279b363c0 "SFATAL", len=112) at pqcomm.c:1483
#7  0x000055b278a48670 in pq_endmessage (buf=buf(at)entry=0x7ffda5ffada0) at pqformat.c:302
#8  0x000055b278d0c82a in send_message_to_frontend (edata=edata(at)entry=0x55b27908e500 <errordata>) at elog.c:3590
#9  0x000055b278d0cfe2 in EmitErrorReport () at elog.c:1716
#10 0x000055b278d0d17d in errfinish (filename=filename(at)entry=0x55b278eaa480 "postgres.c", lineno=lineno(at)entry=3295,
funcname=funcname(at)entry=0x55b278eaaef0 <__func__.16> "ProcessInterrupts") at elog.c:551
#11 0x000055b278bc41c9 in ProcessInterrupts () at postgres.c:3295
#12 0x000055b278b6c9af in WalSndLoop (send_data=send_data(at)entry=0x55b278b6c346 <XLogSendPhysical>) at walsender.c:2680
#13 0x000055b278b6cef1 in StartReplication (cmd=cmd(at)entry=0x55b279b733e0) at walsender.c:987
#14 0x000055b278b6d865 in exec_replication_command (cmd_string=cmd_string(at)entry=0x55b279b39c60 "START_REPLICATION SLOT
\"rep3\" 0/700000 TIMELINE 1") at walsender.c:2039
#15 0x000055b278bc7d71 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4649
#16 0x000055b278b2329d in BackendRun (port=port(at)entry=0x55b279b65aa0) at postmaster.c:4464
#17 0x000055b278b263ae in BackendStartup (port=port(at)entry=0x55b279b65aa0) at postmaster.c:4140
#18 0x000055b278b26539 in ServerLoop () at postmaster.c:1776
#19 0x000055b278b27ac5 in PostmasterMain (argc=argc(at)entry=4, argv=argv(at)entry=0x55b279b34180) at postmaster.c:1475
#20 0x000055b278a49ab0 in main (argc=4, argv=0x55b279b34180) at main.c:198

(gdb) frame 9
(gdb) print edata->message
$3 = 0x55b279b367d0 "terminating connection due to administrator command"

So it looks like walsender tries to send the message to walreceiver, which
is paused, and thus walsender gets stuck on it.

Best regards,
Alexander

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavlo Golub 2024-02-09 15:01:58 [PATCH] allow pg_current_logfile() execution under pg_monitor role
Previous Message Jim Jones 2024-02-09 13:19:13 Re: [PATCH] Add CANONICAL option to xmlserialize