Re: Drop replslot after pgstat_shutdown cause assert coredump

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Greg Nancarrow <gregn4422(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Drop replslot after pgstat_shutdown cause assert coredump
Date: 2021-10-21 17:10:21
Message-ID: a40241b7-05f2-206b-c0ae-5f7d7f34734a@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2021/10/11 22:15, Greg Nancarrow wrote:
> On Mon, Oct 11, 2021 at 6:55 PM houzj(dot)fnst(at)fujitsu(dot)com
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>>
>> I can see the walsender tried to release a not-quite-ready repliaction slot
>> that was created when create a subscription. But the pgstat has been shutdown
>> before invoking ReplicationSlotRelease().
>>
>> The stack is as follows:
>>
>> #2 in ExceptionalCondition (pgstat_is_initialized && !pgstat_is_shutdown)
>> #3 in pgstat_assert_is_up () at pgstat.c:4852
>> #4 in pgstat_send (msg=msg(at)entry=0x7ffe716f7470, len=len(at)entry=144) at pgstat.c:3075
>> #5 in pgstat_report_replslot_drop (slotname=slotname(at)entry=0x7fbcf57a3c98 "sub") at pgstat.c:1869
>> #6 in ReplicationSlotDropPtr (slot=0x7fbcf57a3c80) at slot.c:696
>> #7 in ReplicationSlotDropAcquired () at slot.c:585
>> #8 in ReplicationSlotRelease () at slot.c:482
>> #9 in ProcKill (code=<optimized out>, arg=<optimized out>) at proc.c:852
>> #10 in shmem_exit (code=code(at)entry=0) at ipc.c:272
>> #11 in proc_exit_prepare (code=code(at)entry=0) at ipc.c:194
>> #12 in proc_exit (code=code(at)entry=0) at ipc.c:107
>> #13 in ProcessRepliesIfAny () at walsender.c:1807
>> #14 in WalSndWaitForWal (loc=loc(at)entry=22087632) at walsender.c:1417
>> #15 in logical_read_xlog_page (state=0x2f8c600, targetPagePtr=22085632,
>> reqLen=, targetRecPtr=, cur_page=0x2f6c1e0 "\016\321\005") at walsender.c:821
>> #16 in ReadPageInternal (state=state(at)entry=0x2f8c600,
>> pageptr=pageptr(at)entry=22085632, reqLen=reqLen(at)entry=2000) at xlogreader.c:667
>> #17 in XLogReadRecord (state=0x2f8c600,
>> errormsg=errormsg(at)entry=0x7ffe716f7f98) at xlogreader.c:337
>> #18 in DecodingContextFindStartpoint (ctx=ctx(at)entry=0x2f8c240)
>> at logical.c:606
>> #19 in CreateReplicationSlot (cmd=cmd(at)entry=0x2f1aef0)
>>
>> Is this behavior expected ?
>>
>
> I'd say it's not!

Yes. I think this is a bug.

> Just looking at the stacktrace, I'm thinking that the following commit
> may have had a bearing on this problem, by causing pgstat to be
> shutdown earlier:
>
> commit fb2c5028e63589c01fbdf8b031be824766dc7a68
> Author: Andres Freund <andres(at)anarazel(dot)de>
> Date: Fri Aug 6 10:05:57 2021 -0700
>
> pgstat: Schedule per-backend pgstat shutdown via before_shmem_exit().
>
>
> Can you see if the problem can be reproduced prior to this commit?

Even in prior to the commit, pgstat_shutdown_hook() can be called
before ProcKill() at the backend exit, so ISTM that the problem can
be reproduced.

Probably we need to make sure that pgstat_shutdown_hook() is called
after ProcKill(), e.g., by registering pgstat_shutdown_hook() into
on_proc_exit_list (I'm not sure if this change is safe, though).
Or maybe pgstat logic for replication slot drop needs to be overhauled.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2021-10-21 17:14:40 Re: pgstat_assert_is_up() can fail in walsender
Previous Message Andreas Karlsson 2021-10-21 16:35:08 Re: Adding CI to our tree