Re: Race conditions in 019_replslot_limit.pl

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-02-25 20:15:58
Message-ID: 20220225201558.iabxf6k7edohggo7@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-02-25 15:07:01 -0500, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > Seems to suggest something is holding a problematic lock in a way that I do not understand yet:
>
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
> > 2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 019_replslot_limit.pl LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( RESERVE_WAL)
> > ...
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 019_replslot_limit.pl DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 019_replslot_limit.pl DEBUG: replication slot exit hook, without active slot
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: begin
>
> > last message from 1997084 until the immediate shutdown.
>
> Hmm. Maybe put a couple more debug messages into ReplicationSlotCleanup
> and/or ReplicationSlotDropPtr? It doesn't seem very clear where in that
> sequence it's hanging up.

Yea, was thinking that as well.

I'm also wondering whether it's worth adding an assert, or at least a WARNING,
about no lwlocks held to the tail end of ShutdownPostgres? I don't want to
add an LWLockReleaseAll() yet, before I understand what's actually happening.

> > We could be more certain if we shut down the cluster in fast rather than
> > immediate mode. So I'm thinking of doing something like
>
> Does that risk an indefinite hangup of the buildfarm run?

I think not. The pg_ctl stop -m fast should time out after PGCTLTIMEOUT,
$self->_update_pid(-1); should notice it's not dead. The END{} block should
then shut it down in immediate mode.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2022-02-25 20:19:24 Re: [PATCH] Expose port->authn_id to extensions and triggers
Previous Message Tom Lane 2022-02-25 20:07:01 Re: Race conditions in 019_replslot_limit.pl