Re: Race conditions in 019_replslot_limit.pl

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-02-25 19:29:41
Message-ID: 20220225192941.hqnvefgdzaro6gzg@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-02-22 18:06:24 -0800, Andres Freund wrote:
> On 2022-02-18 15:14:15 -0800, Andres Freund wrote:
> > I'm running out of ideas for how to try to reproduce this. I think we might
> > need some additional debugging information to get more information from the
> > buildfarm.
>
> > I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing --verbose
> > to pg_basebackup in $node_primary3->backup(...).
> >
> > It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
> > ReplicationSlotCleanup(), InvalidateObsoleteReplicationSlots().
>
> Planning to commit something like the attached.

This did provide us with a bit more detail.

It seems to suggest that something is holding a problematic lock in a way I do not yet understand:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 019_replslot_limit.pl LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( RESERVE_WAL)
...
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 019_replslot_limit.pl DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 019_replslot_limit.pl DEBUG: replication slot exit hook, without active slot
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: begin

That is the last message from 1997084 until the immediate shutdown. Just
checking for temporary replication slots that need to be dropped requires
ReplicationSlotControlLock; actually dropping one additionally requires
ReplicationSlotAllocationLock.

1997084 does have a temporary replication slot to clean up.
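
For reference, here is roughly what that cleanup path does lock-wise. This is
a condensed sketch from memory, not verbatim source; the per-slot spinlock and
error handling are elided:

restart:
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	for (int i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

		/* only this backend's (temporary) slots are of interest */
		if (s->in_use && s->active_pid == MyProcPid)
		{
			/* can't hold the control lock while dropping the slot */
			LWLockRelease(ReplicationSlotControlLock);

			/* the drop path takes the allocation lock exclusively */
			LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
			/* ... remove the slot's directory, mark it !in_use ... */
			LWLockRelease(ReplicationSlotAllocationLock);

			goto restart;
		}
	}
	LWLockRelease(ReplicationSlotControlLock);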

2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:35] 019_replslot_limit.pl DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:36] 019_replslot_limit.pl DEBUG: replication slot exit hook, without active slot
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:37] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: begin
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:38] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: done

1997083 succeeds in doing the cleanup. It does not have a temporary
replication slot, so it only needed ReplicationSlotControlLock. That makes it
look like ReplicationSlotAllocationLock somehow hasn't been released.

2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:39] 019_replslot_limit.pl DEBUG: shmem_exit(0): 7 on_shmem_exit callbacks to make
...
2022-02-23 09:09:53.076 EST [2022-02-23 09:09:52 EST 1997072:87] LOG: received immediate shutdown request
...
2022-02-23 09:09:53.095 EST [2022-02-23 09:09:52 EST 1997072:90] DEBUG: server process (PID 1997084) exited with exit code 2

It's *possible*, but not likely, that somehow 1997084 just doesn't get
scheduled for a prolonged period of time.

We could be more certain if we shut down the cluster in fast rather than
immediate mode. So I'm thinking of doing something like this:

# We've seen occasional cases where multiple walsender pids are active. An
# immediate shutdown may hide evidence of a locking bug. So if multiple
# walsenders are observed, collect some more information and shut down in
# fast mode.
if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
{
	my ($stdout, $stderr);
	$node_primary3->psql('postgres',
		"\\a\\t\nSELECT * FROM pg_stat_activity",
		stdout => \$stdout, stderr => \$stderr);
	diag $stdout, $stderr;
	$node_primary3->stop('fast');
	$node_standby3->stop('fast');
	die "could not determine walsender pid, can't continue";
}

Does that make sense? Better ideas?

Greetings,

Andres Freund
