Re: Race conditions in 019_replslot_limit.pl

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-02-17 01:22:23
Message-ID: 3518071.1645060943@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> I think the test is telling us that something may be broken. We shouldn't
> silence that without at least some understanding what it is.

I looked at the recent failure on komodoensis [1], and I think what is
happening is just that the walsender for the basebackup run (launched
at 019_replslot_limit.pl line 325) hasn't exited yet at the point where
we do a blind
"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'"
and expect that we're only going to see the walsender launched for
the standby at line 331. The two PIDs reported in the failure
correspond to this postmaster log trace:

2022-02-16 23:06:29.596 CET [620d7565.38dd62:1] LOG: connection received: host=[local]
2022-02-16 23:06:29.596 CET [620d7565.38dd62:2] LOG: replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-16 23:06:29.596 CET [620d7565.38dd62:3] LOG: received replication command: SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:4] STATEMENT: SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-16 23:06:29.596 CET [620d7565.38dd62:6] STATEMENT: CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-16 23:06:29.597 CET [620d7565.38dd62:7] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:8] STATEMENT: IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:9] LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_3726690" 0/600000 TIMELINE 1
2022-02-16 23:06:29.597 CET [620d7565.38dd62:10] STATEMENT: START_REPLICATION SLOT "pg_basebackup_3726690" 0/600000 TIMELINE 1

and this one:

2022-02-16 23:06:29.687 CET [620d7565.38dd6f:1] LOG: connection received: host=[local]
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:2] LOG: replication connection authorized: user=bf application_name=standby_3
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:3] LOG: received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:4] STATEMENT: IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:5] LOG: received replication command: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:6] STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1

There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...
shouldn't we fix that? But in any case, I don't see anything
interesting here, just a query that needs to be more selective.
Perhaps we can look for application_name=standby_3?

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2022-02-16%2021%3A16%3A04

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-02-17 01:28:02 Re: libpq async duplicate error results
Previous Message Peter Geoghegan 2022-02-17 01:16:13 Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM