Re: Add a perl function in Cluster.pm to generate WAL

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Euler Taveira <euler(at)eulerto(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Subject: Re: Add a perl function in Cluster.pm to generate WAL
Date: 2024-01-07 07:10:50
Message-ID: ZZpOenv6gv8TdZLE@paquier.xyz
Lists: pgsql-hackers

On Fri, Jan 05, 2024 at 11:00:00PM +0300, Alexander Lakhin wrote:
> Your suspicion was proved right. After
> git show c161ab74f src/test/recovery/t/035_standby_logical_decoding.pl  | git apply -R
> 20 iterations with 20 tests in parallel performed successfully for me
> (twice).
>
> So it looks like c161ab74f really made the things worse.

We have two different failures here, one when VACUUM fails for a
shared relation:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-01-03%2017%3A09%3A27
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-01-01%2020%3A10%3A18

And the second failure happens for VACUUM FULL with a shared relation:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-01-03%2020%3A07%3A15

In the second case, the VACUUM FULL happens *BEFORE* the new
advance_wal(), making c161ab74f unrelated, no?

Anyway, if one looks at the buildfarm logs, this failure is older
than c161ab74f. There are many occurrences before that commit, some
reported back in October:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-19%2000%3A44%3A58
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-30%2013%3A39%3A20

On the contrary, I suspect that c161ab74f may actually be helping
here, because we've switched the CREATE TABLE/INSERT queries to not
use a snapshot anymore, reducing the reasons why a slot conflict would
happen. Or maybe that's just a matter of luck because the test is
racy anyway.

Anyway, this has the smell of a legit bug to me. I am also a bit
dubious about the choice of pg_authid as the shared catalog for the
slot invalidation check. Isn't that potentially racy with the scans
we may do on it at connection startup? Perhaps something else should
be chosen, like pg_shdescription, as it is non-critical? I am adding
Bertrand and Andres in CC, as author and committer behind
befcd77d53217b.
--
Michael
