Re: Race conditions in 019_replslot_limit.pl

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-03-27 21:32:19
Message-ID: 20220327213219.smdvfkq2fl74flow@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-03-27 15:28:05 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > How about the attached variation, which retries (for 15s, with 100ms sleeps)
> > if there are multiple walsenders, printing the debugging info each time? It'll
> > still fail the test if later iterations find just one walsender, which seems
> > the right behaviour for now.
>
> We have now four instances of failures with this version of the test,
> and in every case the second iteration succeeded. Is that enough
> evidence yet?

I still feel like there's something off here. But that's probably not enough
to keep causing failures. I'm inclined to leave the debugging in for a bit
longer, but not fail the test anymore?

The way the temporary slot removal hangs for a while seems just off:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-03-27%2017%3A04%3A18
2022-03-27 13:30:56.993 EDT [2022-03-27 13:30:56 EDT 750695:20] 019_replslot_limit.pl DEBUG: replication slot drop: pg_basebackup_750695: removed on-disk
...
2022-03-27 13:30:57.456 EDT [2022-03-27 13:30:57 EDT 750759:3] [unknown] LOG: connection authorized: user=andrew database=postgres application_name=019_replslot_limit.pl
2022-03-27 13:30:57.466 EDT [2022-03-27 13:30:57 EDT 750759:4] 019_replslot_limit.pl LOG: statement: SELECT * FROM pg_stat_activity
.
2022-03-27 13:30:57.474 EDT [2022-03-27 13:30:56 EDT 750679:87] DEBUG: server process (PID 750759) exited with exit code 0
2022-03-27 13:30:57.507 EDT [2022-03-27 13:30:56 EDT 750695:21] 019_replslot_limit.pl DEBUG: replication slot drop: pg_basebackup_750695: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2009%3A00%3A09
2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819: begin
2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819: removed on-disk
...
2022-03-25 10:13:31.038 CET [4022841][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 10:13:31.039 CET [4022841][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 10:13:31.045 CET [4022807][postmaster][:0][] DEBUG: server process (PID 4022841) exited with exit code 0
2022-03-25 10:13:31.056 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-03-25%2008%3A02%3A05
2022-03-25 09:15:20.558 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_3730425: removed on-disk
...
2022-03-25 09:15:20.803 CET [3730461][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 09:15:20.804 CET [3730461][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 09:15:20.834 CET [3730381][postmaster][:0][] DEBUG: server process (PID 3730461) exited with exit code 0
2022-03-25 09:15:20.861 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_3730425: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2003%3A00%3A18
2022-03-25 04:14:03.025 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_2674398: removed on-disk
...
2022-03-25 04:14:03.264 CET [2674463][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 04:14:03.265 CET [2674463][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 04:14:03.270 CET [2674384][postmaster][:0][] DEBUG: server process (PID 2674463) exited with exit code 0
...
2022-03-25 04:14:03.324 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_2674398: done

We are able to start / finish several new connections between the two debug
elog()sin ReplicationSlotDropPtr()?

I wonder if there's something odd going on with ConditionVariableBroadcast().

Might be useful to add another debug message before/after
ConditionVariableBroadcast() and rmtree(). If the delay is due to rmtree()
being slow under concurrent tests I'd feel less concerned (although that
machine has dual NVMe drives...).

I really wish I could reproduce the failure. I went through a few hundred
cycles of that test in a separate checkout on that machine.

> I'd like to silence this noise so that we can start tracking
> lower-probability failure modes, like say these:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2022-03-26%2002%3A59%3A03

That one was a missing compiler flag. I see that failure reproducibly locally
when using asan with clang, unless I use -fsanitize-address-use-after-return=never. gcc has a different default
for the option, which is why I hadn't configured it.

If I understand correctly, the problem is that
-fsanitize-address-use-after-return uses an alternative stack. Sometimes our
stack depths functions get called with the "proper" stack, and sometimes with
a "shadow" stack. Which breaks our stack depth checking.

> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-03-26%2015%3A53%3A51

This one I have no idea about yet. I assume it's just a race in a new test...

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-03-27 21:36:14 Re: Race conditions in 019_replslot_limit.pl
Previous Message Tom Lane 2022-03-27 21:19:27 Re: [PATCH] Accept IP addresses in server certificate SANs