| From: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> | 
|---|---|
| To: | Alexander Lakhin <exclusion(at)gmail(dot)com> | 
| Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> | 
| Subject: | Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed | 
| Date: | 2024-01-11 14:58:24 | 
| Message-ID: | ZaACEPtVovKlRVUf@ip-10-97-1-34.eu-west-3.compute.internal | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Hi,
On Thu, Jan 11, 2024 at 01:00:00PM +0300, Alexander Lakhin wrote:
> Hi Bertrand,
> 
> 10.01.2024 19:32, Bertrand Drouvot wrote:
> > 
> > > is an absent message 'obsolete replication slot "row_removal_activeslot"'
> > Looking at recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:
> > 
> > Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().
> > 
> > In case of an active slot we first call ReportSlotInvalidation with the second
> > parameter set to true (to emit the "terminating" message), then SIGTERM the active
> > process and then (later) we should call the other ReportSlotInvalidation()
> > call with the second parameter set to false (to emit the message that we don't
> > see here).
> > 
> > So it means InvalidatePossiblyObsoleteSlot() did not trigger the second ReportSlotInvalidation()
> > call.
> 
> I've found a way to reproduce the issue without slowing down a machine or
> running multiple tests in parallel. It's enough for this to add a delay to
> allow BackgroundWriterMain() to execute LogStandbySnapshot():
> @@ -692,2 +690,3 @@
>  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
> +$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(15);]);
>  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
> 
> With this delay, I get the failure immediately:
> $ PROVE_TESTS="t/035*" TEMP_CONFIG=/tmp/extra.config make check -s -C src/test/recovery
> # +++ tap check in src/test/recovery +++
> t/035_standby_logical_decoding.pl .. 47/?
> #   Failed test 'activeslot slot invalidation is logged with on-access pruning'
> #   at t/035_standby_logical_decoding.pl line 227.
Thanks a lot for the testing!
So I think we have 2 issues here:
1) The one you're mentioning above related to the on-access pruning test:
I think the engine behavior is right here and that the test is racy. I'm
proposing to bypass the active slot invalidation check for this particular test (
as I don't see any "easy" way to take care of this race condition). The active slot
invalidation is already well covered in the other tests anyway. 
I'm proposing the attached v4-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch
for it.
2) The fact that sometime we're getting a termination message which is not followed
by an obsolete one (like as discussed in [1]).
For this one, I think that InvalidatePossiblyObsoleteSlot() is racy:
In case of an active slot we proceed in 2 steps:
 - terminate the backend holding the slot
 - report the slot as obsolete
This is racy because between the two we release the mutex on the slot, which
means the effective_xmin and effective_catalog_xmin could advance during that time.
I'm proposing the attached v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
for it.
Would it be possible to re-launch your repro (the slow one, not the pg_sleep() one)
with bot patch applied and see how it goes? (Please note that v4 replaces v3 that
you're already using in your tests).
If it helps, I'll propose v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
into a dedicated hackers thread.
[1]: https://www.postgresql.org/message-id/ZZ7GpII4bAYN%2BjT5%40ip-10-97-1-34.eu-west-3.compute.internal
Regards,
-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
| Attachment | Content-Type | Size | 
|---|---|---|
| v4-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch | text/x-diff | 9.9 KB | 
| v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch | text/x-diff | 2.6 KB | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2024-01-11 15:01:16 | Re: A recent message added to pg_upgade | 
| Previous Message | Jonathan S. Katz | 2024-01-11 14:47:33 | Re: heavily contended lwlocks with long wait queues scale badly |