Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Date: 2024-01-19 09:03:01
Message-ID: Zao6xfSQUry5Fkdn@ip-10-97-1-34.eu-west-3.compute.internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
> Hello,
>
> 15.01.2024 12:00, Alexander Lakhin wrote:
> > If this approach looks promising to you, maybe we could add a submodule to
> > perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
> > 019_replslot_limit) as well.
> >
> > Personally I think that having such a functionality for using in tests
> > might be useful not only to avoid some "problematic" behaviour but also to
> > test the opposite cases.
>
> After spending a few days on it, I've discovered two more issues:
> https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com
>
> (The latter is not related to bgwriter directly, but it was discovered
> thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)
>
> So it becomes clear that the 035 test is not the only one, which might
> suffer from bgwriter's activity,

Yeah... thanks for sharing!

> and inventing a way to stop bgwriter/
> control it is a different subject, getting out of scope of the current
> issue.

Agree.

> 15.01.2024 11:49, Bertrand Drouvot wrote:
> > We did a few things in this thread, so to sum up what we've discovered:
> >
> > - a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
> > - we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
> > ( proposal in in v6 attached)
> > - we need a way to control how frequent xl_running_xacts are emmitted (to ensure
> > they are not triggered in a middle of an active slot invalidation test).
> >
> > I'm not sure it's possible to address Tom's concern and keep the test "predictable".
> >
> > So, I think I'd vote for Michael's proposal to implement a superuser-settable
> > developer GUC (as sending a SIGSTOP on the bgwriter (and bypass $windows_os) would
> > still not address Tom's concern anyway).
> >
> > Another option would be to "sacrifice" the full predictablity of the test (in
> > favor of real-world behavior testing)?
> >
> > [1]: https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal
>
> So, now we have the test 035 failing due to nondeterministic vacuum
> activity in the first place, and due to bgwriter's activity in the second.

Yeah, that's also my understanding.

> Maybe it would be a right move to commit the fix, and then think about
> more rare failures.

+1

> Though I have a couple of question regarding the fix left, if you don't
> mind:
> 1) The test has minor defects in the comments, that I noted before [1];
> would you like to fix them in passing?
>
> > BTW, it looks like the comment:
> > # One way to produce recovery conflict is to create/drop a relation and
> > # launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
> > in scenario 3 is a copy-paste error.

Nice catch, thanks! Fixed in v7 attached.

> > Also, there are two "Scenario 4" in this test.

D'oh! Fixed in v7.

> >
>
> 2) Shall we align the 035_standby_logical_decoding with
> 031_recovery_conflict in regard to improving stability of vacuum?

Yeah, I think that could make sense.

> I see the following options for this:
> a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
> b) use FREEZE and autovacuum = off in both tests;
> c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
>  autovacuum = off in both.
>

I'd vote for a) as I've the feeling it's "easier" to understand (and I'm not
sure using FREEZE would give full "stabilization predictability", at least for
035_standby_logical_decoding.pl). That said I did not test what the outcome would
be for 031_recovery_conflict.pl by making use of a).

> I've re-tested the v6 patch today and confirmed that it makes the test
> more stable. I ran (with bgwriter_delay = 10000 in temp.config) 20 tests in
> parallel and got failures ('inactiveslot slot invalidation is logged with
> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
> (With unlimited CPU, when average test duration is around 70 seconds.)
>
> But with v6 applied, 60 iterations succeeded.

Nice! Thanks for the testing!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v7-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch text/x-diff 5.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurenz Albe 2024-01-19 09:20:50 Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]
Previous Message Shubham Khanna 2024-01-19 08:48:53 Re: speed up a logical replica setup