Re: Replication slot drop message is sent after pgstats shutdown.

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication slot drop message is sent after pgstats shutdown.
Date: 2022-03-19 02:26:29
Message-ID: 20220319022629.GA2822749@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 18, 2022 at 01:24:15PM -0700, Andres Freund wrote:
> On 2022-03-18 00:28:37 -0700, Noah Misch wrote:
> > ===
> > diff -U3 /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out
> > --- /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out Tue Feb 15 06:58:14 2022
> > +++ /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out Tue Feb 15 11:38:14 2022
> > @@ -29,16 +29,17 @@
> > t
> > (1 row)
> >
> > -step s2_init: <... completed>
> > -ERROR: canceling statement due to user request
> > step s1_view_slot:
> > SELECT slot_name, slot_type, active FROM pg_replication_slots WHERE slot_name = 'slot_creation_error'
> >
> > -slot_name|slot_type|active
> > ----------+---------+------
> > -(0 rows)
> > +slot_name |slot_type|active
> > +-------------------+---------+------
> > +slot_creation_error|logical |t
> > +(1 row)
> >
> > step s1_c: COMMIT;
> > +step s2_init: <... completed>
> > +ERROR: canceling statement due to user request
> >
> > starting permutation: s1_b s1_xid s2_init s1_c s1_view_slot s1_drop_slot
> > step s1_b: BEGIN;
> > ===
> >
> > I can make it fail that way by injecting a 1s delay here:
> >
> > --- a/src/backend/tcop/postgres.c
> > +++ b/src/backend/tcop/postgres.c
> > @@ -3339,6 +3339,7 @@ ProcessInterrupts(void)
> > */
> > if (!DoingCommandRead)
> > {
> > + pg_usleep(1 * 1000 * 1000);
> > LockErrorCleanup();
> > ereport(ERROR,
> > (errcode(ERRCODE_QUERY_CANCELED),
>
> So isolationtester still sees the blocking condition from before the cancel
> processing is finished and thus proceeds to the next statement - which it
> normally should only do once the other running step is detected as still
> blocking?

Essentially. It called s1_view_slot too early. s2_init can remain blocked
arbitrarily long after pg_cancel_backend returns. Writing
s1_cancel_s2(s2_init) directs the isolationtester to send the cancel, then
wait for s2_init to finish, then wait for the cancel to finish.

> I wonder if we should emit <waiting> everytime a step is detected anew as
> being blocked to make it easier to understand issues like this.

Good question.

> > diff --git a/contrib/test_decoding/specs/slot_creation_error.spec b/contrib/test_decoding/specs/slot_creation_error.spec
> > index 6816696..d1e35bf 100644
> > --- a/contrib/test_decoding/specs/slot_creation_error.spec
> > +++ b/contrib/test_decoding/specs/slot_creation_error.spec
> > @@ -35,7 +35,7 @@ step s2_init {
> > # The tests first start a transaction with an xid assigned in s1, then create
> > # a slot in s2. The slot creation waits for s1's transaction to end. Instead
> > # we cancel / terminate s2.
> > -permutation s1_b s1_xid s2_init s1_view_slot s1_cancel_s2 s1_view_slot s1_c
> > +permutation s1_b s1_xid s2_init s1_view_slot s1_cancel_s2(s2_init) s1_view_slot s1_c
> > permutation s1_b s1_xid s2_init s1_c s1_view_slot s1_drop_slot # check slot creation still works
> > -permutation s1_b s1_xid s2_init s1_terminate_s2 s1_c s1_view_slot
> > +permutation s1_b s1_xid s2_init s1_terminate_s2(s2_init) s1_c s1_view_slot
> > # can't run tests after this, due to s2's connection failure
>
> That looks good to me.

Pushed. Kyotaro Horiguchi had posted a patch that also changed the
pg_terminate_backend call in temp-schema-cleanup.spec. I think that one is
fine as-is, because it does pg_terminate_backend(pg_backend_pid()). There's
no way for a backend running that particular command to report that it's ready
for another query, so the problem doesn't arise.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ajin Cherian 2022-03-19 03:40:34 Re: logical replication empty transactions
Previous Message Andres Freund 2022-03-19 01:47:07 Re: ICU for global collation