Re: Why is subscription/t/031_column_list.pl failing so much?

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Why is subscription/t/031_column_list.pl failing so much?
Date: 2024-02-07 09:55:48
Message-ID: CAA4eK1KBK6ndZ6E+2SvPNAxZ2xNnykz_Qb5Yz6BFY3U-pEeC7g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 7, 2024 at 2:06 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I wrote:
> > More to the point, aren't these proposals just band-aids that
> > would stabilize the test without fixing the actual problem?
> > The same thing is likely to happen to people in the field,
> > unless we do something drastic like removing ALTER SUBSCRIPTION.
>
> I've been able to make the 031_column_list.pl failure pretty
> reproducible by adding a delay in walsender, as attached.
>
> While I'm not too familiar with this code, it definitely does appear
> that the new walsender is told to start up at an LSN before the
> creation of the publication, and then if it needs to decide whether
> to stream a particular data change before it's reached that creation,
> kaboom!
>
> I read and understood the upthread worries about it not being
> a great idea to ignore publication lookup failures, but I really
> don't see that we have much choice. As an example, if a subscriber
> is humming along reading publication pub1, and then someone
> drops and then recreates pub1 on the publisher, I don't think that
> the subscriber will be able to advance through that gap if there
> are any operations within it that require deciding if they should
> be streamed.
>

Right. One idea to address those worries was to add a new
subscription option such as ignore_nonexistant_pubs (or some better
name). Setting it to true would make us ignore publication lookup
failures and continue replication; setting it to false would raise an
error, as we do now. If we agree that such an option is useful, or at
least saves us in some cases as discussed in another thread [1], we
can make true the default so that users don't hit such errors by
default but still have a way to go back to the current behavior.
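
To make the proposed semantics concrete, here is a rough model in
Python. It is only a sketch and not walsender/pgoutput code: the
function visible_publications, the Publication class, the
ignore_missing flag, and the LSN values are all hypothetical names and
numbers invented for illustration. It models a change being decoded at
an LSN that precedes the creation of one of the subscribed
publications.

```python
from dataclasses import dataclass


class PublicationMissing(Exception):
    """Raised when a subscribed publication is not visible at the decoding point."""


@dataclass
class Publication:
    name: str
    created_at: int  # hypothetical LSN at which CREATE PUBLICATION took effect


def visible_publications(subscribed, catalog, change_lsn, ignore_missing):
    """Return names of subscribed publications visible for a change at change_lsn."""
    visible = []
    for name in subscribed:
        pub = catalog.get(name)
        if pub is None or pub.created_at > change_lsn:
            if ignore_missing:
                continue  # option true: skip this publication, keep replicating
            # option false: fail the lookup, which is the current behavior
            raise PublicationMissing(f'publication "{name}" does not exist')
        visible.append(pub.name)
    return visible


# Hypothetical catalog: pub2 is created later than pub1.
catalog = {
    "pub1": Publication("pub1", created_at=100),
    "pub2": Publication("pub2", created_at=300),
}

# A change at LSN 200 is decoded before pub2 exists.
lenient = visible_publications(["pub1", "pub2"], catalog, 200, ignore_missing=True)

try:
    visible_publications(["pub1", "pub2"], catalog, 200, ignore_missing=False)
    strict_error = None
except PublicationMissing as exc:
    strict_error = str(exc)
```

With the option on, the change is evaluated against pub1 only and
replication continues; with it off, the same change stops the stream
with the lookup error. Once decoding has passed the creation of pub2,
both settings give the same result.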

> (That is, contrary to Amit's expectation that
> DROP/CREATE would mask the problem, I suspect it will instead turn
> it into a hard failure. I've not experimented though.)
>

That is not contrary to what I said: I was suggesting DROP/CREATE of
the subscription, whereas you are talking about dropping and
recreating the publication.

> BTW, this same change breaks two other subscription tests:
> 015_stream.pl and 022_twophase_cascade.pl.
> The symptoms are different (no "publication does not exist" errors),
> so maybe these are just test problems not fundamental weaknesses.
>

As per the initial analysis, this is because those tests involve
somewhat larger transactions (more than 64kB), so the test simply
times out waiting for all the data to be replicated. We will do
further analysis and share the findings.

--
With Regards,
Amit Kapila.
