Re: Why is subscription/t/031_column_list.pl failing so much?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Why is subscription/t/031_column_list.pl failing so much?
Date: 2024-02-06 20:36:29
Message-ID: 631312.1707251789@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> More to the point, aren't these proposals just band-aids that
> would stabilize the test without fixing the actual problem?
> The same thing is likely to happen to people in the field,
> unless we do something drastic like removing ALTER SUBSCRIPTION.

I've been able to make the 031_column_list.pl failure pretty
reproducible by adding a delay in walsender, as attached.

While I'm not too familiar with this code, it definitely does appear
that the new walsender is told to start up at an LSN before the
creation of the publication, and then if it needs to decide whether
to stream a particular data change before it's reached that creation,
kaboom!

I read and understood the upthread worries about it not being
a great idea to ignore publication lookup failures, but I really
don't see that we have much choice. As an example, if a subscriber
is humming along reading publication pub1, and then someone
drops and then recreates pub1 on the publisher, I don't think that
the subscriber will be able to advance through that gap if there
are any operations within it that require deciding if they should
be streamed. (That is, contrary to Amit's expectation that
DROP/CREATE would mask the problem, I suspect it will instead turn
it into a hard failure. I've not experimented though.)
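
The two situations described above (a walsender starting at an LSN
before the publication exists, and a DROP/CREATE gap) can be sketched
with a toy model. This is not PostgreSQL code: the record layout, the
decode() function, and the PublicationMissing exception are all
hypothetical, and the model assumes only that a publication is looked up
by name against the catalog state as of the change being decoded.

```python
# Toy model only: none of these names come from PostgreSQL's source.
class PublicationMissing(Exception):
    """Raised when a change must be filtered but the publication is absent."""


def decode(wal, start_lsn, pubname):
    """Replay WAL records in order; return LSNs of changes that were streamed.

    Catalog state is tracked from the beginning of the log, but only
    changes at or after start_lsn need a publication lookup.
    """
    catalog = set()  # publications that exist "as of" the current record
    streamed = []
    for lsn, kind, arg in wal:
        if kind == "create_pub":
            catalog.add(arg)
        elif kind == "drop_pub":
            catalog.discard(arg)
        elif kind == "change" and lsn >= start_lsn:
            # Deciding whether to stream requires the publication to exist now.
            if pubname not in catalog:
                raise PublicationMissing(
                    f'publication "{pubname}" does not exist at LSN {lsn}'
                )
            streamed.append(lsn)
    return streamed


# Case 1: decoding starts before the publication was created.
wal_late_create = [
    (10, "change", "t1"),
    (20, "create_pub", "pub1"),
    (30, "change", "t1"),
]

# Case 2: the publication is dropped and recreated, with a change in the gap.
wal_drop_create = [
    (10, "create_pub", "pub1"),
    (20, "change", "t1"),
    (30, "drop_pub", "pub1"),
    (40, "change", "t1"),
    (50, "create_pub", "pub1"),
    (60, "change", "t1"),
]
```

In this model, decode(wal_late_create, 5, "pub1") raises at the first
change, while starting at LSN 20 succeeds. decode(wal_drop_create, 10,
"pub1") streams the change at LSN 20 and then raises on the change inside
the gap, so it never reaches the recreated publication. Whether the real
walsender behaves the same way in the DROP/CREATE case is exactly what
the message says has not been tested.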

BTW, this same delay patch breaks two other subscription tests:
015_stream.pl and 022_twophase_cascade.pl.
The symptoms are different (no "publication does not exist" errors),
so maybe these are just test problems not fundamental weaknesses.
But "replication falls over if the walsender is slow" isn't
something I'd call acceptable.

regards, tom lane

Attachment: hack-add-delay-in-walsender-loop.patch (text/x-diff, 487 bytes)
