Re: Build-farm - intermittent error in 031_column_list.pl

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Build-farm - intermittent error in 031_column_list.pl
Date: 2022-05-26 02:37:27
Message-ID: CAA4eK1K40xhObN1MWO7=rzrJmo+oQ048O8drZ86-F7artVvwQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, May 25, 2022 at 6:54 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 5/25/22 13:26, Amit Kapila wrote:
> > On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
> > <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> >>
> >> It does "fix" the case of [1]. But AFAIS
> >> RelationSyncEntry.replicate_valid is only used to inhibit repeated
> >> loading in get_rel_sync_entry and the function doesn't seem to be
> >> assumed to return a invalid entry. (Since the flag is not checked
> >> nowhere else.)
> >>
> >> For example pgoutput_change does not check for the flag of the entry
> >> returned from the function before uses it, which is not seemingly
> >> safe. (I didn't check further, though)
> >>
> >> Don't we need to explicitly avoid using invalid entries outside the
> >> function?
> >>
> >
> > We decide that based on pubactions in the callers, so even if entry is
> > valid, it won't do anything. Actually, we don't need to avoid setting
> > replication_valid flag as some of the publications for the table may
> > be already present. We can check if the publications_valid flag is set
> > while trying to validate the entry. Now, even if we don't find any
> > publications the replicate_valid flag will be set but none of the
> > actions will be set, so it won't do anything in the caller. Is this
> > better than the previous approach?
> >
>
> For the record, I'm not convinced this is the right way to fix the
> issue, as it may easily mask the real problem.
>
> We do silently ignore missing objects in various places, but only when
> either requested or when it's obvious it's expected and safe to ignore.
> But I'm not sure that applies here, in a clean way.
>
> Imagine you have a subscriber using two publications p1 and p2, and
> someone comes around and drops p1 by mistake. With the proposed patch,
> the subscription will notice this, but it'll continue sending data
> ignoring the missing publication. Yes, it will continue working, but
> it's quite possible this breaks the subscriber and it's be better to
> fail and stop replicating.
>

Ideally, shouldn't we disallow drop of publication in such cases where
it is part of some subscription? I know it will be tricky because some
subscriptions could be disabled.

> The other aspect I dislike is that we just stop caching publication
> info, forcing us to reload it for every replicated change/row. So even
> if dropping the publication happens not to "break" the subscriber (i.e.
> the data makes sense), this may easily cause performance issues, lag in
> the replication, and so on. And the users will have no idea why and/or
> how to fix it, because we just do this silently.
>

Yeah, this is true that if there are missing publications, it needs to
reload all the publications info again unless we build a mechanism to
change the existing cached entry by loading only required info. The
other thing we could do here is to LOG the info for missing
publications to make users aware of the fact. I think we can also
introduce a new option while defining/altering subscription to
indicate whether to continue on missing publication or not, that way
by default we will stop replication as we are doing now but users will
have a way to move replication.

BTW, what are the other options we have to fix the cases where
replication is broken (or the user has no clue on how to proceed) as
we are discussing the case here or the OP reported yet another case on
pgsql-bugs [1]?

[1] - https://www.postgresql.org/message-id/CANWRaJyyD%3D9c1E2HdF-Tqfe7%2BvuCQnAkXd6%2BEFwxC0wM%3D313AA%40mail.gmail.com

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhihong Yu 2022-05-26 02:40:58 Re: adding status for COPY progress report
Previous Message Tom Lane 2022-05-26 02:35:06 Re: "ERROR: latch already owned" on gharial