Re: Introduce wait_for_subscription_sync for TAP tests

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce wait_for_subscription_sync for TAP tests
Date: 2022-09-30 02:11:49
Message-ID: CA+hUKG+WKsZpdoryeqM8_rk5uQPCqS2HGY92WiMGFsK2wVkcig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 12, 2022 at 10:42 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Sat, Sep 10, 2022 at 10:00 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > On Sat, Sep 10, 2022 at 9:45 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > > Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> writes:
> > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef
> > > > is relevant.
> > >
> > > Noting that the errors have only appeared in the past couple of
> > > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f
> > > (Fix recovery_prefetch with low maintenance_io_concurrency).
> >
> > Yeah, I also just spotted the coincidence of those failures while
> > monitoring the build farm. I'll look into this later today. My
> > initial suspicion is that there was pre-existing code here that was
> > (incorrectly?) relying on the lack of error reporting in that case.
> > But maybe I misunderstood and it was incorrect to report the error for
> > some reason that was not robustly covered with tests.
>
> After I wrote that I saw Sawada-san's message and waited for more
> information, and I see there was now a commit. I noticed that
> peripatus was already logging the 'missing contrecord' error even when
> it didn't fail the test, and still does. I'm still looking into that
> (ie whether I need to take that new report_invalid_record() call out
> and replace it with errormsg_deferred = true so that XLogReadRecord()
> returns NULL with no error message in this case).

I will go ahead and remove this new error message added by adb46615.
The message was correlated with the problem on peripatus fixed by
88f48831, but not the cause of it -- but it's also not terribly
helpful and might be confusing. It might be reported: (1) in
pg_waldump when you hit the end of the segment with a missing
contrecord after the end, arguably rightfully, but then perhaps
someone might complain that they expect an error from pg_waldump only
on the final live segment at the end of the WAL, and (2) in a
walsender that is asked to shut down while between reads of pages with
a spanning contrecord, reported by logical_read_xlog_page() with a
messageless error (presumably peripatus's case?), (3) in crash
recovery with wal_recycle off (whereas normally you'd expect to see
the page_addr message from a recycled file), maybe more legitimately
than the above. The problem I needed to solve can be solved without
the message just by setting that flag as mentioned above, so I'll do
that to remove the new noise.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2022-09-30 02:15:19 Re: more descriptive message for process termination due to max_slot_wal_keep_size
Previous Message Andres Freund 2022-09-30 02:09:33 Re: longfin and tamandua aren't too happy but I'm not sure why