Re: prevent immature WAL streaming

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "masao(dot)fujii(at)oss(dot)nttdata(dot)com" <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "mengjuan(dot)cmj(at)alibaba-inc(dot)com" <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, "Jakub(dot)Wartak(at)tomtom(dot)com" <Jakub(dot)Wartak(at)tomtom(dot)com>, Ryo Matsumura <matsumura(dot)ryo(at)fujitsu(dot)com>
Subject: Re: prevent immature WAL streaming
Date: 2021-11-23 20:40:35
Message-ID: 202111232040.fzyfnrdwtxu6@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2021-Nov-23, Tom Lane wrote:

> We're *still* not out of the woods with 026_overwrite_contrecord.pl,
> as we are continuing to see occasional "mismatching overwritten LSN"
> failures, further down in the test where it tries to start up the
> standby:

Augh.

> Looking at adjacent successful runs, it seems that the exact point
> where the "missing contrecord" starts varies substantially, even after
> our previous fix to disable autovacuum in this test. How could that be?

Well, there is intentionally some variability. Maybe not as much as one
would wish, but I expect that that should explain why that point is not
always the same.

> It's probably for the best though, because I think this is exposing
> an actual bug that we would not have seen if the start point were
> completely consistent. I have not dug into the code, but it looks to
> me like if the "consistent recovery state" is reached exactly at a
> page boundary (0/1FFE000 in all these cases), then the standby expects
> that to be what the OVERWRITE_CONTRECORD record will point at. But
> actually it points to the first WAL record on that page, resulting
> in a bogus failure.

So what is happening is that we set state->overwrittenRecPtr to the LSN
of page start, ignoring the page header. Is that the LSN of the first
record in a page? I'll see if I can reproduce the problem.

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"La persona que no quería pecar / estaba obligada a sentarse
en duras y empinadas sillas / desprovistas, por cierto
de blandos atenuantes" (Patricio Vogel)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2021-11-23 20:51:51 Re: Avoiding smgrimmedsync() during nbtree index builds
Previous Message Justin Pryzby 2021-11-23 19:51:09 Re: pg_upgrade parallelism