RE: prevent immature WAL streaming

From: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "masao(dot)fujii(at)oss(dot)nttdata(dot)com" <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "mengjuan(dot)cmj(at)alibaba-inc(dot)com" <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, Ryo Matsumura <matsumura(dot)ryo(at)fujitsu(dot)com>
Subject: RE: prevent immature WAL streaming
Date: 2021-10-13 12:53:37
Message-ID: DBAPR07MB69523FCDF7B372389C6EF124F6B79@DBAPR07MB6952.eurprd07.prod.outlook.com
Lists: pgsql-hackers

On 2021-Sep-25, Alvaro Herrera wrote:
>> On 2021-Sep-24, Alvaro Herrera wrote:
>>
>> > Here's the set for all branches, which I think are really final, in
>> > case somebody wants to play and reproduce their respective problem
>> > scenarios.
>>
>> I forgot to mention that I'll wait until 14.0 is tagged before getting anything
>> pushed.

Hi Alvaro, sorry for being late to the party, but to add some reassurance that the committed v2 fix really solves the initial production problem, I've done limited testing on it (just like with the v1 patch idea earlier, using wal_keep_segments, wal_init_zero=on, archive_mode=on and archive_command='/bin/true').

- On 12.8, as last time, I was able to manually reproduce it in 3 out of 3 tries, and I got: 2x "invalid contrecord length", 1x "there is no contrecord flag" on the standby.

- On the soon-to-become-12.9 REL_12_STABLE (with commit 1df0a914d58f2bdb03c11dfcd2cb9cd01c286d59), in 4 out of 4 tries I got a beautiful insight into what happened:
LOG: started streaming WAL from primary at 1/EC000000 on timeline 1
LOG: sucessfully skipped missing contrecord at 1/EBFFFFF8, overwritten at 2021-10-13 11:22:37.48305+00
CONTEXT: WAL redo at 1/EC000028 for XLOG/OVERWRITE_CONTRECORD: lsn 1/EBFFFFF8; time 2021-10-13 11:22:37.48305+00
...and the slave was able to carry on automatically. In the 4th test, cascading replication was tested too (m -> s1 -> s11), and both {s1,s11} behaved properly and logged the above message. An additional check also confirmed that, after simulating an ENOSPC crash on the master, the data contents were identical everywhere (m=s1=s11).
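
For anyone repeating these runs, the outcomes above could be tallied from the standby log with something like the sketch below. It is only an illustration, not what I actually ran: the function name and category labels are made up, and the patterns rely solely on the message fragments quoted in this mail.

```python
import re
from collections import Counter

# Message fragments as quoted in this mail; the category names are hypothetical.
PATTERNS = {
    "invalid_contrecord_length": re.compile(r"invalid contrecord length"),
    "no_contrecord_flag": re.compile(r"there is no contrecord flag"),
    # Captures the LSN of the skipped continuation record, e.g. 1/EBFFFFF8.
    "skipped_missing_contrecord": re.compile(
        r"skipped missing contrecord at ([0-9A-F]+/[0-9A-F]+)"
    ),
}


def classify_standby_log(lines):
    """Count contrecord-related messages and collect the skipped LSNs."""
    counts = Counter()
    skipped_lsns = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if name == "skipped_missing_contrecord":
                    skipped_lsns.append(match.group(1))
                break  # at most one category per log line
    return counts, skipped_lsns


# The three log lines pasted above, verbatim.
sample = [
    "LOG: started streaming WAL from primary at 1/EC000000 on timeline 1",
    "LOG: sucessfully skipped missing contrecord at 1/EBFFFFF8, overwritten at 2021-10-13 11:22:37.48305+00",
    "CONTEXT: WAL redo at 1/EC000028 for XLOG/OVERWRITE_CONTRECORD: lsn 1/EBFFFFF8; time 2021-10-13 11:22:37.48305+00",
]

counts, lsns = classify_standby_log(sample)
print(dict(counts), lsns)
```

A standby that hit the old failure would show up under one of the first two categories, while a patched one should only report the skipped-contrecord entry together with its LSN.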

Thank you, Alvaro, and everybody else who participated in solving this challenging and really nasty edge-case bug.

-J.
