RE: prevent immature WAL streaming

From: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>
To: "alvherre(at)alvh(dot)no-ip(dot)org" <alvherre(at)alvh(dot)no-ip(dot)org>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "mengjuan(dot)cmj(at)alibaba-inc(dot)com" <mengjuan(dot)cmj(at)alibaba-inc(dot)com>
Subject: RE: prevent immature WAL streaming
Date: 2021-08-25 11:59:45
Message-ID: VI1PR0701MB69608F806BD102C8B4C3C6FDF6C69@VI1PR0701MB6960.eurprd07.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Álvaro, -hackers,

> I attach the patch with the change you suggested.

I've gave a shot to to the v02 patch on top of REL_12_STABLE (already including 5065aeafb0b7593c04d3bc5bc2a86037f32143fc). Previously(yesterday) without the v02 patch I was getting standby corruption always via simulation by having separate /pg_xlog dedicated fs, and archive_mode=on, wal_keep_segments=120, archive_command set to rsync to different dir on same fs, wal_init_zero at default(true).

Today (with v02) I've got corruption in only initial 2 runs out of ~ >30 tries on standby. Probably the 2 failures were somehow my fault (?) or some rare condition (and in 1 of those 2 cases simply restarting standby did help). To be honest I've tried to force this error, but with v02 I simply cannot force this error anymore, so that's good! :)

> I didn't have a lot of luck with a reliable reproducer script. I was able to
> reproduce the problem starting with Ryo Matsumura's script and attaching
> a replica; most of the time the replica would recover by restarting from a
> streaming position earlier than where the problem occurred; but a few
> times it would just get stuck with a WAL segment containing a bogus
> record.

In order to get reliable reproducer and get proper the fault injection instead of playing with really filling up fs, apparently one could substitute fd with fd of /dev/full using e.g. dup2() so that every write is going to throw this error too:

root(at)hive:~# ./t & # simple while(1) { fprintf() flush () } testcase
root(at)hive:~# ls -l /proc/27296/fd/3
lrwx------ 1 root root 64 Aug 25 06:22 /proc/27296/fd/3 -> /tmp/testwrite
root(at)hive:~# gdb -q -p 27296
-- 1089 is bitmask O_WRONLY|..
(gdb) p dup2(open("/dev/full", 1089, 0777), 3)
$1 = 3
(gdb) c
Continuing.
==>
fflush/write(): : No space left on device

So I've also tried to be malicious while writing to the DB and inject ENOSPCE near places like:

a) XLogWrite()->XLogFileInit() near line 3322 // assuming: if (wal_init_zero) is true, one gets classic "PANIC: could not write to file "pg_wal/xlogtemp.90670": No space left on device"
b) XLogWrite() near line 2547 just after pg_pwrite // one can get "PANIC: could not write to log file 000000010000003B000000A8 at offset 0, length 15466496: No space left on device" (that would be possible with wal_init_zero=false?)
c) XLogWrite() near line 2592 // just before issue_xlog_fsync to get "PANIC: could not fdatasync file "000000010000004300000004": Invalid argument" that would pretty much mean same as above but with last possible offset near end of WAL?

This was done with gdb voodoo:
handle SIGUSR1 noprint nostop
break xlog.c:<LINE> // https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/transam/xlog.c#L3311
c
print fd or openLogFile -- to verify it is 3
p dup2(open("/dev/full", 1089, 0777), 3) -- during most of walwriter runtime it has current log as fd=3

After restarting master and inspecting standby - in all of those above 3 cases - the standby didn't inhibit the "invalid contrecord length" at least here, while without corruption this v02 patch it is notorious. So if it passes the worst-case code review assumptions I would be wondering if it shouldn't even be committed as it stands right now.

-J.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-08-25 12:03:24 Re: replay of CREATE TABLESPACE eats data at wal_level=minimal
Previous Message Masahiko Sawada 2021-08-25 11:32:01 Re: Failure of subscription tests with topminnow