Re: BUG #17928: Standby fails to decode WAL on termination of primary

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Sergei Kornilov <sk(at)zsrv(dot)org>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date: 2023-09-04 03:20:31
Message-ID: CA+hUKG+jFN64gKVGcmfNTekqWn3cemRx99-B8DDBaDyzWnfpkw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Thanks for looking/testing, Sergei. Thanks for the changes, Michael,
these all look good. I've squashed them and added you as co-author.

A couple more small comment/text changes:

1. In the place where we fail to allocate memory for an oversized
record, I copied the comment about treating that as a "bogus data"
condition. I suspect that we will soon be converting that to a FATAL
error[1], and that'll need to be done in both places.

2. In this version of the commit message I said we'd only back-patch
to 15 for now. After sleeping on this for a week, I realised that the
reason I keep vacillating on that point is that I am not sure what
your plan is for the malloc-failure-means-end-of-wal policy ([1],
ancient code from 0ffe11abd3a). If we're going to fix that in master
only but let sleeping dogs lie in the back-branches, then it becomes
less important to go back further than 15 with THIS patch. But if you
want to be able to distinguish garbage from out-of-memory, and thereby
end-of-wal from a FATAL please-insert-more-RAM condition, I think
you'd really need this industrial strength validation in all affected
branches, and I'd have more work to do, right? The weak validation we
are fixing here is the *real* underlying problem going back many
years, right?

I also wondered about strengthening the validation of various things
like redo begin/end LSNs etc in these tests. But we can always
continue to improve all this later...

Here also is a version for 15 (and a CI run[2]), since we tweaked many
of the error messages between 15 and 16.

[1] https://www.postgresql.org/message-id/flat/ZMh/WV%2BCuknqePQQ%40paquier.xyz
[2] https://cirrus-ci.com/task/4533280897236992 (failed on some
unrelated pgbench test, reported in another thread)

Attachment Content-Type Size
v5-0001-Don-t-trust-unvalidated-xl_tot_len.backpatch15 application/octet-stream 24.0 KB
v5-0001-Don-t-trust-unvalidated-xl_tot_len.patch text/x-patch 24.2 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2023-09-04 03:54:28 Re: BUG #17928: Standby fails to decode WAL on termination of primary
Previous Message Richard Guo 2023-09-04 03:01:41 Re: BUG #18077: PostgreSQL server subprocess crashed by a SELECT statement with WITH clause