Re: standby promotion can create unreadable WAL

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: standby promotion can create unreadable WAL
Date: 2022-08-29 10:16:58
Message-ID: CAFiTN-s0FGJTs7GYm-uw5bzhRZ=d1m0jQ_zruAw2jYHBgP5G1w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Aug 26, 2022 at 6:14 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> However, if anything
> > did try to look at file #4 it would get confused. Maybe that can
> > happen if this is a streaming standby, where we only write an
> > end-of-recovery record upon promotion, rather than a checkpoint, or
> > maybe if there are cascading standbys someone could try to actually
> > use the 000000020000000000000004 file for something. I'm not sure. But
> > unless I'm missing something, that file is bogus, and our only hope of
> > not having problems is that perhaps no one will ever look at it.

I tried to see the problem with the cascading standby, basically the
setup is like below
pgprimary->pgstandby(archive only)->pgcascade(streaming + archive).

The second node has to be archive only because this 0 filled gap is
created in archive only mode. With that I have noticed that the when
cascading standby is getting that 0 filled gap it report same error
what we seen with pg_waldump and that it keep waiting forever on that
file. I have attached a test case, but I think timing is not done
perfectly in this test so before the cascading standby setup some of
the WAL file get removed by the pgstandby so I just put direct return
in RemoveOldXlogFiles() to test this[2]. And this problem is getting
resolved with the patch given by Robert upthread.

[1]
2022-08-25 16:21:26.413 IST [18235] LOG: invalid record length at
0/FFFFEA8: wanted 24, got 0

[2]
diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index eb5115f..990a879 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3558,6 +3558,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr
lastredoptr, XLogRecPtr endptr,
XLogSegNo endlogSegNo;
XLogSegNo recycleSegNo;

+ return;
/* Initialize info about where to try to recycle to */
XLByteToSeg(endptr, endlogSegNo, wal_segment_size);
recycleSegNo = XLOGfileslop(lastredoptr);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
test_contrewrite_cascade.sh text/x-sh 1.9 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2022-08-29 10:21:30 Re: logical decoding and replication of sequences
Previous Message David Rowley 2022-08-29 09:42:05 Re: Reducing the chunk header sizes on all memory context types