Re: LOG: invalid record length at <LSN> : wanted 24, got 0

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Harinath Kanchu <hkanchu(at)apple(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: LOG: invalid record length at <LSN> : wanted 24, got 0
Date: 2023-03-01 06:35:58
Message-ID: CALj2ACW+vHcgZntw_JHtWfkDx64JS3_eiQNoRgNynK-uGM2j5A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 1, 2023 at 10:51 AM Harinath Kanchu <hkanchu(at)apple(dot)com> wrote:
>
> Hello,
>
> We are seeing an interesting STANDBY behavior, that’s happening once in 3-4 days.
>
> The standby suddenly disconnects from the primary, and it throws the error “LOG: invalid record length at <LSN>: wanted 24, got 0”.

Firstly, this isn't an error per se, especially on a standby, which
can retry fetching the same WAL record from other sources. It's hard
to say anything more from this LOG message alone; one needs to look at
what else was happening around the same time. You mentioned that the
connection to the primary was lost, so you need to dig into why it was
lost. If the connection was lost halfway through fetching a WAL
record, the standby may emit such a LOG message.

Secondly, you definitely need to understand why the connection to the
primary keeps getting lost: network disruption, parameter changes, the
primary going down, the standby going down, etc.

> And then it tries to restore the WAL file from the archive. Due to low write activity on primary, the WAL file will be switched and archived only after 1 hr.
>
> So, it stuck in a loop of switching the WAL sources from STREAM and ARCHIVE without replicating the primary.
>
> Due to this there will be write outage as the standby is synchronous standby.

I understand this problem and there's a proposed patch to help with
this - https://www.postgresql.org/message-id/CALj2ACVryN_PdFmQkbhga1VeW10VgQ4Lv9JXO=3nJkvZT8qgfA@mail.gmail.com.

It basically lets one set a timeout on how long the standby keeps
restoring from the archive before switching back to streaming.
Therefore, in your case, the standby wouldn't have to wait 1 hour to
reconnect to the primary; it could reconnect sooner.
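
To make the intended behavior concrete, here is a toy model of the
source-switching decision. It is only a sketch under my own
assumptions, not PostgreSQL's actual code: the function name, the
"archive"/"stream" labels and the retry_interval parameter are all
hypothetical and are not taken from the patch.

```python
def next_wal_source(current, archive_exhausted, secs_in_archive,
                    retry_interval=None):
    """Toy model of which WAL source a standby reads from next.

    current:            "archive" or "stream"
    archive_exhausted:  True once no more WAL can be restored from archive
    secs_in_archive:    seconds spent restoring from the archive so far
    retry_interval:     hypothetical timeout in seconds; None models the
                        behavior without such a timeout
    """
    if current != "archive":
        return current
    # Without a timeout, the standby leaves the archive only when the
    # archive has nothing more to offer.
    if archive_exhausted:
        return "stream"
    # With a timeout, it may go back to streaming earlier.
    if retry_interval is not None and secs_in_archive >= retry_interval:
        return "stream"
    return "archive"
```

With retry_interval left as None, the model stays on "archive" until
archive_exhausted is true; with a value set, it returns to "stream" as
soon as that many seconds have passed.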

> We are using “wal_sync_method” as “fsync” assuming WAL file not getting flushed correctly.
>
> But this is happening even after making it as “fsync” instead of “fdatasync”.

I don't think wal_sync_method is the problem here, unless it was
changed to something else in between.

> Restarting the STANDBY sometimes fixes this problem, but detecting this automatically is a big problem as the postgres standby process will be still running fine, but WAL RECEIVER process is up and down continuously due to switching of WAL sources.

Yes. After failing to connect to the primary, the standby switches to
the archive and stays there until it has exhausted all the WAL in the
archive, and only then switches back to streaming. You can monitor the
standby's replication slot on the primary; if it's inactive, someone
needs to step in. As mentioned above, there's an in-progress feature
that helps in these cases.
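
As a rough illustration of the slot monitoring, something like the
following could be run against the primary. This is a minimal sketch,
assuming the standby streams through a physical replication slot; the
pg_replication_slots view and its slot_name, active and slot_type
columns are standard, while the helper name and the sample rows are
made up for illustration.

```python
# Query to run on the primary; pg_replication_slots is a system view.
INACTIVE_SLOTS_SQL = """
SELECT slot_name, active
FROM pg_replication_slots
WHERE slot_type = 'physical';
"""


def inactive_slots(rows):
    """Return the names of slots whose 'active' flag is false.

    rows: iterable of (slot_name, active) tuples, for example as fetched
    by a database driver after running INACTIVE_SLOTS_SQL.
    """
    return [name for name, active in rows if not active]
```

For example, inactive_slots([("standby1_slot", True), ("standby2_slot",
False)]) reports only standby2_slot. Since the WAL receiver comes and
goes in this scenario, a single sample can be misleading; sampling
repeatedly and alerting when the slot is inactive across several
samples is probably more useful.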

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
