Re: Assertion failure in SnapBuildInitialSnapshot()

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject: Re: Assertion failure in SnapBuildInitialSnapshot()
Date: 2022-11-15 03:26:50
Message-ID: CA+hUKGLNx-_2ydQNfxU5gQSZXzkQx=R1QBjQ-z4Y948Uu7ZRnQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 15, 2022 at 3:38 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2022-11-14 17:25:31 -0800, Andres Freund wrote:
> Another theory: I dimly remember Thomas mentioning that there's some different
> behaviour of xlogreader during shutdown as part of the v15 changes. I don't
> quite remember what the scenario leading up to that was. Thomas?

Yeah. So as mentioned in:

https://www.postgresql.org/message-id/CA%2BhUKG%2BWKsZpdoryeqM8_rk5uQPCqS2HGY92WiMGFsK2wVkcig%40mail.gmail.com

I still have on my list to remove a new "missing contrecord" error
that can show up in a couple of different scenarios that aren't
exactly error conditions depending on how you think about it, but I
haven't done that yet. I am not currently aware of anything bad
happening because of those messages, but I could be wrong.

> It's certainly interesting that we see stuff like:
>
> 2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
> 2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] STATEMENT: START_REPLICATION SLOT "pg_16400_sync_16395_7163433409941550636" LOGICAL 0/1D2B650 (proto_version '3', origin 'any', publication_names '"testpub"')
> ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
> 2022-11-08 00:20:23.255 GMT [248][logical replication worker] ERROR: error while shutting down streaming COPY: ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710

Right, so that might fit the case described in my email above:
logical_read_xlog_page() notices that it has been asked to shut down
when it is between reads of pages with a spanning contrecord. Before,
it would fail silently, so XLogReadRecord() returns NULL without
setting *errmsg, but now it complains about a missing contrecord. In
the case where it was showing up on that other thread, just a few
machines often seemed to log that error when shutting down --
peripatus for example -- I don't know why, but I assume something to
do with shutdown timing and page alignment.

> It could entirely be caused by postmaster slowly killing processes after the
> assertion failure and that that is corrupting shared memory state though. But
> it might also be related.

Hmm.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2022-11-15 03:44:24 Re: libpq compression (part 2)
Previous Message Andres Freund 2022-11-15 02:50:15 Re: Direct I/O