From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com |
Subject: | Re: Assertion failure in SnapBuildInitialSnapshot() |
Date: | 2022-11-15 02:38:37 |
Message-ID: | 20221115023837.cxbawee6pbtdg5yp@awork3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2022-11-14 17:25:31 -0800, Andres Freund wrote:
> Hm, also, shouldn't the patch adding CRS_USE_SNAPSHOT have copied more of
> SnapBuildExportSnapshot()? Why aren't the error checks for
> SnapBuildExportSnapshot() needed? Why don't we need to set XactReadOnly? Which
> transactions are we even in when we import the snapshot (cf.
> SnapBuildExportSnapshot() doing a StartTransactionCommand()).
Most of the checks for that are in CreateReplicationSlot() - but not al,
e.g. XactReadOnly isn't set, nor do we enforce in an obvious place that we
don't already hold a snapshot.
I first thought this might directly explain the problem, due to the
MyProc->xmin assignment in SnapBuildInitialSnapshot() overwriting a value that
could influence the return value for GetOldestSafeDecodingTransactionId(). But
that happens later, and we check that MyProc->xmin is invalid at the start.
But it still seems supicious. This will e.g. influence whether
StartupDecodingContext() sets PROC_IN_LOGICAL_DECODING. Which probably is
fine, but...
Another theory: I dimly remember Thomas mentioning that there's some different
behaviour of xlogreader during shutdown as part of the v15 changes. I don't
quite remember what the scenario leading up to that was. Thomas?
It's certainly interesting that we see stuff like:
2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] STATEMENT: START_REPLICATION SLOT "pg_16400_sync_16395_7163433409941550636" LOGICAL 0/1D2B650 (proto_version '3', origin 'any', publication_names '"testpub"')
ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
2022-11-08 00:20:23.255 GMT [248][logical replication worker] ERROR: error while shutting down streaming COPY: ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
It could entirely be caused by postmaster slowly killing processes after the
assertion failure and that that is corrupting shared memory state though. But
it might also be related.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2022-11-15 02:50:15 | Re: Direct I/O |
Previous Message | Amit Langote | 2022-11-15 02:04:28 | Re: List of Bitmapset (was Re: ExecRTCheckPerms() and many prunable partitions) |