Re: A few patches to clarify snapshot management

From: Noah Misch <noah(at)leadboat(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A few patches to clarify snapshot management
Date: 2025-08-09 22:23:38
Message-ID: 20250809222338.cc.nmisch@google.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 07, 2025 at 11:55:00AM +0200, Heikki Linnakangas wrote:
> On 07/01/2025 00:00, Andres Freund wrote:
> > On 2024-12-20 19:31:01 +0200, Heikki Linnakangas wrote:
> > > While playing around some more with this, I noticed that this code in
> > > GetTransactionSnapshot() is never reached, and AFAICS has always been dead
> > > code:
> > >
> > > > Snapshot
> > > > GetTransactionSnapshot(void)
> > > > {
> > > > /*
> > > > * Return historic snapshot if doing logical decoding. We'll never need a
> > > > * non-historic transaction snapshot in this (sub-)transaction, so there's
> > > > * no need to be careful to set one up for later calls to
> > > > * GetTransactionSnapshot().
> > > > */
> > > > if (HistoricSnapshotActive())
> > > > {
> > > > Assert(!FirstSnapshotSet);
> > > > return HistoricSnapshot;
> > > > }
> > >
> > > when you think about it, that's good, because it doesn't really make sense
> > > to call GetTransactionSnapshot() during logical decoding. We jump through
> > > hoops to make the historic catalog decoding possible with historic
> > > snapshots, tracking subtransactions that modify catalogs and WAL-logging
> > > command ids, but they're not suitable for general purpose queries. So I
> > > think we should turn that into an error, per attached patch.
> >
> > Hm. I'm not sure it's a good idea to forbid this. Couldn't there be sane C
> > code in an output functions calling GetTransactionSnapshot() or such to do
> > some internal lookups?
>
> I haven't seen any. And I don't think that would work correctly while doing
> logical decoding anyway, because historical snapshots only track XIDs that
> modify catalogs. regclassout and enumout do work because they use the
> catalog snapshot rather than GetTransactionSnapshot().
>
> (I committed that change in commit 1585ff7387 already, but discussion is
> still welcome of course)

https://github.com/2ndQuadrant/pglogical does rely on the pre-1585ff7387 code
for its row_filter feature. row_filter calls ExecEvalExpr() from the output
plugin, e.g. to evaluate expression "id between 2 AND 4" arising from this
configuration in the pglogical test suite:

SELECT * FROM pglogical.replication_set_add_table('default', 'basic_dml', false, row_filter := $rf$id between 2 AND 4$rf$);

One of the GetTransactionSnapshot() calls is in pglogical_output_plugin.c
itself. For that, I could work around the change by forcing the old
HistoricSnapshot use:

- PushActiveSnapshot(GetTransactionSnapshot());
+ Assert(HistoricSnapshotActive());
+ PushActiveSnapshot(GetCatalogSnapshot(InvalidOid));

That doesn't get far. Calling a plpgsql function in the expression reaches
the "cannot take query snapshot during logical decoding" error via this stack
trace:

GetTransactionSnapshot at snapmgr.c:279:3
exec_eval_simple_expr at pl_exec.c:6214:3
(inlined by) exec_eval_expr at pl_exec.c:5699:6
exec_stmt_raise at pl_exec.c:3820:8
(inlined by) exec_stmts at pl_exec.c:2096:10
exec_stmt_block at pl_exec.c:1955:6
exec_toplevel_block at pl_exec.c:1646:7
plpgsql_exec_function at pl_exec.c:636:5
plpgsql_call_handler at pl_handler.c:278:11
fmgr_security_definer at fmgr.c:755:52
ExecInterpExpr at execExprInterp.c:927:7
pglogical_change_filter at pglogical_output_plugin.c:663:7
(inlined by) pg_decode_change at pglogical_output_plugin.c:691:7
change_cb_wrapper at logical.c:1121:22
ReorderBufferApplyChange at reorderbuffer.c:2078:3
(inlined by) ReorderBufferProcessTXN at reorderbuffer.c:2383:7
DecodeCommit at decode.c:743:3
(inlined by) xact_decode at decode.c:242:5
LogicalDecodingProcessRecord at decode.c:123:1
XLogSendLogical at walsender.c:3442:33
WalSndLoop at walsender.c:2837:7
StartLogicalReplication at walsender.c:1504:2
(inlined by) exec_replication_command at walsender.c:2158:6
PostgresMain at postgres.c:4762:10
BackendMain at backend_startup.c:80:2
postmaster_child_launch at launch_backend.c:291:3
BackendStartup at postmaster.c:3587:8
(inlined by) ServerLoop at postmaster.c:1702:6
PostmasterMain at postmaster.c:1252:6
main at main.c:165:4

> > Hm. I'm not sure it's a good idea to forbid this. Couldn't there be sane C
> > code in an output functions calling GetTransactionSnapshot() or such to do
> > some internal lookups?

I think pglogical_output_plugin.c w/ plpgsql is largely sane when used with a
plpgsql function that consults only catalogs and the output tuple. If the
pglogical test suite is representative, that's the usual case for a
row_filter. A plpgsql function that reads user tables will be fragile with
concurrent pruning, but a user might sanely accept that fragility. A plpgsql
function that writes tuples is not sane in a row_filter. How do you see it?

So far, I know of these options:

1. Make pglogical block the row_filter feature for any v18+ origin.
2. Revert postgresql.git commit 1585ff7387.
3. Make pglogical use HistoricSnapshot where pglogical_output_plugin.c handles
snapshots directly. That should keep simple row_filter expressions like
"col > 0" functioning. Entering plpgsql or similarly-complex logic will
fail with "cannot take query snapshot during logical decoding", and we'll
consider that to be working as intended.
4. Fail later and lazily, for just the most-unreasonable cases. For example,
fail when HistoricSnapshot applies to a write operation. (Maybe this
already fails. I didn't check.)

Which of those or other options should we consider?

For reference, https://github.com/2ndQuadrant/pglogical/pull/503 is an
otherwise-working port of pglogical to v18. Its Makefile currently disables
the tests that reach "cannot take query snapshot during logical decoding".

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Kirill Reshke 2025-08-10 05:45:15 Re: VM corruption on standby
Previous Message Aleksander Alekseev 2025-08-09 20:54:42 Re: VM corruption on standby