Re: ERROR: subtransaction logged without previous top-level txn record

From: Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>
To: "Hsu, John" <hsuchen(at)amazon(dot)com>
Cc: "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: ERROR: subtransaction logged without previous top-level txn record
Date: 2019-10-24 09:59:30
Message-ID: 87ftjifoql.fsf@ars-thinkpad
Lists: pgsql-bugs pgsql-hackers


Our customer also encountered this issue and I've looked into it. The problem is
reproduced well enough using the instructions in the previous message.

The check leading to this ERROR is too strict: it forbids legitimate behaviour.
Say we have the following in WAL:

[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> <confirmed_flush_lsn> ]

- The first xl_xact_assignment record is out of reach for reading, i.e. it lies
before restart_lsn, where a ready snapshot will be taken from disk.
- After restart_lsn there is some change of a subxact.
- After that, there is a second xl_xact_assignment (for another subxact)
revealing the relationship between the top-level xact and the first subxact;
this is where the ERROR fires.

Such a transaction won't be streamed because we haven't seen it in full. It must
have finished before streaming starts, i.e. before confirmed_flush_lsn.
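
To make the scenario concrete, here is a minimal Python sketch of it. It is a
toy model, not PostgreSQL code: the record shapes, the xids 100 and 101, the
function replay() and the exception SubxactOrderError are all invented for
illustration. The check is assumed to have the form "the top-level xact gets
its entry only now, while the subxact already has one".

```python
# Toy model only; not PostgreSQL code. Record shapes and xids are invented.
class SubxactOrderError(Exception):
    """Stands in for the ERROR in the subject line."""


def replay(wal, strict=True):
    """Replay records found after restart_lsn; return xids that got an entry."""
    known = set()  # xids that already have a (toy) reorderbuffer entry
    for record in wal:
        if record[0] == "change":
            # A change creates an entry for whatever xid wrote it.
            known.add(record[1])
        elif record[0] == "assignment":
            _, top, subs = record
            for sub in subs:
                new_top = top not in known
                new_sub = sub not in known
                if strict and new_top and not new_sub:
                    raise SubxactOrderError(
                        "subtransaction logged without previous top-level txn record"
                    )
                known.update((top, sub))
        # Other record kinds (e.g. "commit") do not matter for this check.
    return known


# What is readable after restart_lsn: a change of subxact 101, then the
# assignment linking it to top-level xact 100, then the commit.
wal_after_restart = [
    ("change", 101),
    ("assignment", 100, [101]),
    ("commit", 100),
]

try:
    replay(wal_after_restart, strict=True)
    strict_outcome = "error not raised"
except SubxactOrderError:
    strict_outcome = "error raised"

relaxed_known = replay(wal_after_restart, strict=False)
```

With the strict check the replay fails even though the WAL is perfectly
legitimate; without it both xids are simply recorded, and the transaction is
not streamed anyway because it finishes before confirmed_flush_lsn.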

Of course, the easiest fix is to just throw the check out. However, supposing
that someone would probably want to relax it instead, I considered ways to
accomplish this. Something like 'if we are still in SNAPSHOT_FULL and xid is
before SnapBuildNextPhaseAt, just ignore xl_xact_assignment record, we haven't
seen such xact in full and definitely won't stream it.' That led to discovery of
another bug in the place which I had found suspicious long before.

Snapbuilder enters SNAPBUILD_CONSISTENT immediately after deserializing the
snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
just a complete snapshot (the snapshot by itself is just as good in FULL state
as in CONSISTENT), but also a reorderbuffer filled with all currently running
xacts. This is painless for decoding sessions with existing slots because they
won't stream anything before confirmed_flush_lsn is reached anyway, at which
point all transactions which hadn't got into the reorderbuffer will definitely
have finished. However, new slots might be created too early, thus losing (not
decoding) parts of transactions committed after the freshly created
confirmed_flush_lsn. This can happen under the following extremely unlucky
circumstances:
- New slot creation reserves the point in WAL from which it will read;
- It logs xl_running_xacts to start assembling a snapshot;
- A running decoding session with another slot quickly reads this
xl_running_xacts and serializes its snapshot;
- The new slot reads xl_running_xacts and picks this snapshot up, saying that it
is ready to stream henceforth, though its reorderbuffer is empty.
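
The effect of that last step can be sketched with a small Python toy model. It
is not PostgreSQL code and heavily simplifies the snapshot builder: the
function stream(), its flag trust_serialized_snapshot, the record shapes and
xid 700 are assumptions made up for illustration.

```python
# Toy model only; not PostgreSQL code. Names and record shapes are invented.
def stream(wal, start, trust_serialized_snapshot):
    """Decode wal[start:] and return the (xid, rows) pairs that get streamed."""
    reorder_buffer = {}   # xid -> rows decoded so far
    unfinished = set()    # xacts that were running when we began building
    never_stream = set()  # xacts we may have seen only partially
    consistent = False
    streamed = []
    for record in wal[start:]:
        kind = record[0]
        if kind == "running_xacts" and not consistent:
            if trust_serialized_snapshot:
                # Pick up a snapshot serialized by another session and
                # declare ourselves consistent right away.
                consistent = True
            else:
                # Wait until everything running at this point has finished.
                unfinished = set(record[1])
                never_stream |= unfinished
                consistent = not unfinished
        elif kind == "insert":
            _, xid, row = record
            reorder_buffer.setdefault(xid, []).append(row)
        elif kind == "commit":
            xid = record[1]
            rows = reorder_buffer.pop(xid, [])
            if xid in unfinished:
                unfinished.discard(xid)
                consistent = consistent or not unfinished
            if consistent and xid not in never_stream:
                streamed.append((xid, rows))
    return streamed


# xact 700 inserts once before the new slot's start point and once after it.
wal = [
    ("insert", 700, "first insert"),
    ("running_xacts", [700]),   # the new slot starts reading here (index 1)
    ("insert", 700, "second insert"),
    ("commit", 700),
]

lossy = stream(wal, start=1, trust_serialized_snapshot=True)
safe = stream(wal, start=1, trust_serialized_snapshot=False)
```

In the first call the transaction is streamed with only its second insert; in
the second call it is not streamed at all, which is the expected outcome for a
transaction that was already running when the slot was created.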

Exact steps to reproduce:

-- session 1
create table t (i int);
select pg_create_logical_replication_slot('slot_1', 'test_decoding');

-- session 2
begin;
insert into t values (1);

-- session 3, start slot creation
select pg_create_logical_replication_slot('slot_2', 'test_decoding');
-- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);

-- session 1
-- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
-- need to sleep; our snap will be immediately serialized there
SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');

-- continue slot_2 creation

-- session 2: insert some more and commit
insert into t values (1);
commit;

-- now this would find second insert, but not the first one
SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');

What can we do here? Initially I thought: ok, then let's go into FULL_SNAPSHOT
upon deserializing the snap and wait for all xacts to finish as usual. However, to
my surprise I've found this impossible. That is, snapbuilder has no way to
enforce that we go into CONSISTENT only when we have seen all running xacts
completely without risking skipping legit transactions. Specifically, after
deserializing a FULL snapshot, snapbuilder must iterate over WAL further until all
running xacts finish, as we must see, with correct snapshots, all changes of every
transaction we are going to stream. However, snapbuilder can't *immediately*
notice this point, because
- Snapbuilder updates xmin (first running xact) by taking it from xl_running_xacts
(cf. SnapBuildProcessRunningXacts). Even if we guarantee that, for
each possible WAL reading start position, there is always an
xl_running_xacts record logged right before the earliest possible
streaming point -- IOW, after all xacts which we can't stream have
finished (which is currently true btw, as the slot's advancement is
considered only at xl_running_xacts) -- that would not be enough due
to races around xl_running_xacts, i.e. with WAL like
[ <T1> <restart_lsn> <T1 commit> <confirmed_flush_lsn, xrx> <T2 commit> ]
T2 might be skipped if T1 is shown as running in xl_running_xacts.
- Tracking xmin manually by recording commits is not only inefficient,
it is just not feasible because the serialized snapshot is not full: it
contains only committed catalog-modifying xacts. Thus, we can't
distinguish a non-catalog-modifying xact committed before the serialized
snapshot from a not yet committed one.

Which means only code external to snapbuilder knows the earliest point suitable
for streaming; the slot advancement machinery ensures that the <restart_lsn,
confirmed_flush_lsn> pair is always good. So a possible fix is the following: if
snapbuilder's user knows the exact LSN from which streaming is safe (an existing
slot, essentially), trust it and switch into CONSISTENT state after deserializing
the snapshot as before. OTOH, if it doesn't know it (new slot creation), go via
the usual FULL -> CONSISTENT procedure; we might transition into CONSISTENT a bit
later than it became possible, but there is nothing bad about that.
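
The rule in the paragraph above boils down to a single decision, sketched here
in Python. This is only an illustration of the rule, not the attached patch:
the names SnapState, state_after_restore() and safe_start_lsn are hypothetical,
and None is used to mean "the caller does not know a safe streaming point".

```python
# Illustration of the proposed rule only; not the attached patch.
from enum import Enum


class SnapState(Enum):
    FULL_SNAPSHOT = "full snapshot"
    CONSISTENT = "consistent"


def state_after_restore(safe_start_lsn):
    """Pick the builder state right after a serialized snapshot is restored.

    safe_start_lsn is a hypothetical parameter: the LSN from which the caller
    knows streaming is safe, or None when no such point is known.
    """
    if safe_start_lsn is not None:
        # Existing slot: the <restart_lsn, confirmed_flush_lsn> pair is
        # maintained by slot advancement, so the caller can be trusted.
        return SnapState.CONSISTENT
    # New slot: fall back to the usual FULL -> CONSISTENT procedure.
    return SnapState.FULL_SNAPSHOT


existing_slot_state = state_after_restore(0x16B3748)
new_slot_state = state_after_restore(None)
```

The only cost of the second branch is reaching CONSISTENT a bit later than
strictly necessary.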

First attached patch implements this. I don't particularly like it, but the only
alternative which I see is to rework slots advancement logic to make
<restart_lsn, confirmed_flush_lsn> pair such that there is always
xl_running_xacts before confirmed_flush_lsn which confirms all xacts running as
of restart_lsn have finished. This unnecessary complexity looks much worse.

As for the check in the topic, I nonetheless propose to remove it completely, as
in the second attached patch. Saying for sure whether the xact of some record
encountered after the snapshot was deserialized can be streamed or not requires
knowing nextXid (first not yet running xid) as of this snapshot's lsn -- all xids <
nextXid possibly haven't been seen in full and are not subject to
decoding. However, generally we don't know nextXid, which is taken from
xl_running_xacts; in particular, a snapshot can be serialized/deserialized at
XLOG_END_OF_RECOVERY. Changing that for the sake of the check in question is not
worthwhile, so just throw it out instead of trying to relax it.

In fact, I don't see what is so important about seeing the top xact first at
all. During COMMIT decoding we'll know all subxids anyway, so why care?

P.S. While digging into this, I have noted that the return values of
SnapBuildFindSnapshot seem pretty random to me. Basically, returning 'true'
immediately does 4 things:
- update xmin
- purge old xip entries
- advance xmin of the slot
- if CONSISTENT, advance lsn (earliest serialized snap)

The latter two make sense only after the slot is created or confirmed_flush_lsn
is reached. The first two make sense even immediately after deserializing the
snapshot (because it is serialized *before* updating xmin and xip); generally,
always when committed xids are tracked. Then why is cleanup done when the xmin
horizon is too low? Why is it not performed after restoring a serialized snapshot?
On the whole, I find this not very important as all these operations are pretty
cheap and harmless if executed too early -- it would be simpler to just do them
always.

Arseny Sher
Postgres Professional:
The Russian Postgres Company

Attachment Content-Type Size
0001-Fix-serialized-snapshot-usage-for-new-logical-slots.patch text/x-diff 13.4 KB
0002-Stop-demanding-that-top-xact-must-be-seen-before-sub.patch text/x-diff 2.0 KB
