Re: ERROR: subtransaction logged without previous top-level txn record

From: Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Hsu\, John" <hsuchen(at)amazon(dot)com>, "pgsql-bugs\(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: ERROR: subtransaction logged without previous top-level txn record
Date: 2020-02-03 13:46:05
Message-ID: 871rrb942q.fsf@ars-thinkpad
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers


Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:

> So, doesn't this mean that it started occurring after the fix done in
> commit 96b5033e11 [1]? Because before that fix we wouldn't have
> allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
> SNAPBUILD_FULL_SNAPSHOT state. I am not telling the fix in that
> commit is wrong, but just trying to understand the situation here.

Nope. Consider again example of WAL above triggering the error:

[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]

Decoder starting reading WAL at <restart_lsn> where he immediately reads
from disk snapshot serialized earlier, which makes it jump to
SNAPBUILD_CONSISTENT right away. It doesn't read xl_xact_assignment_1,
but it reads xl_xact_assignment_2 already in SNAPBUILD_CONSISTENT state,
so catches the error regardless of this commit.

>> Well, almost. This is true as long initial snapshot construction process
>> goes the long way of building the snapshot by itself. If it happens to
>> pick up from disk ready snapshot pickled there by another decoding
>> session, it fast path'es to SNAPBUILD_CONSISTENT, which is technically a
>> bug as described in
>> https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad
>>
>
> Can't we deal with this separately? If so, I think let's not mix the
> discussions for both as the root cause of both seems different.

These issues are related: before removing the check it would be nice to
ensure that there is no bugs it might protect us from (and it turns out
there actually is, though it won't always protect, and though this bug
has very small probability). Moreover, they are about more or less
subject -- avoiding partially decoded xacts -- and once you dived deep
enough to deal with one, it is reasonable to deal with another instead
of doing that twice. But as a practical matter, removing the check is
simple one-liner, and its presence causes people troubles -- so I'd
suggest doing that first and then deal with the rest. I don't think
starting new thread is worthwhile here, but if you think it does, I can
create it.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2020-02-03 14:15:42 Re: BUG #16171: Potential malformed JSON in explain output
Previous Message Daniel Gustafsson 2020-02-03 13:03:04 Re: Unable to trigger createdb

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-02-03 13:49:18 Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node
Previous Message Andres Freund 2020-02-03 13:23:19 Re: Cache relation sizes?