Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

From: Greg Nancarrow <gregn4422(at)gmail(dot)com>
To: Maxim Orlov <m(dot)orlov(at)postgrespro(dot)ru>
Cc: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Pengchengliu <pengchengliu(at)tju(dot)edu(dot)cn>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
Date: 2021-06-23 02:11:06
Message-ID: CAJcOf-engk+dyWJcdZzNZct1aGG2gqTbPzorDeTARKxrP89QAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 23, 2021 at 1:06 AM Maxim Orlov <m(dot)orlov(at)postgrespro(dot)ru> wrote:
>
> The analysis in the beginning of the discussion seems to be right, but
> the fix v2 looks too invasive for me.
>
> Personally, I'd like not to remove snapshot even if transaction is
> read-only. I propose to consider "xid < TransactionXmin" as a legit case
> and just promote xid to TransactionXmin.
>
> It's annoying this old bug still not fixed. What do you think?

This v3 patch doesn't look right to me at all.
It's not addressing the fundamental problem, it just seems to be
working around it, by fiddling a xid value to avoid an assertion
failure.

I really can't see how the v2 patch "removes a snapshot" in the
read-only transaction case, and is "invasive".
Have you found a case that the v2 patch breaks?

The v2 patch does follow the analysis in the beginning of the
discussion, which identified that in setting up parallel workers, a
"later transaction snapshot" was taken than the one actually used in
the statement execution, and that's what eventually leads to the
observed Assertion failure.
The original problem reporter stated: "So the root cause is the
Parallel Workers process set the TransactionXmin with later
transcation snapshot".
Also, as far as I can see, it makes no sense to pass parallel workers
both an active snapshot and a (later) transaction snapshot. In the
leader, prior to execution and running parallel workers, a transaction
snapshot is obtained and pushed as the active snapshot (note: only ONE
snapshot at this point). It makes little sense to me to then obtain
ANOTHER transaction snapshot when setting up parallel workers, and
pass that (only) to the workers along with the original (earlier)
active snapshot. (If there is a reason for passing the TWO snapshots
to parallel workers, original code authors and/or snapshot experts
should speak up now, and code comments should be added accordingly to
explain the purpose and how it is MEANT to work)
This is why the v2 patch updates the code to just pass one snapshot
(the active snapshot) to the parallel workers, which gets restored in
each worker as the transaction snapshot and set as the active snapshot
(so is then consistent with the snapshot setup in the parallel
leader).
The v3 patch doesn't address any of this at all. It is just checking
if the xid is < the TransactionXMin in the snapshot, and if so,
setting the xid to be TransactionXMin, and thus avoiding the Assertion
failure. But that TransactionXMin was from the "later transaction
snapshot", which was only obtained for the workers and doesn't
necessarily match the earlier active snapshot taken, that is
associated with the actual statement execution. As far as I am
concerned, that "later transaction snapshot", taken for use by the
workers, doesn't seem right, so that's why the v2 patch removes the
use of it and only uses the active snapshot, for consistency with the
rest of the code. I think that if there was something fundamentally
wrong with the v2 patch's removal of that "later transaction snapshot"
used by the parallel workers, then surely there would be a test
failure somewhere - but that is not the case.
I think the v2 patch should be restored as the proposed solution here.

Regards,
Greg Nancarrow
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-06-23 02:37:43 Re: subscription/t/010_truncate.pl failure on desmoxytes in REL_13_STABLE
Previous Message houzj.fnst@fujitsu.com 2021-06-23 01:04:51 RE: [bug?] Missed parallel safety checks, and wrong parallel safety