Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

From: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
To: Greg Nancarrow <gregn4422(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Maxim Orlov <m(dot)orlov(at)postgrespro(dot)ru>, Michael Paquier <michael(at)paquier(dot)xyz>, Pengchengliu <pengchengliu(at)tju(dot)edu(dot)cn>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
Date: 2021-08-03 14:00:05
Message-ID: CALT9ZEFA47XhPPZ9ggcnQi=++otKsZVOJgT+WjXQqfmF-YCfRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

пт, 23 июл. 2021 г. в 10:00, Greg Nancarrow <gregn4422(at)gmail(dot)com>:

> On Thu, Jul 22, 2021 at 2:15 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> > Thanks to Thomas Munro for drawing my attention to this thread. I
> > wasn't intentionally ignoring it, but there's a lot of email in the
> > world and only so much time.
> >
> > When I wrote this code originally, the idea that I had in mind was
> > simply this: whatever state we have in the leader ought to be
> > reproduced in each worker. So if there's an active snapshot in the
> > leader, we ought to make that active in all the workers, and if
> > there's a transaction snapshot in the leader, we ought to make that
> > the transaction snapshot in all of the workers.
> >
> > But I see now that my thinking was fuzzy, and I'm going to blame that
> > on the name GetTransactionSnapshot() being slightly misleading. If
> > IsolationUsesXactSnapshot() is true, then there's really such a thing
> > as a transaction snapshot and reproducing that in the worker is a
> > sensible thing to do. But when !IsolationUsesXactSnapshot(),
> > GetTransactionSnapshot() doesn't just "get the transaction snapshot",
> > because there isn't any such thing. It takes a whole new snapshot, on
> > the theory that you wouldn't be calling this function unless you had
> > finished up with the snapshot you got the last time you called this
> > function. And in the case of initiating parallel query, that is the
> > wrong thing.
> >
> > I think that, at least in the case where IsolationUsesXactSnapshot()
> > is true, we need to make sure that calling GetTransactionSnapshot() in
> > a worker produces the same result that it would have produced in the
> > leader. Say one of the workers calls an sql or plpgsql function and
> > that function runs a bunch of SQL statements. It seems to me that
> > there's probably a way for this to result in calls inside the worker
> > to GetTransactionSnapshot(), and if that doesn't return the same
> > snapshot as in the leader, then we've broken MVCC.
> >
> > What about when IsolationUsesXactSnapshot() is false? Perhaps it's OK
> > to just skip this altogether in that case. Certainly what we're doing
> > can't be right, because copying a snapshot that wouldn't have been
> > taken without parallel query can't ever be the right thing to do.
> > Perhaps we need to copy something else instead. I'm not really sure.
> >
> > So I think v2 is probably on the right track, but wrong when the
> > transaction isolation level is REPEATABLE READ or SERIALIZABLE, and v3
> > and v4 just seem like unprincipled hacks that try to avoid the
> > assertion failure by lying about whether there's a problem.
> >
>
> Many thanks for taking time to respond to this (and thanks to Thomas Munro
> too).
> It's much appreciated, as this is a complex area.
> For the time being, I'll reinstate the v2 patch (say as v5) as a
> partial solution, and then work on addressing the REPEATABLE READ and
> SERIALIZABLE transaction isolation levels, which you point out are not
> handled correctly by the patch.
>
`
I've looked at v5 patch. It is completely similar to v2 patch, which I've
already tested using the workflow, I've described in the comments above.
Before the patch I get the errors quite soon, the patch corrects them.
Furthermore I've tested the same patch under REPEATABLE READ and
SERIALIZABLE and detected no flaws. So, now, when we've got Robert's
explanation, it seems to me that v2 (aka v5) patch can be committed
(leaving possible REPEATABLE READ and SERIALIZABLE improvements for the
future). I don't really sure it is still possible on 07/21 СF, but I'd
change status of the patch to the ready-for-committer. Also I'd like the
bugfix to be backported to the previous PG versions.

Further consideration on the patch and on the backporting is very much
welcome!

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com <http://www.postgrespro.com>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2021-08-03 14:10:23 Re: Use generation context to speed up tuplesorts
Previous Message Tom Lane 2021-08-03 13:39:50 Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace