Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

From: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
To: Maxim Orlov <m(dot)orlov(at)postgrespro(dot)ru>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Pengchengliu <pengchengliu(at)tju(dot)edu(dot)cn>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
Date: 2021-07-13 10:23:48
Message-ID: CALT9ZEHBemL0LpBAyTUkCq_s8JMDYsRZmQkXQCfeNkDj+XUZ7A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
> > So yeah, I think this is due to confusion with two snapshots and
> > failing
> > to consider both of them when calculating TransactionXmin.
> >
> > But I think removing one of the snapshots (as the v2 does it) is rather
> > strange too. I very much doubt having both the transaction and active
> > snapshots in the parallel worker is not intentional, and Pavel may very
> > well be right this breaks isolation levels that use the xact snapshot
> > (i.e. REPEATABLE READ and SERIALIZABLE). I haven't checked, though.
> >
> > So I think we need to keep both snapshots, but fix TransactionXmin to
> > consider both of them - I suppose ParallelWorkerMain could simply look
> > at the two snapshots, and use the minimum. Which is what [1] (already
> > linked by Pavel) talks about, although that's related to concerns about
> > one of the processes dying, which is not what's happening here.
> >
> >
> > I'm wondering what consequences this may have on production systems,
> > though. We've only seen this failing because of the assert, so what
> > happens when the build has asserts disabled?

> Looking at SubTransGetTopmostTransaction, it seems the while loop ends
> > immediately (because it's pretty much the opposite of the assert), so
> > we
> > just return the "xid" as topmost XID. But will that produce incorrect
> > query results, or what?
>
I haven't seen anything incorrect on production systems with asserts turned
off, by I should note this assertion is not immediately reproduced so it is
not that easy to catch the possible logical inconsequences of parallel scan
results. As I've noticed that subxids cache is used in most cases and even
this assertion is also rare in parallel scans. Maybe that is why we lived
with this bug unnoticed so long.

/www.postgresql.org/message-id/20150208002027.GH9201%40alap3.anarazel.de
> <https://www.postgresql.org/message-id/20150208002027.GH9201%40alap3.anarazel.de>
>
> PFA v4 patch based on the ideas above.
>
> In principle I see little difference with v3. But I agree it is more
> correct.
>
> I did test this patch on Linux and MacOS using testing algo above and
> got no error. On master branch before the patch I still see this error.
>

I've tested the patch and the error doesn't appear anymore.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com <http://www.postgrespro.com>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2021-07-13 10:26:42 Re: Introduce pg_receivewal gzip compression tests
Previous Message Masahiro Ikeda 2021-07-13 10:13:17 Re: Fix comments of heap_prune_chain()