Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

From: Maxim Orlov <m(dot)orlov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Pengchengliu <pengchengliu(at)tju(dot)edu(dot)cn>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
Date: 2021-07-13 10:10:34
Message-ID: 150825360832da68872abb4d891dfc3a@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2021-07-09 20:36, Tomas Vondra wrote:
> Hi,
>
> I took a quick look on this - I'm no expert in the details of
> snapshots,
> so take my comments with a grain of salt.
>
> AFAICS both Greg Nancarrow and Pavel Borisov are kinda right. I think
> Greg is right the v3 patch does not seem like the right (or correct)
> solution, for a couple reasons:
>
>
> 1) It fixes the symptoms, not the cause. If we set TransactionXmin to a
> bogus value, this only fixes it locally in
> SubTransGetTopmostTransaction
> but I'd be surprised if other places did not have the same issue.
>
>
> 2) The xid/TransactionXmin comparison in the v2 fix:
>
> xid = xid > TransactionXmin ? xid : TransactionXmin;
>
> seems broken, because it compares the XIDs directly, but that's not
> going to work after generating enough transactions.
>
>
> 3) But even if this uses TransactionIdFollowsOrEquals, it seems very
> strange because the "xid" comes from
>
> XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
>
> i.e. it's the raw xmin from the tuple, so why should we be setting it
> to
> TransactionXmin? That seems pretty strange, TBH.
>
>
> So yeah, I think this is due to confusion with two snapshots and
> failing
> to consider both of them when calculating TransactionXmin.
>
> But I think removing one of the snapshots (as the v2 does it) is rather
> strange too. I very much doubt having both the transaction and active
> snapshots in the parallel worker is not intentional, and Pavel may very
> well be right this breaks isolation levels that use the xact snapshot
> (i.e. REPEATABLE READ and SERIALIZABLE). I haven't checked, though.
>
> So I think we need to keep both snapshots, but fix TransactionXmin to
> consider both of them - I suppose ParallelWorkerMain could simply look
> at the two snapshots, and use the minimum. Which is what [1] (already
> linked by Pavel) talks about, although that's related to concerns about
> one of the processes dying, which is not what's happening here.
>
>
> I'm wondering what consequences this may have on production systems,
> though. We've only seen this failing because of the assert, so what
> happens when the build has asserts disabled?
>
> Looking at SubTransGetTopmostTransaction, it seems the while loop ends
> immediately (because it's pretty much the opposite of the assert), so
> we
> just return the "xid" as topmost XID. But will that produce incorrect
> query results, or what?
>
>
>
> regards
>
>
> [1]
> https://www.postgresql.org/message-id/20150208002027.GH9201%40alap3.anarazel.de

PFA v4 patch based on the ideas above.

In principle I see little difference with v3. But I agree it is more
correct.

I did test this patch on Linux and MacOS using testing algo above and
got no error. On master branch before the patch I still see this error.

---
Best regards,
Maxim Orlov.

Attachment Content-Type Size
v4-0001-PG14-Fix-parallel-worker-failed-assertion-and-coredump.patch text/x-diff 1.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiro Ikeda 2021-07-13 10:13:17 Re: Fix comments of heap_prune_chain()
Previous Message Masahiro Ikeda 2021-07-13 10:05:10 Re: Fix comments of heap_prune_chain()