Re: pgsql: Avoid duplicate XIDs at recovery when building initial snapshot

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pgsql: Avoid duplicate XIDs at recovery when building initial snapshot
Date: 2018-10-22 03:03:26
Message-ID: 20181022030326.GE14282@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

(moving to -hackers)

On Sun, Oct 14, 2018 at 10:42:40AM -0700, Andres Freund wrote:
> I'm unhappy this approach was taken over objections. Without a real
> warning.

Oops, that was not clear to me. Sorry about that! I did not see you
objecting again after the last arguments I raised. The end of PREPARE
TRANSACTION is designed like that since it has been introduced, so
changing the way the dummy GXACT is inserted before the main one is
cleared from its XID does not sound wise to me. The current design is
also present for a couple of reasons, please see this thread:
https://www.postgresql.org/message-id/13905.1119109281@sss.pgh.pa.us
This has resulted in e26b0abd.

Among the things I thought are:
- Clearing the XID at the same time the dummy entry is added, which
actually means to hold on ProcArrayLock longer while doing more at the
end of prepare. I actually don't think you can do that cleanly without
endangering the transaction visibility for other backends, and syncrep
may cause the window to get wider.
- Changing GetRunningTransactionData so as duplicates are removed at
this stage. However this also requires to hold ProcArrayLock for
longer. For most deployments, if no dummy entries from 2PC transactions
are present it would be possible to bypass the checks to remove the
duplicated entries, but if at least one dummy entry is found it would be
necessary to scan again the whole ProcArray, which would be most likely
the case at each checkpoint with workloads like what Konstantin has
mentioned in the original report. And ProcArrayLock is already a point
of contention for many OLTP workloads with small transactions. So the
performance argument worries me.

Speaking of which, I have looked at the performance of qsort, and for a
couple of thousand entries, we may not see any impact. But I am not
confident enough to say that it would be OK for all platforms each time
a standby snapshot is taken when ordering works on 4-bytes elements, so
the performance argument from Konstantin seems quite sensible to me (see
the quickly-hacked qsort_perf.c attached).

> Even leaving the crummyness aside, did you check other users of
> XLOG_RUNNING_XACTS, e.g. logical decoding?

I actually spent some time checking that, so it is not innocent.
SnapBuildWaitSnapshot() waits for transactions to commit or abort based
on the XIDs in the record. And that's the only place where those XIDs
are used so it won't matter to wait twice for the same transaction to
finish. The error callback would be used only once.
--
Michael

Attachment Content-Type Size
qsort_perf.c text/x-csrc 812 bytes

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Amit Langote 2018-10-22 03:09:18 Re: pgsql: Set pg_class.relhassubclass for partitioned indexes
Previous Message Tom Lane 2018-10-22 02:59:27 Re: pgsql: Set pg_class.relhassubclass for partitioned indexes

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-10-22 03:04:26 Re: found xmin x from before relfrozenxid y
Previous Message Amit Langote 2018-10-22 02:50:32 Re: relhassubclass and partitioned indexes