Re: lastOverflowedXid does not handle transaction ID wraparound

From: Nikolay Samokhvalov <samokhvalov(at)gmail(dot)com>
To: Stan Hu <stanhu(at)gmail(dot)com>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: lastOverflowedXid does not handle transaction ID wraparound
Date: 2021-11-02 06:47:08
Message-ID: CANNMO+Lu0_pW1D1gdz4qRB0Sr7q-R_ZRjFsQ89Ti8EXD2FopQg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 25, 2021 at 11:41 AM Nikolay Samokhvalov <samokhvalov(at)gmail(dot)com>
wrote:

> On Thu, Oct 21, 2021 at 07:21 Stan Hu <stanhu(at)gmail(dot)com> wrote:
>
>> On Wed, Oct 20, 2021 at 9:01 PM Kyotaro Horiguchi
>> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
>> >
>> > lastOverflowedXid is the smallest subxid that possibly exists but
>> > possibly not known to the standby. So if all top-level transactions
>> > older than lastOverflowedXid end, that means that all the
>> > subtransactions in doubt are known to have been ended.
>>
>> Thanks for the patch! I verified that it appears to reset
>> lastOverflowedXid properly.
>
> ...

> Any ideas in the direction of observability?
>

Perhaps anything additional should be considered separately.

The behavior discussed here looks like a bug.

I have also tested the patch. It works fully as expected; details of the
testing are below.

I think this is a serious bug hitting heavily loaded Postgres setups with
hot standbys, and I propose fixing it in all supported major versions ASAP,
since the fix looks simple.

Any heavily loaded system (10k+ TPS) where subtransactions are used may
experience huge performance degradation on its standbys [1]. This is what
happened recently with GitLab [2]. A full solution to this problem is
something more complex, probably requiring changes in SLRU [3], but the
problem discussed here definitely feels like a serious bug: even if we
fully get rid of subtransactions, the 32-bit lastOverflowedXid is not
reset, so in a new XID epoch standbys start experiencing
SubtransControlLock/SubtransSLRU waits again, without any subtransactions
in use. On the one hand, this problem is extremely difficult to diagnose;
on the other, it may make standbys completely unresponsive while a
long-lasting transaction is running on the primary ("long" here may be a
matter of minutes or even dozens of seconds, depending on the TPS level).
It is especially hard to diagnose in PG 12 or older, because those
versions do not have pg_stat_slru yet, so one cannot easily notice
Subtrans reads.
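
To illustrate why a stale value can start to matter again after enough
XIDs have been consumed, here is a minimal standalone sketch of the
circular 32-bit comparison. It is a simplified model, not the PostgreSQL
source: the helper names (xid_precedes_or_equals, snapshot_suboverflowed,
demo_wraparound) are made up for illustration, and it assumes both XIDs
are "normal" XIDs, ignoring the special low values that the real
comparison functions treat separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Simplified circular comparison: id1 is considered older than or equal
 * to id2 when the signed 32-bit difference is <= 0. Assumes normal XIDs.
 */
static bool
xid_precedes_or_equals(TransactionId id1, TransactionId id2)
{
    int32_t diff = (int32_t) (id1 - id2);

    return diff <= 0;
}

/* Model of the check performed when a snapshot is taken on a standby. */
static bool
snapshot_suboverflowed(TransactionId xmin, TransactionId last_overflowed_xid)
{
    return xid_precedes_or_equals(xmin, last_overflowed_xid);
}

/* Hypothetical walk-through with a stale value that is never reset. */
static void
demo_wraparound(void)
{
    TransactionId stale = 1000;

    /* xmin is still behind the stale value: snapshot is suboverflowed. */
    assert(snapshot_suboverflowed(900, stale));

    /* xmin has moved past it: the stale value looks harmless. */
    assert(!snapshot_suboverflowed(2000, stale));

    /* More than 2^31 XIDs later, xmin appears to precede it again. */
    assert(snapshot_suboverflowed(stale + UINT32_C(0x80000000) + 5, stale));
}
```

Calling demo_wraparound() from a test main() runs the three cases. The
last one is the interesting one: no subtransactions are involved at all,
yet the snapshot is marked as suboverflowed only because of the old value.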

The only current solution to this problem is to restart standby Postgres.

How I tested the patch. First, I reproduced the problem:
- current 15devel Postgres, installed on 2 x c5ad.2xlarge on AWS
  (8 vCPUs, 16 GiB), working as primary + standby
- follow the steps described in [3] to initiate SubtransSLRU on the standby
- at some point, stop using SAVEPOINTs on the primary, use regular UPDATEs
  instead, and wait.

Using the following, observe procArray->lastOverflowedXid:

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bd3c7a47fe21949ba63da26f0d692b2ee618f885..ccf3274344d7ba52a6f28a10b08dbfc310cf97e9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2428,6 +2428,9 @@ GetSnapshotData(Snapshot snapshot)
             subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
                                                       xmax);
 
+            if (random() % 100000 == 0)
+                elog(WARNING, "procArray->lastOverflowedXid: %u", procArray->lastOverflowedXid);
+
             if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
                 suboverflowed = true;
         }

Once we stop using SAVEPOINTs on the primary, the value of
procArray->lastOverflowedXid stops changing, as expected.

Without the patch applied, lastOverflowedXid remains constant forever,
until the server is restarted. And as I mentioned, we start experiencing
SubtransSLRU waits and pg_subtrans reads.

With the patch, lastOverflowedXid is reset to 0, as expected, shortly after
the ongoing "long" transaction ends on the primary.
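
For readers who want the idea behind the reset in code form, below is a
rough standalone model. It is a sketch and not the patch itself: the
struct ProcArrayModel and the functions expire_old_xids and
snapshot_suboverflowed are hypothetical names, and the rule it encodes is
only the one quoted earlier in this message (once every top-level
transaction older than lastOverflowedXid has ended, no subtransaction can
still be in doubt, so the value can be cleared).

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical stand-in for the one shared field we care about. */
typedef struct ProcArrayModel
{
    TransactionId lastOverflowedXid;
} ProcArrayModel;

/* Simplified circular "strictly older than" test for normal XIDs. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/*
 * Called when the standby learns that everything older than
 * oldest_running_xid has finished. If the remembered overflow point is
 * older than that, it can no longer affect any snapshot, so clear it.
 */
static void
expire_old_xids(ProcArrayModel *arr, TransactionId oldest_running_xid)
{
    if (arr->lastOverflowedXid != InvalidTransactionId &&
        xid_precedes(arr->lastOverflowedXid, oldest_running_xid))
        arr->lastOverflowedXid = InvalidTransactionId;
}

/* With the value cleared, snapshots are never marked as suboverflowed. */
static bool
snapshot_suboverflowed(const ProcArrayModel *arr, TransactionId xmin)
{
    if (arr->lastOverflowedXid == InvalidTransactionId)
        return false;
    return (int32_t) (xmin - arr->lastOverflowedXid) <= 0;
}
```

In this model, a value of 1000 survives expire_old_xids(&arr, 900) but is
cleared by expire_old_xids(&arr, 1001), after which no xmin, however far
ahead, can make the snapshot look suboverflowed.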

This solves the bug: we no longer see SubtransSLRU waits on the standby
without actual use of subtransactions on the primary.

[1] https://postgres.ai/blog/20210831-postgresql-subtransactions-considered-harmful
[2] https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/
[3] https://www.postgresql.org/message-id/flat/494C5E7F-E410-48FA-A93E-F7723D859561%40yandex-team.ru#18c79477bf7fc44a3ac3d1ce55e4c169
[4] https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/21
