Re: Slow standby snapshot

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Slow standby snapshot
Date: 2021-11-07 11:37:39
Message-ID: 5FDD25C2-7806-490A-8D28-483F1A70D8D5@yandex-team.ru
Lists: pgsql-hackers

Sorry for such a late reply. I've been thinking about possible approaches.
KnownAssignedXids over a hash table was in fact implemented long ago and rejected [0].

> On 3 Aug 2021, at 22:35, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2021-08-03 10:33:50 +0500, Andrey Borodin wrote:
>>> On 3 Aug 2021, at 03:01, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
>>>> The main idea is simple optimistic optimization - store offset to next
>>>> valid entry. So, in most cases, we could just skip all the gaps.
>>>> Of course, it adds some additional impact for workloads without long
>>>> (few seconds) transactions but it is almost not detectable (because of
>>>> CPU caches).
>>>
>>> I'm doubtful that that's really the right direction. For workloads that
>>> are replay heavy we already often can see the cost of maintaining the
>>> known xids datastructures show up significantly - not surprising, given
>>> the datastructure. And for standby workloads with active primaries the
>>> cost of searching through the array in all backends is noticeable as
>>> well. I think this needs a bigger data structure redesign.
>>
>> KnownAssignedXids implements simple membership test idea. What kind of
>> redesign would you suggest? Proposed optimisation makes it close to optimal,
>> but needs eventual compression.
>
> Binary searches are very inefficient on modern CPUs (unpredictable memory
> accesses, unpredictable branches). We constantly need to do binary searches
> during replay to remove xids from the array. I don't see how you can address
> that with just the current datastructure.
The current patch addresses a different problem: in the presence of an old enough transaction, enumerating KnownAssignedXids under a shared lock prevents adding new transactions under an exclusive lock, and recovery effectively pauses.

A binary search can incur 10-15 cache misses, which is an unreasonable amount of memory stalls, but that's a somewhat different problem.
Also, binary search is not that expensive when we compress KnownAssignedXids often.

>> Maybe use a hashtable of running transactions? It will be slightly faster
>> when adding\removing single transactions. But much worse when doing
>> KnownAssignedXidsRemove().
>
> Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
> write KnownAssignedXidsGet[AndSetXmin]()?
I was thinking of how inefficient KnownAssignedXidsRemovePreceding() would be with a hash table. But that is probably not such a frequent operation.

>> Maybe use a tree? (AVL\RB or something like that) It will be slightly better, because it does not need eventual compression like the existing array.
>
> I'm not entirely sure what datastructure would work best. I can see something
> like a radix tree work well, or a skiplist style approach. Or a hashtable:
>
> I'm not sure that we need to care as much about the cost of
> KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
> less frequent. But more importantly, it'd not be hard to maintain an
> occasionally (or opportunistically) maintained denser version for
> GetSnapshotData() - there's only a single writer, making the concurrency
> issues a lot simpler.

I've been prototyping a radix tree for a while.
Here every 4 xids are summarized by their minimum xid and the number of underlying xids. Of course, 4 is an arbitrary number; the summarization area should really be of cache-line size.
┌───────┐
│ 1 / 9 │
├───────┴────┐
│ └────┐
│ └────┐
│ └────┐
▼ └───▶
┌───────────────────────────────┐
│ 1 / 3 | 5 / 0 | 9 / 3 | D / 3 │
├───────┬───────┬────────┬──────┴────┐
│ └─┐ └───┐ └────┐ └─────┐
│ └─┐ └──┐ └────┐ └─────┐
│ └─┐ └──┐ └────┐ └────┐
▼ └─┐ └──┐ └───┐ └────┐
┌───────────────▼────────────┴─▶────────────┴──▶───────────┴───▶
│ 1 | 2 | | 4 | | | | | 9 | | B | C | D | E | F | │
└──────────────────────────────────────────────────────────────┘
The bottom layer is the current array (TransactionId *KnownAssignedXids).
When we remove an xid, we touch close to the theoretical minimum of cache lines: I'd say 5-7 instead of the 10-15 of a binary search (in the case of millions of entries in KnownAssignedXids).
Enumerating running xids is not that difficult either: we would need to scan O(xip) memory, not the whole KnownAssignedXids array.

But the overall complexity of this approach does not seem good to me.

All in all, I think the proposed "KnownAssignedXidsNext" patch solves a real problem, and the problem of binary searches should be addressed by compressing KnownAssignedXids more often.

Currently we do not compress the array:

    if (nelements < 4 * PROCARRAY_MAXPROCS ||         /* it's not that big yet, or */
        nelements < 2 * pArray->numKnownAssignedXids) /* it contains less than half bloat */
        return;

From my POV the arbitrary constant 4 is just too high.

Summary: I think (the "KnownAssignedXidsNext" patch + compressing KnownAssignedXids more often) is better than a major KnownAssignedXids redesign.

Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/commit/2871b4618af1acc85665eec0912c48f8341504c4
