Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: tender wang <tndrwang(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
Subject: Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Date: 2023-12-13 04:56:23
Message-ID: CAFiTN-tw2x_2SijBnzMekTXEF8-Pf15SafU48YO-eRh3L_s13w@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 12, 2023 at 6:58 PM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> [Added Andrey again in CC, because as I understand they are using this
> code or something like it in production. Please don't randomly remove
> people from CC lists.]

Oh, glad to know that. Yeah, I generally do not remove people, but I
noticed that at some point in the thread some of the reviewers
replied only to me and the hackers' list, and from that point onwards
I lost track of the CC list.

> I've been looking at this some more, and I'm not confident in that the
> group clog update stuff is correct. I think the injection points test
> case was good enough to discover a problem, but it's hard to get peace
> of mind that there aren't other, more subtle problems.

Yeah, I agree.

> The problem I see is that the group update mechanism is designed around
> contention of the global xact-SLRU control lock; it uses atomics to
> coordinate a single queue when the lock is contended. So if we split up
> the global SLRU control lock using banks, then multiple processes using
> different bank locks might not contend. OK, this is fine, but what
> happens if two separate groups of processes encounter contention on two
> different bank locks? I think they will both try to update the same
> queue, and coordinate access to that *using different bank locks*. I
> don't see how can this work correctly.

Let's back up a bit and start from the current design with the
centralized lock. With that design, if one process is holding the
lock, the other processes will try to perform a group update. If a
group is already being formed (its leader has not yet got the lock)
but it is updating a different CLOG page than the one this process
wants to update, then the process will not add itself to that group;
instead it will fall back to the normal lock wait. In another
situation it may happen that the leader of the other group has
already got the control lock, in which case it will have cleared
'procglobal->clogGroupFirst', which means we will now start forming a
new group. So, talking only about the optimization part, the
assumption is that when a lot of concurrent xids are committing at
the same time, those xids are mostly from the same range, so they
fall on the same SLRU page and the group update helps them. If we get
an out-of-range xid from some long-running transaction, it might not
even go for the group update because its page number will be
different. The situation should actually be better with the bank-wise
locks, because if those xids fall in an altogether different bank
they might not even contend.

Now, let's talk about correctness. Even though the processes we get
might be contending on different bank locks, we still ensure that all
the processes in a group are trying to update the same SLRU page (and
therefore the same bank; the exception is discussed below [1]). One
of the processes becomes the leader, and as soon as the leader gets
the lock it detaches the queue from 'procglobal->clogGroupFirst' by
setting it to INVALID_PGPROCNO, so that other group update requesters
can then form another, parallel group. So I do not see a problem with
correctness here.
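
To spell out what I mean by detaching, here is a rough standalone
sketch of the queue handling using C11 atomics (the real code uses
pg_atomic_* operations on 'procglobal->clogGroupFirst' and the PGPROC
array; push_self, detach_group and the Member struct below are only
illustrative). Followers CAS themselves onto the head; the one that
saw an empty head becomes the leader, and once the leader holds the
lock it swaps the head back to 'empty' so later requesters form a
new, independent group:

#include <stdatomic.h>
#include <stdint.h>

#define INVALID_MEMBER  (-1)    /* stand-in for INVALID_PGPROCNO */

typedef struct Member
{
    int      next;          /* next member in the group list */
    int64_t  page;          /* CLOG page this member wants to update */
} Member;

static Member     members[128];                 /* stand-in for the proc array */
static atomic_int group_first = INVALID_MEMBER; /* stand-in for clogGroupFirst */

/*
 * Follower: push myself onto the group list.  Returns true if the
 * list was empty, i.e. this backend becomes the group leader.
 */
static _Bool
push_self(int me)
{
    int head = atomic_load(&group_first);

    do
    {
        members[me].next = head;
    } while (!atomic_compare_exchange_weak(&group_first, &head, me));

    return head == INVALID_MEMBER;
}

/*
 * Leader: after acquiring the lock, detach the whole list so that
 * later requesters start forming a new, parallel group.
 */
static int
detach_group(void)
{
    return atomic_exchange(&group_first, INVALID_MEMBER);
}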

I agree someone might say that, since different groups can now form
under different bank locks, no other group can start forming until we
get the lock for our own bank, because we do not clear
'procglobal->clogGroupFirst' before we get the lock. Other requesters
might want to update pages in different banks, so why block them?
But the group update design is optimized for the case where all the
requesters are trying to update the status of xids generated in
nearly the same range.

> I suspect the first part of that algorithm, where atomics are used to
> create the list without a lock, might work fine. But will each "leader"
> process, each of which is potentially using a different bank lock,
> coordinate correctly? Maybe this works correctly because only one
> process will find the queue head not empty? If this is what happens,
> then there needs to be comments about it.

Yes, you got it right; I will try to add a comment explaining this better.

> Without any explanation,
> this seems broken and potentially dangerous, as some transaction commit
> bits might become lost given high enough concurrency and bad luck.
>
> Maybe this can be made to work by having one more lwlock that we use
> solely to coordinate this task.

Do you mean a separate lock for adding/removing entries in the list,
instead of the atomic operation? I think we would then lose the
benefit of the group update, because we would just be contending on
another lock.

[1] I think we already know about the exception case, and I have
explained in the comments as well that in some cases we might add
update requests for different clog pages to the same group. To handle
that exceptional case we check the respective bank lock for each
page, and if the exception has occurred we release the old bank lock
and acquire the new one. This case might not be performant, because
the leader might have to wait again on another bank lock after it has
already got one, but it is an extremely exceptional case, so we
should not worry about its performance; and I do not see any
correctness issue here either.
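
Continuing the standalone sketch from above (bank_for_page,
BankLockAcquire/BankLockRelease and apply_status_update are assumed
helpers for illustration, not the patch's real functions), that
exceptional path would look roughly like this: the leader walks the
detached list and, whenever a member's page maps to a different bank,
it swaps bank locks before applying that member's update:

/* Assumed helpers, declared only so the sketch is self-contained. */
extern int  bank_for_page(int64_t page);     /* map a CLOG page to a bank */
extern void BankLockAcquire(int bank);
extern void BankLockRelease(int bank);
extern void apply_status_update(int member); /* set the member's xid status */

static void
leader_apply_group(int first, int64_t leader_page)
{
    int cur_bank = bank_for_page(leader_page);

    BankLockAcquire(cur_bank);

    for (int m = first; m != INVALID_MEMBER; m = members[m].next)
    {
        int bank = bank_for_page(members[m].page);

        if (bank != cur_bank)
        {
            /* exceptional case: a member queued a page from another bank */
            BankLockRelease(cur_bank);
            BankLockAcquire(bank);
            cur_bank = bank;
        }

        apply_status_update(m);
    }

    BankLockRelease(cur_bank);
}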

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
