Re: Improve eviction algorithm in ReorderBuffer

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Improve eviction algorithm in ReorderBuffer
Date: 2024-01-30 08:06:34
Message-ID: CAD21AoB-7mPpKnLmBNfzfavG8AiTwEgAdVMuv=jzmAp9ex7eyQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > >
> > > > > >
> > > > > > The individual transactions shouldn't cross
> > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
> > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > transactions consuming 5% >= of the memory limit? I got the impression
> > > > > > that you are saying to maintain them in a separate transaction list
> > > > > > which doesn't seem to be the case.
> > > > >
> > > > > I meant that there are three lists in total: the first one
> > > > > maintains the transactions consuming more than 10% of
> > > > > logical_decoding_work_mem,
> > > > >
> > > >
> > > > How can we have multiple transactions in the list consuming more than
> > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > before any xact reaches logical_decoding_work_mem?
> > >
> > > Well, suppose logical_decoding_work_mem is set to 64MB; then
> > > transactions consuming more than 6.4MB are added to the list. So,
> > > for example, it's possible that the list has three transactions,
> > > each consuming 10MB, while the total memory usage in the
> > > reorderbuffer is still 30MB (less than logical_decoding_work_mem).
> > >
> >
> > Thanks for the clarification. I misunderstood the list to have
> > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > thing to note is that maintaining these lists by default can also
> > have some overhead unless the number of open transactions crosses a
> > certain threshold.
> >
>
> On further analysis, I realized that the approach discussed here might
> not be the way to go. The idea of dividing transactions into several
> subgroups is to split a large number of entries into smaller sets so
> that we can reduce the complexity of searching for a particular entry.
> Since we assume that there are no big differences in entry sizes
> within a subgroup, we can pick an entry to evict in O(1). However,
> what we really need to avoid here is increasing the number of
> evictions, because serializing an entry to disk is generally more
> costly than searching for an entry in memory.
>
> I think this is not a problem for a large-entries subgroup, but for
> the smallest-entries subgroup, say entries consuming less than 5% of
> the limit, it could end up evicting many entries. For example, there
> would be a huge difference between serializing 1 entry consuming 5% of
> the memory limit and serializing 5000 entries each consuming 0.001% of
> the memory limit. Even if we can select those 5000 entries quickly,
> the latter would be slower in total. And the more subgroups we create,
> the more complex the algorithm gets and the more overhead it could
> cause. So I think we need to search for the largest entry anyway, in
> order to minimize the number of evictions.
>
> Looking at data structures and algorithms, I think binaryheap with
> some improvements could be promising. I mentioned before why we cannot
> use the current binaryheap[1]. The missing pieces are efficient ways
> to remove an arbitrary entry and to update an arbitrary entry's key.
> The current binaryheap provides binaryheap_remove_node(), which is
> O(log n), but it requires the entry's position in the binaryheap. We
> know the entry's position just after binaryheap_add_unordered(), but
> it might change after heapify, and searching for a node's position is
> O(n). So the improvement idea is to add a hash table to the binaryheap
> that tracks the position of each entry, so that we can remove an
> arbitrary entry in O(log n) and also update an arbitrary entry's key
> in O(log n). This is known as an indexed priority queue. I've attached
> patches for that (0001 and 0002).
>
> That way, in terms of the reorderbuffer, we can update and remove a
> transaction's memory usage in O(log n) (worst case; O(1) on average)
> and then pick the largest transaction in O(1). Since we might need to
> call ReorderBufferSerializeTXN() even in the non-streaming case, we
> need to maintain the binaryheap anyway.
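
To make the indexed binaryheap idea quoted above concrete, here is a
minimal, self-contained sketch in C. It is illustrative only and is not
the attached patches: the patches add a hash table to the existing
binaryheap keyed by the stored value, whereas this sketch tracks
positions with a plain array indexed by a small integer id, just to
keep it short.

#include <stdio.h>

#define MAX_ENTRIES 1024

typedef struct
{
    int     id[MAX_ENTRIES];    /* entry id stored at each heap slot */
    long    key[MAX_ENTRIES];   /* key, e.g. a transaction's memory usage */
    int     pos[MAX_ENTRIES];   /* entry id -> heap slot */
    int     size;
} IndexedHeap;

static void heap_swap(IndexedHeap *h, int a, int b)
{
    int     tid = h->id[a];
    long    tkey = h->key[a];
    h->id[a] = h->id[b];
    h->key[a] = h->key[b];
    h->id[b] = tid;
    h->key[b] = tkey;
    h->pos[h->id[a]] = a;       /* keep the position table in sync */
    h->pos[h->id[b]] = b;
}

static void sift_up(IndexedHeap *h, int i)
{
    while (i > 0 && h->key[(i - 1) / 2] < h->key[i])
    {
        heap_swap(h, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static void sift_down(IndexedHeap *h, int i)
{
    for (;;)
    {
        int     largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < h->size && h->key[l] > h->key[largest])
            largest = l;
        if (r < h->size && h->key[r] > h->key[largest])
            largest = r;
        if (largest == i)
            break;
        heap_swap(h, i, largest);
        i = largest;
    }
}

static void heap_add(IndexedHeap *h, int id, long key)
{
    int     i = h->size++;
    h->id[i] = id;
    h->key[i] = key;
    h->pos[id] = i;
    sift_up(h, i);
}

/* Re-key an arbitrary entry in O(log n); the position table finds it in O(1) */
static void heap_update_key(IndexedHeap *h, int id, long newkey)
{
    int     i = h->pos[id];
    long    oldkey = h->key[i];
    h->key[i] = newkey;
    if (newkey > oldkey)
        sift_up(h, i);
    else
        sift_down(h, i);
}

/* Remove an arbitrary entry in O(log n) */
static void heap_remove(IndexedHeap *h, int id)
{
    int     i = h->pos[id];
    heap_swap(h, i, h->size - 1);
    h->size--;
    if (i < h->size)
    {
        sift_up(h, i);
        sift_down(h, i);
    }
}

int main(void)
{
    IndexedHeap h = {.size = 0};
    heap_add(&h, 1, 100);
    heap_add(&h, 2, 300);
    heap_add(&h, 3, 200);
    heap_update_key(&h, 1, 500);        /* txn 1 grew past txn 2 */
    printf("largest: %d\n", h.id[0]);   /* prints 1 */
    heap_remove(&h, 1);                 /* txn 1 was evicted */
    printf("largest: %d\n", h.id[0]);   /* prints 2 */
    return 0;
}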

Since updating the max-heap on every memory counter update could lead
to some regression when the number of transactions being decoded is
small, I've measured it with a case where memory counter updates
happen frequently:

setup script:
create table test (c int);
select pg_create_logical_replication_slot('s', 'test_decoding');
insert into test select generate_series(1, 8000000);

benchmark script:
set work_mem to '3GB';
set logical_decoding_work_mem to '5GB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);

Here are the results (the median of five executions):

* HEAD
5274.765 ms

* HEAD + 0001-0003 patch
5532.203 ms

There was an approximately 5% performance regression.

An improvement idea is to use two strategies for updating the
max-heap, depending on the number of transactions. That is, if the
number of transactions being decoded is small, we add a transaction to
the max-heap with binaryheap_add_unordered(), which is O(1), and
heapify it just before picking the largest transaction, which is O(n).
That way, we minimize the overhead of updating the memory counter.
Once the number of transactions being decoded exceeds a threshold, say
1024, we switch to another strategy: we call binaryheap_update_up/down()
when updating the memory counter to preserve the heap property, which
is O(log n), and pick the largest transaction in O(1). This strategy
minimizes the cost of picking the largest transaction, at the price of
paying some cost to update the memory counters.
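
Here is a rough, self-contained sketch of this adaptive switch. It is
illustrative only; LAZY_THRESHOLD, TxnHeap, and the function names are
placeholders made up for this example and do not correspond to the
patches:

#include <stdbool.h>
#include <stdio.h>

#define LAZY_THRESHOLD 1024     /* switch point, the "1024" mentioned above */
#define MAX_TXNS 8192

typedef struct
{
    long    key[MAX_TXNS];      /* per-transaction memory usage */
    int     nentries;
    bool    heap_valid;         /* false while running the lazy strategy */
} TxnHeap;

static void heap_swap(TxnHeap *h, int a, int b)
{
    long    t = h->key[a];
    h->key[a] = h->key[b];
    h->key[b] = t;
}

static void sift_up(TxnHeap *h, int i)
{
    while (i > 0 && h->key[(i - 1) / 2] < h->key[i])
    {
        heap_swap(h, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static void sift_down(TxnHeap *h, int i)
{
    for (;;)
    {
        int     largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < h->nentries && h->key[l] > h->key[largest])
            largest = l;
        if (r < h->nentries && h->key[r] > h->key[largest])
            largest = r;
        if (largest == i)
            break;
        heap_swap(h, i, largest);
        i = largest;
    }
}

/* O(n) bottom-up heapify: the lazy strategy pays this only at pick time */
static void heap_build(TxnHeap *h)
{
    for (int i = h->nentries / 2 - 1; i >= 0; i--)
        sift_down(h, i);
    h->heap_valid = true;
}

/*
 * Called on every memory counter change.  'slot' is the entry's current
 * position, which the real code would take from a position-tracking hash
 * table as sketched earlier.
 */
static void update_mem_counter(TxnHeap *h, int slot, long newsize)
{
    long    oldsize = h->key[slot];
    h->key[slot] = newsize;
    if (h->nentries < LAZY_THRESHOLD)
    {
        h->heap_valid = false;          /* O(1): fix the heap up at pick time */
        return;
    }
    if (!h->heap_valid)
        heap_build(h);                  /* one-time cost when switching modes */
    if (newsize > oldsize)              /* O(log n): keep the heap property */
        sift_up(h, slot);
    else
        sift_down(h, slot);
}

/* Called only when we must evict: returns the slot of the largest txn */
static int pick_largest(TxnHeap *h)
{
    if (!h->heap_valid)
        heap_build(h);                  /* O(n), but on a rare code path */
    return 0;                           /* the root holds the maximum */
}

int main(void)
{
    TxnHeap h = {.nentries = 3, .heap_valid = false};
    h.key[0] = 100;
    h.key[1] = 300;
    h.key[2] = 200;
    update_mem_counter(&h, 2, 400);     /* lazy path: just mark the heap dirty */
    printf("largest = %ld\n", h.key[pick_largest(&h)]);     /* prints 400 */
    return 0;
}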

I've experimented with this idea and run the same tests:

* HEAD + new patches (0001 - 0003)
5277.524 ms

The number looks good. I've attached these patches. Feedback is very welcome.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v2-0001-Make-binaryheap-enlareable.patch application/octet-stream 2.8 KB
v2-0003-Improve-transaction-eviction-algorithm-in-Reorder.patch application/octet-stream 10.3 KB
v2-0002-Add-functions-for-updating-keys-and-removing-node.patch application/octet-stream 15.7 KB
