Re: Improve eviction algorithm in ReorderBuffer

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Improve eviction algorithm in ReorderBuffer
Date: 2024-01-26 08:36:36
Message-ID: CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > >
> > > > > The individual transactions shouldn't cross
> > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
> > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > transactions consuming 5% >= of the memory limit? I got the impression
> > > > > that you are saying to maintain them in a separate transaction list
> > > > which doesn't seem to be the case.
> > > >
> > > > I meant that there are three lists in total: the first one
> > > > maintains the transactions consuming more than 10% of
> > > > logical_decoding_work_mem,
> > > >
> > >
> > > How can we have multiple transactions in the list consuming more than
> > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > before any xact reaches logical_decoding_work_mem?
> >
> > Well, suppose logical_decoding_work_mem is set to 64MB; transactions
> > consuming more than 6.4MB are added to the list. So, for example, it's
> > possible that the list has three transactions, each consuming 10MB,
> > while the total memory usage in the reorderbuffer is still 30MB (less
> > than logical_decoding_work_mem).
> >
>
> Thanks for the clarification. I misunderstood the list to have
> transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> thing to note is that maintaining these lists by default can also have
> some overhead unless the number of open transactions crosses a certain
> threshold.
>

On further analysis, I realized that the approach discussed here might
not be the way to go. The point of dividing transactions into several
sub-groups is to split a large number of entries so that we can reduce
the complexity of searching for a particular entry. Since we assume
that entry sizes do not differ much within a sub-group, we can pick an
entry to evict in O(1). However, what we really need to avoid is
increasing the number of evictions, because serializing an entry to
disk is generally more costly than searching for an entry in memory.

I think that is no problem for a large-entries subgroup, but when it
comes to the smallest-entries subgroup, say entries consuming less
than 5% of the limit, it could end up evicting many entries. For
example, there would be a huge difference between serializing 1 entry
consuming 5% of the memory limit and serializing 5000 entries each
consuming 0.001% of the memory limit: the total amount of data is the
same, but the latter needs 5000 separate serializations. Even if we
can select those 5000 entries quickly, I think the latter would be
slower in total. Also, the more subgroups we create, the more complex
the algorithm gets and the more overhead it could cause. So I think we
need to search for the largest entry in order to minimize the number
of evictions anyway.

Looking at data structures and algorithms, I think a binaryheap with
some improvements could be promising. I mentioned before why we cannot
use the current binaryheap[1]. The missing pieces are efficient ways
to remove an arbitrary entry and to update an arbitrary entry's key.
The current binaryheap provides binaryheap_remove_node(), which is
O(log n), but it requires the entry's position in the binaryheap. We
know the entry's position just after binaryheap_add_unordered(), but
it might change after heapify, and searching for the node's position
is O(n). So the improvement idea is to add a hash table to the
binaryheap that tracks the position of each entry, so that we can
remove an arbitrary entry in O(log n) and also update an arbitrary
entry's key in O(log n). This is known as an indexed priority queue.
I've attached the patches for that (0001 and 0002).
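
To illustrate the idea, here is a minimal, self-contained sketch. It
is not the attached patches: the fixed-size arrays, the int-keyed
pos[] map, and all names are made up for illustration, whereas the
0002 patch adds a hash table keyed by the stored value to the
existing binaryheap. The only extra work compared to a plain binary
heap is keeping the position map in sync whenever elements move:

#include <stdio.h>
#include <stdlib.h>

#define MAX_TXNS 1024

typedef struct
{
    int     txn_id;             /* identifies the element */
    size_t  size;               /* heap key: memory usage */
} HeapNode;

typedef struct
{
    HeapNode nodes[MAX_TXNS];   /* max-heap ordered by size */
    int      nnodes;
    int      pos[MAX_TXNS];     /* txn_id -> index in nodes[], -1 if absent */
} IndexedHeap;

static void
swap_nodes(IndexedHeap *h, int i, int j)
{
    HeapNode tmp = h->nodes[i];

    h->nodes[i] = h->nodes[j];
    h->nodes[j] = tmp;

    /* keeping the position map in sync is the essential extra step */
    h->pos[h->nodes[i].txn_id] = i;
    h->pos[h->nodes[j].txn_id] = j;
}

static void
sift_up(IndexedHeap *h, int i)
{
    while (i > 0 && h->nodes[(i - 1) / 2].size < h->nodes[i].size)
    {
        swap_nodes(h, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static void
sift_down(IndexedHeap *h, int i)
{
    for (;;)
    {
        int     largest = i;
        int     l = 2 * i + 1;
        int     r = 2 * i + 2;

        if (l < h->nnodes && h->nodes[l].size > h->nodes[largest].size)
            largest = l;
        if (r < h->nnodes && h->nodes[r].size > h->nodes[largest].size)
            largest = r;
        if (largest == i)
            break;
        swap_nodes(h, i, largest);
        i = largest;
    }
}

static void
heap_insert(IndexedHeap *h, int txn_id, size_t size)
{
    int     i = h->nnodes++;

    h->nodes[i].txn_id = txn_id;
    h->nodes[i].size = size;
    h->pos[txn_id] = i;
    sift_up(h, i);
}

/* O(log n): the position map tells us where txn_id currently lives */
static void
heap_update_key(IndexedHeap *h, int txn_id, size_t new_size)
{
    int     i = h->pos[txn_id];

    h->nodes[i].size = new_size;
    sift_up(h, i);                  /* if the key grew */
    sift_down(h, h->pos[txn_id]);   /* if the key shrank */
}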

That way, in terms of reorderbuffer, we can update and remove a
transaction's memory usage in O(log n) (worst case; O(1) on average)
and then pick the largest transaction in O(1). Since we might need to
call ReorderBufferSerializeTXN() even in the non-streaming case, we
need to maintain the binaryheap anyway. I've attached the patch for
that (0003).
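
Continuing the toy sketch above (again just an illustration; the real
code works with ReorderBufferTXN pointers and the binaryheap API), the
reorderbuffer-side usage would look roughly like this: each change
adjusts a transaction's key in O(log n), and the eviction victim is
read from the heap root in O(1).

int
main(void)
{
    IndexedHeap h = {.nnodes = 0};

    for (int i = 0; i < MAX_TXNS; i++)
        h.pos[i] = -1;

    /* three open transactions with different accumulated sizes */
    heap_insert(&h, 1, 100);
    heap_insert(&h, 2, 4000);
    heap_insert(&h, 3, 250);

    /* txn 1 receives a large change: adjust its key in O(log n) */
    heap_update_key(&h, 1, 100 + 8192);

    /* memory limit exceeded: the largest transaction is at the root */
    printf("evict txn %d (size %zu)\n", h.nodes[0].txn_id, h.nodes[0].size);

    return 0;
}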

Here is a test script for the many-sub-transactions case:

create table test (c int);

-- each loop iteration enters a block with an EXCEPTION clause, so every
-- INSERT runs in its own subtransaction
create or replace function testfn (cnt int) returns void as $$
begin
  for i in 1..cnt loop
    begin
      insert into test values (i);
    exception when division_by_zero then
      raise notice 'caught error';
      return;
    end;
  end loop;
end;
$$
language plpgsql;

select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(50000);
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);

and here are the results:

* HEAD: 16877.281 ms
* HEAD w/ patches (0001 and 0002): 655.154 ms

There is a huge improvement in the many-subtransactions case.

Finally, we need to note that memory counter updates could happen
frequently, as we update the counter for each change. So even though
we update the binaryheap in O(log n), it could be a big overhead if it
happens very often. One idea is to batch the memory counter updates
where possible. I've attached the patch for that (0004). I'll
benchmark the overhead for normal cases.
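
For what it's worth, here is a rough sketch of what such batching
could look like, continuing the toy heap above. The slack threshold,
the struct, and the function name are assumptions for illustration,
not necessarily what the 0004 patch does: small per-change deltas are
accumulated and pushed into the heap only once they exceed the slack,
so the common path avoids the O(log n) update (only additions are
sketched here).

#define MEM_UPDATE_SLACK    (64 * 1024)     /* illustrative threshold */

typedef struct
{
    int     txn_id;
    size_t  heap_size;      /* size currently reflected in the heap */
    size_t  pending_delta;  /* accumulated but not yet pushed */
} TxnAccounting;

static void
txn_mem_update_batched(IndexedHeap *h, TxnAccounting *acct, size_t delta)
{
    acct->pending_delta += delta;

    /* cheap path: defer the O(log n) heap update until it matters */
    if (acct->pending_delta < MEM_UPDATE_SLACK)
        return;

    acct->heap_size += acct->pending_delta;
    acct->pending_delta = 0;
    heap_update_key(h, acct->txn_id, acct->heap_size);
}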

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v1-0001-Make-binaryheap-enlareable.patch application/octet-stream 2.8 KB
v1-0002-Add-functions-for-updating-keys-and-removing-node.patch application/octet-stream 9.6 KB
v1-0003-Improve-transaction-eviction-algorithm-in-Reorder.patch application/octet-stream 5.3 KB
v1-0004-Batch-memory-counter-updates-in-ReorderBuffer.patch application/octet-stream 11.5 KB
