Re: Improve eviction algorithm in ReorderBuffer

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Shubham Khanna <khannashubham1197(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Improve eviction algorithm in ReorderBuffer
Date: 2024-02-06 06:05:22
Message-ID: CAD21AoA7L2Vwn3=3wMQxGrkisHfB7XK8wjh9x-rD7Q7Vms16LA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 2, 2024 at 5:16 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Fri, Feb 2, 2024 at 1:59 PM Shubham Khanna
> <khannashubham1197(at)gmail(dot)com> wrote:
> >
> > On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > > > >
> > > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > The individual transactions shouldn't cross
> > > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> > > > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
> > > > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > > > transactions consuming 5% >= of the memory limit? I got the impression
> > > > > > > > that you are saying to maintain them in a separate transaction list
> > > > > > > > which doesn't seems to be the case.
> > > > > > >
> > > > > > > I wanted to mean that there are three lists in total: the first one
> > > > > > > maintain the transactions consuming more than 10% of
> > > > > > > logical_decoding_work_mem,
> > > > > > >
> > > > > >
> > > > > > How can we have multiple transactions in the list consuming more than
> > > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > > > before any xact reaches logical_decoding_work_mem?
> > > > >
> > > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > > > > consuming more than 6.4MB are added to the list. So for example, it's
> > > > > possible that the list has three transactions each of which are
> > > > > consuming 10MB while the total memory usage in the reorderbuffer is
> > > > > still 30MB (less than logical_decoding_work_mem).
> > > > >
> > > >
> > > > Thanks for the clarification. I misunderstood the list to have
> > > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > > > thing to note is that maintaining these lists by default can also have
> > > > some overhead unless the list of open transactions crosses a certain
> > > > threshold.
> > > >
> > >
> > > On further analysis, I realized that the approach discussed here might
> > > not be the way to go. The idea of dividing transactions into several
> > > subgroups is to divide a large number of entries into multiple
> > > sub-groups so we can reduce the complexity to search for the
> > > particular entry. Since we assume that there are no big differences in
> > > entries' sizes within a sub-group, we can pick the entry to evict in
> > > O(1). However, what we really need to avoid here is that we end up
> > > increasing the number of times to evict entries because serializing an
> > > entry to the disk is more costly than searching an entry on memory in
> > > general.
> > >
> > > I think that it's no problem in a large-entries subgroup but when it
> > > comes to the smallest-entries subgroup, like for entries consuming
> > > less than 5% of the limit, it could end up evicting many entries. For
> > > example, there would be a huge difference between serializing 1 entry
> > > consuming 5% of the memory limit and serializing 5000 entries
> > > consuming 0.001% of the memory limit. Even if we can select 5000
> > > entries quickly, I think the latter would be slower in total. The more
> > > subgroups we create, the more the algorithm gets complex and the
> > > overheads could cause. So I think we need to search for the largest
> > > entry in order to minimize the number of evictions anyway.
> > >
> > > Looking for data structures and algorithms, I think binaryheap with
> > > some improvements could be promising. I mentioned before why we cannot
> > > use the current binaryheap[1]. The missing pieces are efficient ways
> > > to remove the arbitrary entry and to update the arbitrary entry's key.
> > > The current binaryheap provides binaryheap_remove_node(), which is
> > > O(log n), but it requires the entry's position in the binaryheap. We
> > > can know the entry's position just after binaryheap_add_unordered()
> > > but it might be changed after heapify. Searching the node's position
> > > is O(n). So the improvement idea is to add a hash table to the
> > > binaryheap so that it can track the positions for each entry so that
> > > we can remove the arbitrary entry in O(log n) and also update the
> > > arbitrary entry's key in O(log n). This is known as the indexed
> > > priority queue. I've attached the patch for that (0001 and 0002).
> > >
> > > That way, in terms of reorderbuffer, we can update and remove the
> > > transaction's memory usage in O(log n) (in worst case and O(1) in
> > > average) and then pick the largest transaction in O(1). Since we might
> > > need to call ReorderBufferSerializeTXN() even in non-streaming case,
> > > we need to maintain the binaryheap anyway. I've attached the patch for
> > > that (0003).
> > >
> > > Here are test script for many sub-transactions case:
> > >
> > > create table test (c int);
> > > create or replace function testfn (cnt int) returns void as $$
> > > begin
> > > for i in 1..cnt loop
> > > begin
> > > insert into test values (i);
> > > exception when division_by_zero then
> > > raise notice 'caught error';
> > > return;
> > > end;
> > > end loop;
> > > end;
> > > $$
> > > language plpgsql;
> > > select pg_create_logical_replication_slot('s', 'test_decoding');
> > > select testfn(50000);
> > > set logical_decoding_work_mem to '4MB';
> > > select count(*) from pg_logical_slot_peek_changes('s', null, null)";
> > >
> > > and here are results:
> > >
> > > * HEAD: 16877.281 ms
> > > * HEAD w/ patches (0001 and 0002): 655.154 ms
> > >
> > > There is huge improvement in a many-subtransactions case.
> >
> > I have run the same test and found around 12.53x improvement(the
> > median of five executions):
> > HEAD | HEAD+ v2-0001+ v2-0002 + v2-0003 patch
> > 29197ms | 2329ms
> >
> > I had also run the regression test that you had shared at [1], there
> > was a very very slight dip in this case around it takes around 0.31x
> > more time:
> > HEAD | HEAD + v2-0001+ v2-0002 + v2-0003 patch
> > 4459ms | 4473ms
>
> Thank you for doing a benchmark test with the latest patches!
>
> I'm going to submit the new version patches next week.
>

I've attached the new version patch set.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v3-0003-Improve-transaction-eviction-algorithm-in-Reorder.patch application/octet-stream 10.8 KB
v3-0002-Add-functions-for-updating-keys-and-removing-node.patch application/octet-stream 15.4 KB
v3-0001-Make-binaryheap-enlareable.patch application/octet-stream 2.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2024-02-06 06:11:05 Re: Make COPY format extendable: Extract COPY TO format implementations
Previous Message Bharath Rupireddy 2024-02-06 05:57:45 Re: On login trigger: take three