Re: Improve eviction algorithm in ReorderBuffer

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Improve eviction algorithm in ReorderBuffer
Date: 2024-02-02 07:53:37
Message-ID: CAD21AoCPkT4UL7EcsCHwutQu-2E8KBx6H6fP_2cOqpTb=RHWHw@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Wed, Jan 31, 2024 at 5:32 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Tue, 30 Jan 2024 at 13:37, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > > > >
> > > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > The individual transactions shouldn't cross
> > > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> > > > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
> > > > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > > > transactions consuming 5% >= of the memory limit? I got the impression
> > > > > > > > that you are saying to maintain them in a separate transaction list
> > > > > > > > which doesn't seem to be the case.
> > > > > > >
> > > > > > > I meant that there are three lists in total: the first one
> > > > > > > maintains the transactions consuming more than 10% of
> > > > > > > logical_decoding_work_mem,
> > > > > > >
> > > > > >
> > > > > > How can we have multiple transactions in the list consuming more than
> > > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > > > before any xact reaches logical_decoding_work_mem?
> > > > >
> > > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > > > > consuming more than 6.4MB are added to the list. So for example, it's
> > > > > possible that the list has three transactions, each consuming
> > > > > 10MB, while the total memory usage in the reorderbuffer is
> > > > > still 30MB (less than logical_decoding_work_mem).
> > > > >
> > > >
> > > > Thanks for the clarification. I misunderstood the list to have
> > > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > > > thing to note is that maintaining these lists by default can also have
> > > > some overhead unless the list of open transactions crosses a certain
> > > > threshold.
> > > >
> > >
> > > On further analysis, I realized that the approach discussed here might
> > > not be the way to go. The idea of dividing transactions into several
> > > sub-groups is to split a large number of entries into multiple
> > > sub-groups so that we can reduce the complexity of searching for a
> > > particular entry. Since we assume that there are no big differences in
> > > entries' sizes within a sub-group, we can pick the entry to evict in
> > > O(1). However, what we really need to avoid here is ending up
> > > increasing the number of evictions, because serializing an entry to
> > > disk is generally more costly than searching for an entry in memory.
> > >
> > > I think that's not a problem for a large-entries subgroup, but for
> > > the smallest-entries subgroup, say entries consuming less than 5% of
> > > the limit, it could end up evicting many entries. For example, there
> > > would be a huge difference between serializing 1 entry consuming 5%
> > > of the memory limit and serializing 5000 entries consuming 0.001% of
> > > the memory limit. Even if we can select the 5000 entries quickly, I
> > > think the latter would be slower in total. Also, the more subgroups
> > > we create, the more complex the algorithm gets and the more overhead
> > > it could cause. So I think we need to search for the largest entry in
> > > order to minimize the number of evictions anyway.
> > >
> > > Looking into data structures and algorithms, I think a binaryheap with
> > > some improvements could be promising. I mentioned before why we cannot
> > > use the current binaryheap[1]. The missing pieces are efficient ways
> > > to remove an arbitrary entry and to update an arbitrary entry's key.
> > > The current binaryheap provides binaryheap_remove_node(), which is
> > > O(log n), but it requires the entry's position in the binaryheap. We
> > > know the entry's position just after binaryheap_add_unordered(), but
> > > it might change after heapify, and searching for the node's position
> > > is O(n). So the improvement idea is to add a hash table to the
> > > binaryheap that tracks each entry's position, so that we can remove
> > > an arbitrary entry and update an arbitrary entry's key in O(log n).
> > > This is known as an indexed priority queue. I've attached patches for
> > > that (0001 and 0002).
> > >
> > > That way, in terms of the reorderbuffer, we can update and remove a
> > > transaction's memory usage in O(log n) (worst case; O(1) on average)
> > > and then pick the largest transaction in O(1). Since we might need to
> > > call ReorderBufferSerializeTXN() even in the non-streaming case, we
> > > need to maintain the binaryheap anyway.
> >
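(To illustrate the indexed binary heap idea quoted above, here is a minimal,
self-contained sketch. It is not the attached patch: the names are made up,
and it uses a plain array as the position index where the patches use a hash
table.)

/*
 * Minimal sketch of an "indexed" max-heap: alongside the heap array we
 * track each entry's current position, so updating an arbitrary entry's
 * key (and hence removing it) is O(log n) instead of O(n).
 */
#include <stdio.h>

#define MAX_ITEMS 16

typedef struct HeapItem
{
    int     id;     /* stable identity of the entry */
    long    size;   /* key we order by (think txn->size) */
} HeapItem;

static HeapItem heap[MAX_ITEMS];
static int  pos[MAX_ITEMS];     /* pos[id] = index of id in heap[] */
static int  nitems = 0;

static void
swap_nodes(int a, int b)
{
    HeapItem    tmp = heap[a];

    heap[a] = heap[b];
    heap[b] = tmp;
    pos[heap[a].id] = a;
    pos[heap[b].id] = b;
}

static void
sift_up(int i)
{
    while (i > 0 && heap[(i - 1) / 2].size < heap[i].size)
    {
        swap_nodes(i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static void
sift_down(int i)
{
    for (;;)
    {
        int     largest = i;
        int     l = 2 * i + 1;
        int     r = 2 * i + 2;

        if (l < nitems && heap[l].size > heap[largest].size)
            largest = l;
        if (r < nitems && heap[r].size > heap[largest].size)
            largest = r;
        if (largest == i)
            break;
        swap_nodes(i, largest);
        i = largest;
    }
}

static void
heap_add(int id, long size)
{
    heap[nitems].id = id;
    heap[nitems].size = size;
    pos[id] = nitems;
    sift_up(nitems);
    nitems++;
}

/* Update an arbitrary entry's key in O(log n) thanks to the position index. */
static void
heap_update(int id, long newsize)
{
    int     i = pos[id];
    long    oldsize = heap[i].size;

    heap[i].size = newsize;
    if (newsize > oldsize)
        sift_up(i);
    else
        sift_down(i);
}

int
main(void)
{
    heap_add(0, 100);
    heap_add(1, 300);
    heap_add(2, 200);
    heap_update(1, 50);         /* demote the former largest entry */
    printf("largest: id=%d size=%ld\n", heap[0].id, heap[0].size);
    return 0;
}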
> > Since updating the max-heap on every memory counter update could lead
> > to some regression when the number of transactions being decoded is
> > small, I've measured a case where the memory counter is updated
> > frequently:
> >
> > setup script:
> > create table test (c int);
> > select pg_create_logical_replication_slot('s', 'test_decoding');
> > insert into test select generate_series(1, 8000000);
> >
> > benchmark script:
> > set work_mem to '3GB';
> > set logical_decoding_work_mem to '5GB';
> > select count(*) from pg_logical_slot_peek_changes('s', null, null);
> >
> > Here are the results (median of five executions):
> >
> > * HEAD
> > 5274.765 ms
> >
> > * HEAD + 0001-0003 patch
> > 5532.203 ms
> >
> > There was an approximately 5% performance regression.
> >
> > An improvement idea is to use two strategies for updating the
> > max-heap depending on the number of transactions. That is, if the
> > number of transactions being decoded is small, we add a transaction to
> > the max-heap with binaryheap_add_unordered(), which is O(1), and
> > heapify it just before picking the largest transaction, which is O(n).
> > That way, we can minimize the overhead of updating the memory counter.
> > Once the number of transactions being decoded exceeds a threshold, say
> > 1024, we switch to another strategy: we call binaryheap_update_up/down()
> > when updating the memory counter to preserve the heap property, which
> > is O(log n), and pick the largest transaction in O(1). This strategy
> > minimizes the cost of picking the largest transaction at the expense
> > of paying some cost to update the memory counters.
> >
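(For illustration, a rough sketch of the two code paths described above. The
wrapper function names are hypothetical; memtrack_state and txn_heap follow
the attached patches, and the binaryheap_* calls are the existing binaryheap
API plus the patches' additions.)

/* Hypothetical sketch: adding a transaction to the max-heap. */
static void
TrackTXNInMaxHeap(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
        binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn)); /* O(1) */
    else
        binaryheap_add(rb->txn_heap, PointerGetDatum(txn));  /* O(log n) */
}

/* Hypothetical sketch: picking the largest transaction for eviction. */
static ReorderBufferTXN *
PickLargestTXN(ReorderBuffer *rb)
{
    /* With the lazy strategy, heapify only now, which is O(n). */
    if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
        binaryheap_build(rb->txn_heap);

    /* In the MAINTAIN_MAXHEAP state the heap property is already kept. */
    return (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));
}

The memory-counter update path (binaryheap_update_up() in the
MAINTAIN_MAXHEAP state) is visible in the hunk quoted further below.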
> > I've experimented with this idea and run the same tests:
> >
> > * HEAD + new patches (0001 - 0003)
> > 5277.524 ms
> >
> > The number looks good. I've attached these patches. Feedback is very welcome.
>
> Few comments:

Thank you for the review comments!

> 1) Here we are changing memtrack_state to
> REORDER_BUFFER_MEM_TRACK_NORMAL immediately once the size is less than
> REORDE_BUFFER_MEM_TRACK_THRESHOLD. In this scenario we will be
> building the heap many times if there are transactions getting added
> and removed. How about we wait for txn_heap to become less than 95% of
> REORDE_BUFFER_MEM_TRACK_THRESHOLD to avoid building the heap many
> times in this scenario.

But until the next time we call ReorderBufferLargestTXN(), we will have
decoded some changes and added or removed transactions, which modifies
the transactions' sizes. Is it okay to keep doing the memory counter
updates in O(log n) during that period? I guess your idea works well in
cases where only a few transactions' memory counters are updated before
the next ReorderBufferLargestTXN() call.

I also realized that the state could never switch back to NORMAL in
cases where the number of transactions drops below the threshold but
the total memory usage doesn't exceed the limit. I'll fix it.
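(A hypothetical sketch of where such a fallback could live, for example when
a transaction is removed from the max-heap; the actual fix may end up looking
different. The helper name is made up, and the threshold constant is spelled
as in your review comment quoted above.)

static void
ReorderBufferMaybeFallBackToNormal(ReorderBuffer *rb)   /* hypothetical name */
{
    /* Switch back regardless of whether the memory limit has been hit. */
    if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP &&
        rb->txn_heap->bh_size < REORDE_BUFFER_MEM_TRACK_THRESHOLD)
        rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NORMAL;
}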

>
> 2) I felt init variable is not needed, we can directly check txn->size
> instead like it is done in the else case:
> + bool init = (txn->size == 0);
> +
> txn->size += sz;
> rb->size += sz;
>
> /* Update the total size in the top transaction. */
> toptxn->total_size += sz;
> +
> + /* Update the transaction in the max-heap */
> + if (init)
> + {
> + /* Add the transaction to the max-heap */
> + if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
> + binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
> + else
> + binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
> + }
> + else if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
> + {
> + /*
> + * If we're maintaining max-heap even while updating the memory counter,
> + * we reflect the updates to the max-heap.
> + */
> + binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
> + }

Okay, we can replace it with "(txn->size - sz) == 0".
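For illustration, the addition path of the quoted hunk would then look
roughly like this (a reconstructed sketch based on the quote above, not the
final patch):

    txn->size += sz;
    rb->size += sz;

    /* Update the total size in the top transaction. */
    toptxn->total_size += sz;

    /* Add the transaction to the max-heap on its first counted change. */
    if ((txn->size - sz) == 0)
    {
        if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
            binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
        else
            binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
    }
    else if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
    {
        /*
         * If we're maintaining the max-heap even while updating the memory
         * counter, reflect the update in the max-heap.
         */
        binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
    }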

>
> 3) we can add some comments for this:
> +typedef enum ReorderBufferMemTrackState
> +{
> + REORDER_BUFFER_MEM_TRACK_NORMAL,
> + REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
> +} ReorderBufferMemTrackState;
> +
>
> 4) This should be added to typedefs.list:
> +typedef enum ReorderBufferMemTrackState
> +{
> + REORDER_BUFFER_MEM_TRACK_NORMAL,
> + REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
> +} ReorderBufferMemTrackState;
> +

Will add them.

>
> 5)Few typos:
> 5.a) enlareable should be enlargeable
> [PATCH v2 1/4] Make binaryheap enlareable.
>
> 5.b) subtranasctions should be subtransactions:
> On the other hand, when the number of transactions being decoded is
> fairly large, such as when a transaction has many subtranasctions,
>
> 5.c) evaludate should be evaluate:
> XXX: updating the transaction's memory counter and the max-heap is now
> O(log n), so we need to evaludate it. If there are some regression, we
>
> 5.d) pudate should be update:
> It could be too expensive to pudate max-heap while preserving the heap
> property each time the transaction's memory counter is updated, as it
>
> 5.e) frquently should be frequently:
> could happen very frquently. So when the number of transactions being
> decoded is small, we add the transactions to max-heap but don't
>

Thanks, will fix them.

> 6) This should be added to typedefs.list:
> +/*
> + * Struct for A hash table element to store the node's index in the bh_nodes
> + * array.
> + */
> +typedef struct bh_nodeidx_entry
> +{
> + bh_node_type key;
> + char status;
> + int idx;
> +} bh_nodeidx_entry;

Right, will add it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
