Re: Improve eviction algorithm in ReorderBuffer

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Improve eviction algorithm in ReorderBuffer
Date: 2024-01-31 08:31:39
Message-ID: CALDaNm107ed2XWz+2uAKd8prExKZEn1-hWa3n_P4ZCZvC7zs+A@mail.gmail.com
Lists: pgsql-hackers

On Tue, 30 Jan 2024 at 13:37, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > The individual transactions shouldn't cross
> > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> > > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
> > > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > > transactions consuming 5% >= of the memory limit? I got the impression
> > > > > > > that you are saying to maintain them in a separate transaction list
> > > > > > > which doesn't seem to be the case.
> > > > > >
> > > > > > I meant that there are three lists in total: the first one
> > > > > > maintains the transactions consuming more than 10% of
> > > > > > logical_decoding_work_mem,
> > > > > >
> > > > >
> > > > > How can we have multiple transactions in the list consuming more than
> > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > > before any xact reaches logical_decoding_work_mem?
> > > >
> > > > Well, suppose logical_decoding_work_mem is set to 64MB; transactions
> > > > consuming more than 6.4MB are added to the list. So, for example, it's
> > > > possible that the list has three transactions, each consuming 10MB,
> > > > while the total memory usage in the reorderbuffer is still 30MB (less
> > > > than logical_decoding_work_mem).
> > > >
> > >
> > > Thanks for the clarification. I misunderstood the list to have
> > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > > thing to note is that maintaining these lists by default can also have
> > > some overhead unless the list of open transactions crosses a certain
> > > threshold.
> > >
> >
> > On further analysis, I realized that the approach discussed here might
> > not be the way to go. The idea of dividing transactions into several
> > subgroups is to split a large number of entries into multiple
> > sub-groups so we can reduce the complexity of searching for a
> > particular entry. Since we assume that there are no big differences in
> > entries' sizes within a sub-group, we can pick the entry to evict in
> > O(1). However, what we really need to avoid here is increasing the
> > number of evictions, because serializing an entry to disk is generally
> > more costly than searching for an entry in memory.
> >
> > I think that's not a problem for a large-entries subgroup, but when it
> > comes to the smallest-entries subgroup, such as entries consuming
> > less than 5% of the limit, it could end up evicting many entries. For
> > example, there would be a huge difference between serializing 1 entry
> > consuming 5% of the memory limit and serializing 5000 entries
> > consuming 0.001% of the memory limit. Even if we can select 5000
> > entries quickly, I think the latter would be slower in total. The more
> > subgroups we create, the more complex the algorithm gets and the more
> > overhead it could cause. So I think we need to search for the largest
> > entry in order to minimize the number of evictions anyway.
> >
> > Looking at data structures and algorithms, I think binaryheap with
> > some improvements could be promising. I mentioned before why we cannot
> > use the current binaryheap[1]. The missing pieces are efficient ways
> > to remove an arbitrary entry and to update an arbitrary entry's key.
> > The current binaryheap provides binaryheap_remove_node(), which is
> > O(log n), but it requires the entry's position in the binaryheap. We
> > know the entry's position just after binaryheap_add_unordered(), but
> > it might change after heapify. Searching for the node's position is
> > O(n). So the improvement idea is to add a hash table to the binaryheap
> > that tracks the position of each entry, so that we can remove an
> > arbitrary entry and update an arbitrary entry's key in O(log n). This
> > is known as an indexed priority queue. I've attached the patches for
> > that (0001 and 0002).
> >
> > That way, in terms of the reorderbuffer, we can update and remove a
> > transaction's memory usage in O(log n) (worst case; O(1) on average)
> > and then pick the largest transaction in O(1). Since we might need to
> > call ReorderBufferSerializeTXN() even in the non-streaming case, we
> > need to maintain the binaryheap anyway.
>
> Since updating the max-heap on each memory counter update could lead to
> some regression when the number of transactions being decoded is small,
> I've measured a case where memory counter updates happen frequently:
>
> setup script:
> create table test (c int);
> select pg_create_logical_replication_slot('s', 'test_decoding');
> insert into test select generate_series(1, 8000000);
>
> benchmark script:
> set work_mem to '3GB';
> set logical_decoding_work_mem to '5GB';
> select count(*) from pg_logical_slot_peek_changes('s', null, null);
>
> Here are the results (the median of five executions):
>
> * HEAD
> 5274.765 ms
>
> * HEAD + 0001-0003 patch
> 5532.203 ms
>
> There was an approximately 5% performance regression.
>
> An improvement idea is to use two strategies for updating the max-heap,
> depending on the number of transactions. That is, if the number of
> transactions being decoded is small, we add a transaction to the
> max-heap with binaryheap_add_unordered(), which is O(1), and heapify it
> just before picking the largest transaction, which is O(n). That way,
> we can minimize the overhead of updating the memory counter. Once the
> number of transactions being decoded exceeds a threshold, say 1024, we
> switch to another strategy: we call binaryheap_update_up/down() when
> updating the memory counter to preserve the heap property, which is
> O(log n), and pick the largest transaction in O(1). This strategy
> minimizes the cost of picking the largest transaction at the expense of
> some extra cost when updating the memory counters.
>
> I've experimented with this idea and ran the same tests:
>
> * HEAD + new patches (0001 - 0003)
> 5277.524 ms
>
> The number looks good. I've attached these patches. Feedback is very welcome.
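
To double-check my understanding of the indexed max-heap idea, here is a
rough standalone sketch. In the patches the node positions are tracked in a
hash table attached to the binaryheap (so the heap code stays generic); in
this sketch the position is stored in the element itself only to keep it
short, and all names here are made up rather than taken from the patches:

/*
 * Toy sketch of an "indexed" binary max-heap keyed by a memory counter.
 * The real patches keep the node->position mapping in a hash table inside
 * binaryheap; here it lives in the element itself for brevity.
 */
#define TOY_HEAP_MAX	1024

typedef struct ToyTxn
{
	long	size;		/* heap key: memory used by the transaction */
	int		heappos;	/* current index in ToyHeap.nodes, or -1 */
} ToyTxn;

typedef struct ToyHeap
{
	ToyTxn *nodes[TOY_HEAP_MAX];
	int		nnodes;
} ToyHeap;

static void
toy_swap(ToyHeap *h, int a, int b)
{
	ToyTxn *tmp = h->nodes[a];

	h->nodes[a] = h->nodes[b];
	h->nodes[b] = tmp;
	h->nodes[a]->heappos = a;
	h->nodes[b]->heappos = b;
}

/*
 * Restore the heap property around 'pos' in O(log n).  The patches split
 * this into binaryheap_update_up() and binaryheap_update_down(); both
 * directions are combined here.
 */
static void
toy_sift(ToyHeap *h, int pos)
{
	/* sift up while larger than the parent ... */
	while (pos > 0 && h->nodes[(pos - 1) / 2]->size < h->nodes[pos]->size)
	{
		toy_swap(h, (pos - 1) / 2, pos);
		pos = (pos - 1) / 2;
	}

	/* ... then sift down while a child is larger */
	for (;;)
	{
		int		largest = pos;
		int		l = 2 * pos + 1;
		int		r = 2 * pos + 2;

		if (l < h->nnodes && h->nodes[l]->size > h->nodes[largest]->size)
			largest = l;
		if (r < h->nnodes && h->nodes[r]->size > h->nodes[largest]->size)
			largest = r;
		if (largest == pos)
			break;
		toy_swap(h, pos, largest);
		pos = largest;
	}
}

/* Add a transaction while keeping the heap property (no overflow check). */
static void
toy_add(ToyHeap *h, ToyTxn *txn)
{
	h->nodes[h->nnodes] = txn;
	txn->heappos = h->nnodes++;
	toy_sift(h, txn->heappos);
}

/* The transaction's memory counter changed: fix its position in O(log n). */
static void
toy_update(ToyHeap *h, ToyTxn *txn)
{
	toy_sift(h, txn->heappos);
}

/* Remove an arbitrary transaction in O(log n) using its tracked position. */
static void
toy_remove(ToyHeap *h, ToyTxn *txn)
{
	int		pos = txn->heappos;

	h->nodes[pos] = h->nodes[--h->nnodes];
	h->nodes[pos]->heappos = pos;
	if (pos < h->nnodes)
		toy_sift(h, pos);
	txn->heappos = -1;
}

/* The largest transaction is always h->nodes[0], so picking it is O(1). */

With such a structure, the first strategy you describe corresponds to adding
entries without calling the sift and rebuilding the whole heap just before
picking the victim, while the second corresponds to sifting on every memory
counter update so that the largest transaction stays at the root.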

A few comments:
1) Here we are changing memtrack_state to
REORDER_BUFFER_MEM_TRACK_NORMAL immediately once the size drops below
REORDE_BUFFER_MEM_TRACK_THRESHOLD. In this scenario we will end up
building the heap many times if transactions keep getting added and
removed around the threshold. How about waiting until the size of
txn_heap drops below 95% of REORDE_BUFFER_MEM_TRACK_THRESHOLD, to
avoid rebuilding the heap repeatedly in this scenario?
+	{
+		Assert(rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP);
+
+		/*
+		 * If the number of transactions gets lowered than the threshold,
+		 * switch to the state where we heapify the max-heap right before
+		 * picking the largest transaction while doing nothing for memory
+		 * counter update.
+		 */
+		if (binaryheap_size(rb->txn_heap) < REORDE_BUFFER_MEM_TRACK_THRESHOLD)
+			rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NORMAL;
	}
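
i.e., the check could instead look something like this (the exact factor is
of course just a suggestion):

		/*
		 * Don't switch back right at the threshold; wait until we are well
		 * below it so that transactions hovering around the boundary don't
		 * force us to rebuild the heap over and over.
		 */
		if (binaryheap_size(rb->txn_heap) <
			REORDE_BUFFER_MEM_TRACK_THRESHOLD * 95 / 100)
			rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NORMAL;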

2) I felt the init variable is not needed; we could directly check
txn->size instead, like it is done in the else case:
+	bool	init = (txn->size == 0);
+
 	txn->size += sz;
 	rb->size += sz;

 	/* Update the total size in the top transaction. */
 	toptxn->total_size += sz;
+
+	/* Update the transaction in the max-heap */
+	if (init)
+	{
+		/* Add the transaction to the max-heap */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
+			binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+		else
+			binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+	}
+	else if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+	{
+		/*
+		 * If we're maintaining max-heap even while updating the memory
+		 * counter, we reflect the updates to the max-heap.
+		 */
+		binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+	}
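
For instance, one way to do that (just to illustrate the idea, not taken
from the patch, and assuming sz is always positive on this path):

	txn->size += sz;
	rb->size += sz;
	...
	/* txn->size == sz here means the transaction had size 0 before this call */
	if (txn->size == sz)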

3) We can add some comments for this:
+typedef enum ReorderBufferMemTrackState
+{
+ REORDER_BUFFER_MEM_TRACK_NORMAL,
+ REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+
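
For example, something along these lines (the wording is only a suggestion,
based on my understanding of the two strategies described above):

/*
 * State of memory usage tracking in the reorderbuffer.
 *
 * In NORMAL state the max-heap is not kept heapified on every memory
 * counter update; it is built just before picking the largest transaction.
 * Once the number of transactions being decoded exceeds the threshold, we
 * switch to MAINTAIN_MAXHEAP, where the heap property is preserved on every
 * update so that the largest transaction can be picked in O(1).
 */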

4) This should be added to typedefs.list:
+typedef enum ReorderBufferMemTrackState
+{
+ REORDER_BUFFER_MEM_TRACK_NORMAL,
+ REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+

5) A few typos:
5.a) enlareable should be enlargeable
[PATCH v2 1/4] Make binaryheap enlareable.

5.b) subtranasctions should be subtransactions:
On the other hand, when the number of transactions being decoded is
fairly large, such as when a transaction has many subtranasctions,

5.c) evaludate should be evaluate:
XXX: updating the transaction's memory counter and the max-heap is now
O(log n), so we need to evaludate it. If there are some regression, we

5.d) pudate should be update:
It could be too expensive to pudate max-heap while preserving the heap
property each time the transaction's memory counter is updated, as it

5.e) frquently should be frequently:
could happen very frquently. So when the number of transactions being
decoded is small, we add the transactions to max-heap but don't

6) This should be added to typedefs.list:
+/*
+ * Struct for A hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+ bh_node_type key;
+ char status;
+ int idx;
+} bh_nodeidx_entry;

Regards,
Vignesh
