Re: Improve eviction algorithm in ReorderBuffer

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Shubham Khanna <khannashubham1197(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Improve eviction algorithm in ReorderBuffer
Date: 2024-03-11 06:04:13
Message-ID: CAHut+PvZ12UDr1v7PX2CAEq1jo2rqWgB3gT9PRpQdboydAGTNQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here are some review comments for v8-0003

======
0. GENERAL -- why the state enum?

This patch introduced a new ReorderBufferMemTrackState with 2 states
(REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)

It's the same as having a boolean flag OFF/ON, so I didn't see any
benefit of the enum instead of a simple boolean flag like
'track_txn_sizes'.

NOTE: Below in this post (see #11) I would like to propose another
idea, which can simplify much further, eliminating the need for the
state boolean. If adopted that will impact lots of these other review
comments.

======
Commit Message

1.
Previously, when selecting the transaction to evict during logical
decoding, we check all transactions to find the largest
transaction. Which could lead to a significant replication lag
especially in case where there are many subtransactions.

~

/Which could/This could/

/in case/in the case/

======
.../replication/logical/reorderbuffer.c

2.
* We use a max-heap with transaction size as the key to efficiently find
* the largest transaction. The max-heap state is managed in two states:
* REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.

/The max-heap state is managed in two states:/The max-heap is managed
in two states:/

~~~

3.
+/*
+ * Threshold of the total number of top-level and sub transactions
that controls
+ * whether we switch the memory track state. While using max-heap to select
+ * the largest transaction is effective when there are many transactions being
+ * decoded, in many systems there is generally no need to use it as long as all
+ * transactions being decoded are top-level transactions. Therefore, we use
+ * MaxConnections as the threshold* so we can prevent switch to the
state unless
+ * we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections

3a.
/memory track state./memory tracking state./

/While using max-heap/Although using max-heap/

"in many systems" (are these words adding anything?)

/threshold*/threshold/

/so we can prevent switch/so we can prevent switching/

~

3b.
There's nothing really in this name to indicate the units of the
threshold. Consider if there is some more informative name for this
macro: e.g.
MAXHEAP_TX_COUNT_THRESHOLD (?)

~~~

4.
+ /*
+ * Don't start with a lower number than
+ * REORDER_BUFFER_MEM_TRACK_THRESHOLD, since we add at least
+ * REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
+ */
+ buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+ buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2,
+ ReorderBufferTXNSizeCompare,
+ true, NULL);
+

IIUC the comment intends to say:

Allocate the initial heap size greater than THRESHOLD because the
txn_heap will not be used until the threshold is exceeded.

Also, maybe the comment should make a point of saying "Note: the
binary heap is INDEXED for faster manipulations". or something
similar.

~~~

5.
static void
ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
ReorderBufferChange *change,
+ ReorderBufferTXN *txn,
bool addition, Size sz)
{
- ReorderBufferTXN *txn;
ReorderBufferTXN *toptxn;

- Assert(change->txn);
-

There seems some trick now where the passed 'change' could be NULL,
which was not possible before. e.g., when change is NULL then 'txn' is
not NULL, and vice versa. Some explanation about this logic and the
meaning of these parameters should be written in this function
comment.

~

6.
+ txn = txn != NULL ? txn : change->txn;

IMO it's more natural to code the ternary using the same order as the
parameters:

e.g. txn = change ? change->txn : txn;

~~~

7.
/*
* Build the max-heap and switch the state. We will run a heap assembly step
* at the end, which is more efficient.
*/
static void
ReorderBufferBuildMaxHeap(ReorderBuffer *rb)

/We will run a heap assembly step at the end, which is more
efficient./The heap assembly step is deferred until the end, for
efficiency./

~~~

8. ReorderBufferLargestTXN

+ if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD)
+ {
+ HASH_SEQ_STATUS hash_seq;
+ ReorderBufferTXNByIdEnt *ent;
+
+ hash_seq_init(&hash_seq, rb->by_txn);
+ while ((ent = hash_seq_search(&hash_seq)) != NULL)
+ {
+ ReorderBufferTXN *txn = ent->txn;
+
+ /* if the current transaction is larger, remember it */
+ if ((!largest) || (txn->size > largest->size))
+ largest = txn;
+ }
+
+ Assert(largest);
+ }

That Assert(largest) seems redundant because there is anyway another
Assert(largest) immediately after this code.

~~~

9.
+ /* Get the largest transaction from the max-heap */
+ if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+ {
+ Assert(binaryheap_size(rb->txn_heap) > 0);
+ largest = (ReorderBufferTXN *)
+ DatumGetPointer(binaryheap_first(rb->txn_heap));
}
Assert(binaryheap_size(rb->txn_heap) > 0); seemed like slightly less
readable way of saying:

Assert(!binaryheap_empty(rb->txn_heap));

~~~

10.
+
+/*
+ * Compare between sizes of two transactions. This is for a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)

10a.
/Compare between sizes of two transactions./Compare two transactions by size./

~~~

10b.
IMO this comparator function belongs just before the
ReorderBufferAllocate() function since that is the only place where it
is used.

======
src/include/replication/reorderbuffer.h

11.
+/* State of how to track the memory usage of each transaction being decoded */
+typedef enum ReorderBufferMemTrackState
+{
+ /*
+ * We don't update max-heap while updating the memory counter. The
+ * max-heap is built before use.
+ */
+ REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
+
+ /*
+ * We also update the max-heap when updating the memory counter so the
+ * heap property is always preserved.
+ */
+ REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+

In my GENERAL review comment #0, I suggested the removal of this
entire enum. e.g. It could be replaced with a boolean field
'track_txn_sizes'

TBH, I think there is a better way to handle this "state". IIUC
- the txn_heap is always allocated up-front.
- you only "build" it when > threshold and
- when it drops < 0.9 x threshold you reset it.

Therefore, AFAICT you do not need to maintain any “switch states” at
all; you simply need to check binaryheap_empty(txn_heap), right?
* If the heap is empty…. It means you are NOT tracking, so don’t use it
* If the heap is NOT empty …. It means you ARE tracking, so use it.

~

Using my idea to remove the state flag will have the side effect of
simplifying many other parts of this patch. For example

BEFORE
+static void
+ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
+{
+ if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
+ return;
+
...
+ if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
+ {
+ rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+ binaryheap_reset(rb->txn_heap);
+ }
+}

AFTER
+static void
+ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
+{
+ if (binaryheap_empty(rb->txn_heap))
+ return;
+
...
+ if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
+ binaryheap_reset(rb->txn_heap);
+}

~~~

12. struct ReorderBuffer

+ /* Max-heap for sizes of all top-level and sub transactions */
+ ReorderBufferMemTrackState memtrack_state;
+ binaryheap *txn_heap;
+

12a.
Why is this being referred to in the commit message and code comments
as "max-heap" when the field is not called by that same name? Won't it
be better to give the field a better name -- e.g. "txn_maxheap" or
similar?

~

12b.
This comment should also say that the heap is ordered by tx size --
(e.g. the comparator is ReorderBufferTXNSizeCompare)

----------
Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey M. Borodin 2024-03-11 06:11:33 Re: Amcheck verification of GiST and GIN
Previous Message Ashutosh Bapat 2024-03-11 06:03:24 Re: Invalid query generated by postgres_fdw with UNION ALL and ORDER BY