Re: Improve eviction algorithm in ReorderBuffer

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Shubham Khanna <khannashubham1197(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Improve eviction algorithm in ReorderBuffer
Date: 2024-02-26 06:46:53
Message-ID: CAD21AoDKMMRMEWAyL_nBpzJwktKyWLSmao_2uOA+_CrrB77bnw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> Hi,
>
> I did a basic review and testing of this patch today. Overall I think
> the patch is in very good shape - I agree with the tradeoffs it makes,
> and I like the approach in general. I do have a couple minor comments
> about the code, and then maybe a couple thoughts about the approach.

Thank you for the review comments and tests!

>
>
> First, some comments - I'll put them here, but I also kept them in
> "review" commits, because that makes it easier to show the exact place
> in the code the comment is about.
>
> 1) binaryheap_allocate got a new "indexed" argument, but the comment is
> not updated to document it

Fixed.

>
> 2) I think it's preferable to use descriptive argument names for
> bh_set_node. I don't think there's a good reason to keep it short.

Agreed.

>
> 3) In a couple places we have code like this:
>
> if (heap->bh_indexed)
> bh_nodeidx_delete(heap->bh_nodeidx, result);
>
> Maybe it'd be better to have the if condition in bh_nodeidx_delete, so
> that it can be called without it.

Fixed.

>
> 4) Could we check the "found" flag in bh_set_node, somehow? I mean, we
> either expect to find the node (update of already tracked transaction)
> or not (when inserting it). The life cycle may be non-trivial (node
> added, updated and removed, ...), would be useful assert I think.

Agreed.

>
> 5) Do we actually need the various mem_freed local variables in a couple
> places, when we expect the value to be equal to txn->size (there's even
> assert enforcing that)?

You're right.

>
> 6) ReorderBufferCleanupTXN has a comment about maybe not using the same
> threshold both to enable & disable usage of the binaryheap. I agree with
> that, otherwise we could easily end up "trashing" if we add/remove
> transactions right around the threshold. I think 90-95% for disabling
> the heap would work fine.

Agreeehd.

>
> 7) The code disabling binaryheap (based on the threshold) is copied in a
> couple places, perhaps it should be a separate function called from
> those places.

Fixed.

>
> 8) Similarly to (3), maybe ReorderBufferTXNMemoryUpdate should do the
> memory size check internally, to make the calls simpler.

Agreed.

>
> 9) The ReorderBufferChangeMemoryUpdate / ReorderBufferTXNMemoryUpdate
> split maybe not very clear. It's not clear to me why it's divided like
> this, or why we can't simply call ReorderBufferTXNMemoryUpdate directly.

I think that now we have two use cases: updating the memory counter
after freeing individual change, and updating the memory counter after
freeing all changes of the transaction (i.e., making the counter to
0). In the former case, we need to check if the change is
REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID but we don't need to pass the
transaction as the change has its transaction. On the other hand, in
the latter case, we don't need the change but need to pass the
transaction. If we do both things in one function, the function would
have two arguments: change and txn, and the callers set either one
that they know. I've updated the patch accordingly.

BTW it might be worth considering to create a separate patch for the
updates around ReorderBufferChangeMemoryUpdate() that batches the
memory counter updates as it seems an independent change from the
max-heap stuff.

>
> performance
> -----------
>
> I did some benchmarks, to see the behavior in simple good/bad cases (see
> the attached scripts.tgz). "large" is one large transaction inserting 1M
> rows, small is 64k single-row inserts, and subxacts is the original case
> with ~100k subxacts. Finally, subxacts-small is many transactions with
> 128 subxacts each (the main transactions are concurrent).
>
> The results are pretty good, I think:
>
> test master patched
> -----------------------------------------------------
> large 2587 2459 95%
> small 956 856 89%
> subxacts 138915 2911 2%
> subxacts-small 13632 13187 97%

Thank you for doing the performance test. I ran the same script you
shared on my machine just in case and got similar results:

master patched
large: 2831 2827
small: 1226 1222
subxacts: 134076 2744
subxacts-small: 23384 23127

In my case, the differences seem to be within a noise range.

>
> This is timing (ms) with logical_work_mem=4MB. I also tried with 64MB,
> where the subxact timing goes way down, but the overall conclusions do
> not change.
>
> I was a bit surprised I haven't seen any clear regression, but in the
> end that's a good thing, right? There's a couple results in this thread
> showing ~10% regression, but I've been unable to reproduce those.
> Perhaps the newer patch versions fix that, I guess.

Yes, the 10% regression is fixed in the v4 patch. We don't update the
max-heap at all until the number of transactions reaches the threshold
so I think there is mostly 0 overhead in normal cases.

>
> Anyway, I think that at some point we'd have to accept that some cases
> may have slight regression. I think that's inherent for almost any
> heuristics - there's always going to be some rare case that defeats it.
> What's important is that the case needs to be rare and/or the impact
> very limited. And I think that's true here.

Agreed.

>
>
> overall design
> --------------
>
> As for the design, I agree with the approach of using a binaryheap to
> track transactions by size. When going over the thread history,
> describing the initial approach with only keeping "large" transactions
> above some threshold (e.g. 10%), I was really concerned that'll either
> lead to abrupt changes in behavior (when transactions move just around
> the 10%), or won't help with many common cases (with most transactions
> being below the limit).
>
> I was going to suggest some sort of "binning" - keeping lists for
> transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
> evicting transactions from a list, i.e. based on approximate size. But
> if the indexed binary heap seems to be cheap enough, I think it's a
> better solution.

I've also considered the binning idea. But it was not clear to me how
it works well in a case where all transactions belong to the
particular class. For example, if we need to free up 1MB memory, we
could end up evicting 2000 transactions consuming 50 bytes instead of
100 transactions consuming 1000 bytes, resulting in that we end up
serializing more transactions. Also, I'm concerned about the cost of
maintaining the binning lists.

>
> The one thing I'm a bit concerned about is the threshold used to start
> using binary heap - these thresholds with binary decisions may easily
> lead to a "cliff" and robustness issues, i.e. abrupt change in behavior
> with significant runtime change (e.g. you add/remove one transaction and
> the code takes a much more expensive path). The value (1024) seems
> rather arbitrary, I wonder if there's something to justify that choice.

True. 1024 seems small to me. In my environment, I started to see a
big difference from around 40000 transactions. But it varies depending
on environments and workloads.

I think that this performance problem we're addressing doesn't
normally happen as long as all transactions being decoded are
top-level transactions. Otherwise, we also need to improve
ReorderBufferLargestStreamableTopTXN(). Given this fact, I think
max_connections = 1024 is a possible value in some systems, and I've
observed such systems sometimes. On the other hand, I've observed >
5000 in just a few cases, and having more than 5000 transactions in
ReorderBuffer seems unlikely to happen without subtransactions. I
think we can say it's an extreme case, the number is still an
arbitrary number though.

Or probably we can compute the threshold based on max_connections,
e.g., max_connections * 10. That way, we can ensure that users won't
incur the max-heap maintenance costs as long as they don't use
subtransactions.

>
> In any case, I agree it'd be good to have some dampening factor, to
> reduce the risk of trashing because of adding/removing a single
> transaction to the decoding.
>
>
> related stuff / GenerationContext
> ---------------------------------
>
> It's not the fault of this patch, but this reminds me I have some doubts
> about how the eviction interferes with using the GenerationContext for
> some of the data. I suspect we can easily get into a situation where we
> evict the largest transaction, but that doesn't actually reduce the
> memory usage at all, because the memory context blocks are shared with
> some other transactions and don't get 100% empty (so we can't release
> them). But it's actually worse, because GenerationContext does not even
> reuse this memory. So do we even gain anything by the eviction?
>
> When the earlier patch versions also considered age of the transaction,
> to try evicting the older ones first, I think that was interesting. I
> think we may want to do something like this even with the binary heap.

Thank you for raising this issue. This is one of the highest priority
items in my backlog. We've seen cases where the logical decoding uses
much more memory than logical_decoding_work_mem value[1][2] (e.g. it
used 4GB memory even though the logical_decoding_work_mem was 256kB).
I think that the problem would still happen even with this improvement
on the eviction.

I believe these are separate problems we can address, and evicting
large transactions first would still be the right strategy. We might
want to improve how we store changes in memory contexts. For example,
it might be worth having per-transaction memory context so that we can
actually free memory blocks by the eviction. We can discuss it in a
separate thread.

>
>
> related stuff / increase of logical_decoding_work_mem
> -----------------------------------------------------
>
> In the thread, one of the "alternatives to spilling" suggested in the
> thread was to enable streaming, but I think there's often a much more
> efficient alternative - increase the amount of memory, so that we don't
> actually need to spill.

Agreed.

>
> For example, a system may be doing a lot of eviction / spilling with
> logical_decoding_work_mem=64MB, but setting 128MB may completely
> eliminate that. Of course, if there are large transactions, this may not
> be possible (the GUC would have to exceed RAM). But I don't think that's
> very common, the incidents that I've observed were often resolved by
> bumping the logical_decoding_work_mem by a little bit.
>
> I wonder if there's something we might do to help users to tune this. We
> should be able to measure the "peak" memory usage (how much memory we'd
> need to not spill), so maybe we could log that as a WARNING, similarly
> to checkpoints - there we only log "checkpoints too frequent, tune WAL
> limits", but perhaps we might do more here? Or maybe we could add the
> watermark to the system catalog?

Interesting ideas.

The statistics such as spill_count shown in pg_stat_replication_slots
view could already give hints to users to increase the
logical_decoding_work_mem. In addition to that, it's an interesting
idea to have the high water mark in the view.

I've attached updated patches.

Regards,

[1] https://www.postgresql.org/message-id/CAMnUB3oYugXCBLSkih%2BqNsWQPciEwos6g_AMbnz_peNoxfHwyw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/17974-f8c9d353a62f414d%40postgresql.org

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v6-0001-Make-binaryheap-enlargeable.patch application/octet-stream 3.1 KB
v6-0003-Use-max-heap-to-efficiently-select-largest-transa.patch application/octet-stream 15.7 KB
v6-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch application/octet-stream 15.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2024-02-26 06:49:42 Re: RFC: Logging plan of the running query
Previous Message Ashutosh Bapat 2024-02-26 06:16:11 Re: SQL Property Graph Queries (SQL/PGQ)