Re: Making the initial and maximum DSA segment sizes configurable

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Subject: Re: Making the initial and maximum DSA segment sizes configurable
Date: 2023-12-19 06:03:26
Message-ID: CAD21AoCVMw6DSmgZY9h+xfzKtzJeqWiwxaUD2T-FztVcV-XibQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Wed, Mar 22, 2023 at 12:15 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> Hi all,
>
> I started this new thread from another thread[1] where we're
> discussing a new storage for TIDs, TidStore, since we found a
> difficulty about the memory usage limit for TidStores on DSA.
>
> TidStore is a new data structure to efficiently store TIDs, backed by
> a radix tree. In the patch series proposed on that thread, in addition
> to radix tree and TidStore, there is another patch for lazy (parallel)
> vacuum to replace the array of dead tuple TIDs with a TidStore. To
> support parallel vacuum, radix tree (and TidStore) can be created on a
> local memory as well as on DSA. Also, it has memory usage limit
> functionality; we can specify the memory limit (e.g.,
> maintenance_work_mem) to TidStoreCreate() function. Once the total DSA
> segment size (area->control->total_segment_size) exceeds the limit,
> TidStoreIsFull() returns true. The lazy vacuum can continue scanning
> heap blocks to collect dead tuple TIDs until TidStoreIsFull() returns
> true. Currently lazy vacuum is the sole user of TidStore but maybe it
> can be used by other codes such as tidbitmap.c where will be limited
> by work_mem.
>
> During the development, we found out that DSA memory growth is
> unpredictable, leading to inefficient memory limitation.
>
> DSA is built on top of DSM segments and it manages a set of DSM
> segments, adding new segments as required and detaching them when they
> are no longer needed. The DSA segment starts with 1MB in size and a
> new segment size is at least big enough to follow a geometric series
> that approximately doubles the total storage each time we create a new
> segment. Because of this fact, it's not efficient to simply compare
> the memory limit to the total segment size. For example, if
> maintenance_work_mem is 512MB, the total segment size will be like:
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) = 510MB -> less than the
> limit, continue heap scan.
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) + 256 = 766MB -> stop (exceed 254MB).
>
> One might think we can use dsa_set_size_limit() but it cannot; lazy
> vacuum ends up with an error. If we set DSA_ALLOC_NO_OOM, we might end
> up stopping the insertion halfway.
>
> Besides excessively allocating memory, since the initial DSM segment
> size is fixed 1MB, memory usage of a shared TidStore will start from
> 1MB+. This is higher than the minimum values of both work_mem and
> maintenance_work_mem, 64kB and 1MB respectively. Increasing the
> minimum m_w_m to 2MB might be acceptable but not for work_mem.
>
> Researching possible solutions, we found that aset.c also has a
> similar characteristic; allocates an 8K block (by default) upon the
> first allocation in a context, and doubles that size for each
> successive block request. But we can specify the initial block size
> and max blocksize. This made me think of an idea to specify both to
> DSA and both values are calculated based on m_w_m. I've attached the
> patch for this idea. The changes to dsa.c are straightforward since
> dsa.c already uses macros DSA_INITIAL_SEGMENT_SIZE and
> DSA_MAX_SEGMENT_SIZE. I just made these values configurable.
>
> FYI with this patch, we can create a DSA in parallel_vacuum_init()
> with initial and maximum block sizes as follows:
>
> initial block size = min(m_w_m / 4, 1MB)
> max block size = max(m_w_m / 8, 8MB)
>
> In most cases, we can start with a 1MB initial segment, the same as
> before. For larger memory, the heap scan stops after DSA allocates
> 1.25 times more memory than m_w_m. For example, if m_w_m = 512MB, the
> both initial and maximum segment sizes are 1MB and 64MB respectively,
> and then DSA allocates the segments as follows until heap scanning
> stops:
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 4) = 510MB -> less than the
> limit, continue heap scan.
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 5) = 574MB -> stop
> (allocated additional 62MB).
>
> It also works with smaller memory; If the limit is 1MB, we start with
> a 256KB initial segment and heap scanning stops after DSA allocated
> 1.5MB (= 256kB + 256kB + 512kB + 512kB).
>
> There is room for considering better formulas for initial and maximum
> block sizes but making both values configurable is a promising idea.
> And the analogous behavior to aset could be a good thing for
> readability and maintainability. There is another test result where I
> used this idea on top of a radix tree[2].
>
> We need to consider the total number of allocated DSA segments as the
> total number of DSM segments available on the system is fixed[3]. But
> it seems not problematic even with this patch since we allocate only a
> few additional segments (in above examples 17 segs vs. 19 segs). There
> was no big difference also in performance[2].
>

The last time I posted this email seemed not good timing since it was
close to the feature freeze, and the email was very long. The tidstore
and radix tree developments are still in-progress[1] and this change
is still necessary. I'd like to summarize the problem and proposal:

* Both the initial DSA segment size and the maximum DSA segment size
are fixed values: 1MB and 1TB respectively.
* The total allocated DSA segments follows a geometric series.
* The patch makes both the initial and maximum DSA segment sizes configurable.
* Which helps:
* minimize wasting memory when the total DSA segment size reaches
the limit set by caller.
* create a data structure with a small memory, for example 64kB,
the minimum value of work_mem.

According to the recent discussion, it might be sufficient to make
only the maximum DSA segment size configurable.

I'll register this item for the next commit fest.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoDjTbp2SHn4hRzAxWNeYArn4Yd4UdH9XRoNzdrYWNgExw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-12-19 06:06:51 Re: Synchronizing slots from primary to standby
Previous Message Bharath Rupireddy 2023-12-19 05:55:50 Re: Add a perl function in Cluster.pm to generate WAL