Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-12-20 05:03:58
Message-ID: CAD21AoAYA0fnPC7ww2v1USvn3VQ52Zn5XK6gyhWbaXuXn16Unw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 19, 2022 at 4:13 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Dec 12, 2022 at 7:14 PM John Naylor
> > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > >
> > >
> > > On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > > > >
> > >
> > > > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
> > > >
> > > > The minimum requirement is 2MB. In the PoC patch, TIDStore checks how
> > > > big the radix tree is using dsa_get_total_size(). If the size returned
> > > > by dsa_get_total_size() (plus some memory used for TIDStore meta
> > > > information) exceeds maintenance_work_mem, lazy vacuum starts index
> > > > vacuuming and heap vacuuming. However, when allocating DSA memory for
> > > > radix_tree_control at creation time, we allocate a 1MB
> > > > (DSA_INITIAL_SEGMENT_SIZE) DSM segment and take the memory required for
> > > > radix_tree_control from it, so dsa_get_total_size() returns 1MB even
> > > > before any TID is collected.
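
For readers following along, the check in the PoC looks roughly like the
sketch below. dsa_get_total_size() and maintenance_work_mem are the real
APIs; TidStoreIsFull, the TidStore struct, and the meta_size field are
illustrative names of mine, not the actual patch code:

#include "miscadmin.h"          /* maintenance_work_mem */
#include "utils/dsa.h"

typedef struct TidStore
{
    dsa_area   *area;           /* DSA area backing the radix tree */
    Size        meta_size;      /* TIDStore meta information */
} TidStore;

static bool
TidStoreIsFull(TidStore *ts)
{
    /*
     * dsa_get_total_size() reports every DSM segment backing the area,
     * including the initial 1MB segment created with the area itself,
     * so this is at least 1MB even before any TID is stored.
     */
    Size        used = dsa_get_total_size(ts->area) + ts->meta_size;

    /* maintenance_work_mem is in kilobytes */
    return used > (Size) maintenance_work_mem * 1024;
}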
> > >
> > > 2MB makes sense.
> > >
> > > If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
> > >
> > > > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
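
To illustrate the buffering idea, here is a hypothetical sketch; the buffer
size, the struct, and tidstore_add_tids() are made-up names, not an existing
API. Offsets accumulate locally and are flushed only when the next heap page
could overflow the array:

#include "access/htup_details.h"    /* MaxHeapTuplesPerPage */
#include "storage/itemptr.h"

#define LOCAL_TID_BUF_SIZE  8192    /* a few dozen kilobytes of TIDs */

typedef struct LocalTidBuffer
{
    int             ntids;
    ItemPointerData tids[LOCAL_TID_BUF_SIZE];
} LocalTidBuffer;

static void
flush_if_next_page_could_overflow(LocalTidBuffer *buf, TidStore *ts)
{
    /* MaxHeapTuplesPerPage bounds how many TIDs one heap page can add */
    if (buf->ntids + MaxHeapTuplesPerPage > LOCAL_TID_BUF_SIZE)
    {
        tidstore_add_tids(ts, buf->tids, buf->ntids);   /* hypothetical */
        buf->ntids = 0;
        /* the caller then checks the store's memory usage here */
    }
}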
> > > >
> > > > Right, I think it's not a problem in the slab case. In the DSA case,
> > > > the new segment size follows a geometric series that approximately
> > > > doubles the total storage each time a new segment is created. This
> > > > behavior comes from the fact that the underlying DSM system isn't
> > > > designed for large numbers of segments.
> > >
> > > And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
> > >
> > > maintenance_work_mem = 256MB, so stop if we go over 128MB:
> > >
> > > 2*(1+2+4+8+16+32) = 126MB -> keep going
> > > 126MB + 64 = 190MB -> stop
> > >
> > > That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
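
A minimal sketch of that test, reusing the hypothetical TidStore fields from
the earlier sketch (dsa_get_total_size() is the real DSA API; the rest is
illustrative). Stopping once the area passes half the limit guarantees that
the next doubling-sized segment cannot push the total past the full limit:

static bool
TidStoreShouldStop(TidStore *ts)
{
    Size        limit = (Size) maintenance_work_mem * 1024;

    /* conservative: leave room for one more doubling-sized segment */
    return dsa_get_total_size(ts->area) > limit / 2;
}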
> >
> > Right. In this case, even though we allocate a 64MB segment, we would
> > use only 2088 bytes of it at most. So the memory space actually used
> > for vacuum is practically limited to about half of maintenance_work_mem.
> >
> > >
> > > And as for the minimum possible maintenance_work_mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead when a buildfarm animal is set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
> >
> > Looking at other code using DSA, such as tidbitmap.c and nodeHash.c, it
> > seems that they look only at memory that is actually dsa_allocate'd.
> > To be exact, we estimate the number of hash buckets based on work_mem
> > (and hash_mem_multiplier) and use that as the upper limit, so I've
> > confirmed that the result of dsa_get_total_size() can exceed the
> > limit. I'm not sure whether this is a known and legitimate usage
> > pattern, but if we can follow it, we can probably track how much
> > dsa_allocate'd memory the radix tree uses.
>
> I've experimented with this idea. The newly added 0008 patch changes
> the radix tree so that it counts the memory usage for both local and
> shared cases.
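
To give an idea of the approach, here is a rough sketch of such accounting;
rt_alloc_node(), the area and mem_used fields, and rt_memory_usage() are
illustrative names, not the actual 0008 patch code. Only memory that is
actually dsa_allocate'd (or palloc'd in the local case) is counted, rather
than whole DSM segments:

static rt_node *
rt_alloc_node(radix_tree *tree, Size size)
{
    rt_node    *node;

    if (tree->area != NULL)
    {
        /* shared case: allocate from the DSA area */
        dsa_pointer p = dsa_allocate(tree->area, size);

        node = (rt_node *) dsa_get_address(tree->area, p);
    }
    else
        node = (rt_node *) palloc(size);    /* backend-local case */

    tree->mem_used += size;     /* count only what we actually allocated */
    return node;
}

Size
rt_memory_usage(radix_tree *tree)
{
    return tree->mem_used;
}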

I've attached updated versions of the patches to make cfbot happy.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v15-0008-PoC-calculate-memory-usage-in-radix-tree.patch application/octet-stream 11.7 KB
v15-0005-tool-for-measuring-radix-tree-performance.patch application/octet-stream 20.0 KB
v15-0007-PoC-DSA-support-for-radix-tree.patch application/octet-stream 39.6 KB
v15-0009-PoC-lazy-vacuum-integration.patch application/octet-stream 45.1 KB
v15-0006-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch application/octet-stream 57.3 KB
v15-0004-Use-bitmapword-for-node-125.patch application/octet-stream 5.9 KB
v15-0001-introduce-vector8_min-and-vector8_highbit_mask.patch application/octet-stream 2.6 KB
v15-0003-Add-radix-implementation.patch application/octet-stream 90.8 KB
v15-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch application/octet-stream 6.1 KB
