From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-03-01 06:37:01
Message-ID: CAFBsxsEnzivaJ13iCGdDoUMsXJVGOaahuBe_y=q6ow=LTzyDvA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Feb 28, 2023 at 3:42 PM John Naylor
> > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > >
> > >
> > > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> > > > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > > > >
> > > > > That doesn't seem useful to me. If we've done enough testing to
> > > > > reassure us the new way always gives the same answer, the old way is
> > > > > not needed at commit time. If there is any doubt it will always give
> > > > > the same answer, then the whole patchset won't be committed.
> > >
> > > > My idea is to make the bug investigation easier but on
> > > > reflection, it seems not the best idea given this purpose.
> > >
> > > My concern with TIDSTORE_DEBUG is that it adds new code that mimics the
> > > old tid array. As I've said, that doesn't seem like a good thing to carry
> > > forward forevermore, in any form. Plus, comparing new code with new code
> > > is not the same thing as comparing existing code with new code. That was
> > > my idea upthread.
> > >
> > > Maybe the effort my idea requires is too much vs. the likelihood of
> > > finding a problem. In any case, it's clear that if I want that level of
> > > paranoia, I'm going to have to do it myself.
> > >
> > > > What do you think
> > > > about the attached patch? Please note that it also includes the
> > > > changes for minimum memory requirement.
> > >
> > > Most of the asserts look logical, or at least harmless.
> > >
> > > - int max_off; /* the maximum offset number */
> > > + OffsetNumber max_off; /* the maximum offset number */
> > >
> > > I agree with using the specific type for offsets here, but I'm not sure
> > > why this change belongs in this patch. If we decided against the new
> > > asserts, this would be easy to lose.
> >
> > Right. I'll separate this change as a separate patch.
> >
> > >
> > > This change, however, defies common sense:
> > >
> > > +/*
> > > + * The minimum amount of memory required by TidStore is 2MB, the current minimum
> > > + * valid value for the maintenance_work_mem GUC. This is required to allocate the
> > > + * DSA initial segment, 1MB, and some meta data. This number is applied also to
> > > + * the local TidStore cases for simplicity.
> > > + */
> > > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
> > >
> > > + /* Sanity check for the max_bytes */
> > > + if (max_bytes < TIDSTORE_MIN_MEMORY)
> > > + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> > > + TIDSTORE_MIN_MEMORY, max_bytes);
> > >
> > > Aside from the fact that this elog is something that would never get past
> > > development, the #define just adds a hard-coded copy of something that is
> > > already hard-coded somewhere else, whose size depends on an implementation
> > > detail in a third place.
> > >
> > > This also assumes that all users of tid store are limited by
> > > maintenance_work_mem. Andres thought of an example of some day unifying
> > > with tidbitmap.c, and maybe other applications will be limited by work_mem.
> > >
> > > But now that I'm looking at the guc tables, I am reminded that work_mem's
> > > minimum is 64kB, so this highlights a design problem: there is obviously no
> > > requirement that the minimum work_mem has to be >= a single DSA segment,
> > > even though operations like parallel hash and parallel bitmap heap scan are
> > > limited by work_mem.
> >
> > Right.
> >
> > > It would be nice to find out what happens with these parallel features
> > > when work_mem is tiny (maybe parallelism is not even considered?).
> >
> > IIUC neither cares about the allocated DSA segment size. Parallel hash
> > accounts for the actual tuple (+ header) size as used memory but doesn't
> > consider how much DSA segment space is allocated behind it. Both parallel
> > hash and parallel bitmap scan can work even with work_mem = 64kB, but when
> > I checked the total DSA segment size allocated during these operations, it
> > was 1MB.
> >
> > I realized that there is a similar memory limit design issue in the
> > non-shared tidstore cases as well. We deduct 70kB from max_bytes, but that
> > doesn't work with work_mem = 64kB. Probably we need to reconsider it. FYI
> > the 70kB comes from the maximum slab block size for node256.
>
> Currently, we calculate the slab block size to be large enough to allocate
> 32 chunks from it. For node256, the leaf node is 2,088 bytes and the slab
> block size is 66,816 bytes. One idea to fix this issue is to decrease it.
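
(For reference, that block size is just the 32-chunk target times the leaf
size quoted above: 2,088 bytes * 32 = 66,816 bytes, a bit over 65kB, which is
where the ~70kB deduction comes from.)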

I think we're trying to solve the wrong problem here. I need to study this
more, but it seems that code that needs to stay within a memory limit only
needs to track what's been allocated in chunks within a block, since
writing there is what invokes a page fault. If we're not keeping track of
each and every chunk space, for speed, it doesn't follow that we need to
keep every block allocation within the configured limit. I'm guessing we
can just ask the context if the block space has gone *over* the limit, and
we can assume that the last allocation we perform will only fault one
additional page. We need to have a clear answer on this before doing
anything else.
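
To make that concrete, here is a rough sketch of the kind of check I have in
mind (untested, and the struct and field names below are placeholders rather
than whatever the patch actually uses):

#include "utils/memutils.h"

/* illustrative only; the real TidStore layout will differ */
typedef struct TidStore
{
	MemoryContext context;		/* context the radix tree allocates from */
	size_t		max_bytes;		/* caller's memory budget */
} TidStore;

/*
 * MemoryContextMemAllocated() reports the block space the context (and,
 * with recurse = true, its children) has obtained from malloc, not the
 * chunk space handed out to callers, so a check like this can overshoot
 * the budget by roughly one slab block at most.
 */
static inline bool
TidStoreIsFull(TidStore *ts)
{
	return MemoryContextMemAllocated(ts->context, true) > ts->max_bytes;
}

Vacuum would then call something like this after each insertion and switch to
index vacuuming once it returns true, instead of trying to predict block
allocations up front.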

If that's correct, and I'm not positive yet, we can get rid of all the
fragile assumptions about things the tid store has no business knowing
about, as well as the guc change. I'm not sure how this affects progress
reporting, because it would be nice if it didn't report dead_tuple_bytes
bigger than max_dead_tuple_bytes.

--
John Naylor
EDB: http://www.enterprisedb.com
