Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-12-08 01:56:50
Message-ID: CAD21AoAK=_E7QhkYVrb9hFiZt3JXbBU=pnyUduR2OYNxr3TBrg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 6, 2023 at 3:39 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Wed, Dec 6, 2023 at 4:34 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Dec 4, 2023 at 5:21 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> > > > Given variable-length value support, RT_GET() would have to do
> > > > repalloc() if the existing value size is not big enough for the new
> > > > value, but it cannot as the radix tree doesn't know the size of each
> > > > stored value.
> > >
> > > I think we have two choices:
> > >
> > > - the value stores the "length". The caller would need to specify a
> > > function to compute size from the "length" member. Note this assumes
> > > there is an array. I think both aspects are not great.
> > > - the value stores the "size". Callers that store an array (as
> > > PageTableEntry's do) would compute length when they need to. This
> > > sounds easier.
> >
> > As for the second idea, do we always need to require the value to have
> > the "size" (e.g. int32) in the first field of its struct? If so, the
> > caller will be able to use only 4 bytes in embedded value cases (or
> > won't be able to use at all if the pointer size is 4 bytes).
>
> We could have an RT_SIZE_TYPE for varlen value types. That's easy.
> There is another way, though: (This is a digression into embedded
> values, but it does illuminate some issues even aside from that)
>
> My thinking a while ago was that an embedded value had no explicit
> length/size, but could be "expanded" into a conventional value for the
> caller. For bitmaps, the smallest full value would have length 1 and
> whatever size (For tid store maybe 16 bytes). This would happen
> automatically via a template function.
>
> Now I think that could be too complicated (especially for page table
> entries, which have more bookkeeping than vacuum needs) and slow.
> Imagine this as an embedded value:
>
> typedef struct BlocktableEntry
> {
> uint16 size;
>
> /* later: uint8 flags; for bitmap scan */
>
> /* 64 bit: 3 elements , 32-bit: 1 element */
> OffsetNumber offsets[( sizeof(Pointer) - sizeof(int16) ) /
> sizeof(OffsetNumber)];
>
> /* end of embeddable value */
>
> bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> } BlocktableEntry;
>
> Here we can use a slot to store up to 3 offsets, no matter how big
> they are. That's great because a bitmap could be mostly wasted space.

Interesting idea.

> But now the caller can't know up front how many bytes it needs until
> it retrieves the value and sees what's already there. If there are
> already three values, the caller needs to tell the tree "alloc this
> much, update this slot you just gave me with the alloc (maybe DSA)
> pointer, and return the local pointer". Then copy the 3 offsets into
> set bits, and set whatever else it needs to. With normal values, same
> thing, but with realloc.
>
> This is a bit complex, but I see an advantage The tree doesn't need to
> care so much about the size, so the value doesn't need to contain the
> size. For our case, we can use length (number of bitmapwords) without
> the disadvantages I mentioned above, with length zero (or maybe -1)
> meaning "no bitmapword array, the offsets are all in this small
> array".

It's still unclear to me why the value doesn't need to contain the size.

If I understand you correctly, in RT_GET(), the tree allocs a new
memory and updates the slot where the value is embedded with the
pointer to the allocated memory, and returns the pointer to the
caller. Since the returned value, newly allocated memory, is still
empty, the callner needs to copy the contents of the old value to the
new value and do whatever else it needs to.

If the value is already a single-leave value and RT_GET() is called
with a larger size, the slot is always replaced with the newly
allocated area and the caller needs to copy the contents? If the tree
does realloc the value with a new size, how does the tree know the new
value is larger than the existing value? It seems like the caller
needs to provide a function to calculate the size of the value based
on the length.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2023-12-08 02:24:38 Re: [PoC] Improve dead tuple storage for lazy vacuum
Previous Message Michael Paquier 2023-12-08 01:55:24 Re: Test 002_pg_upgrade fails with olddump on Windows