Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-12-06 06:39:21
Message-ID: CANWCAZYw+_KAaUNruhJfE=h6WgtBKeDG32St8vBJBEY82bGVRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 6, 2023 at 4:34 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, Dec 4, 2023 at 5:21 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:

> > > Given variable-length value support, RT_GET() would have to do
> > > repalloc() if the existing value size is not big enough for the new
> > > value, but it cannot as the radix tree doesn't know the size of each
> > > stored value.
> >
> > I think we have two choices:
> >
> > - the value stores the "length". The caller would need to specify a
> > function to compute size from the "length" member. Note this assumes
> > there is an array. I think both aspects are not great.
> > - the value stores the "size". Callers that store an array (as
> > PageTableEntry's do) would compute length when they need to. This
> > sounds easier.
>
> As for the second idea, do we always need to require the value to have
> the "size" (e.g. int32) in the first field of its struct? If so, the
> caller will be able to use only 4 bytes in embedded value cases (or
> won't be able to use at all if the pointer size is 4 bytes).

We could have an RT_SIZE_TYPE for varlen value types. That's easy.
There is another way, though: (This is a digression into embedded
values, but it does illuminate some issues even aside from that)

My thinking a while ago was that an embedded value had no explicit
length/size, but could be "expanded" into a conventional value for the
caller. For bitmaps, the smallest full value would have length 1 and
whatever size (For tid store maybe 16 bytes). This would happen
automatically via a template function.

Now I think that could be too complicated (especially for page table
entries, which have more bookkeeping than vacuum needs) and slow.
Imagine this as an embedded value:

typedef struct BlocktableEntry
{
uint16 size;

/* later: uint8 flags; for bitmap scan */

/* 64 bit: 3 elements , 32-bit: 1 element */
OffsetNumber offsets[( sizeof(Pointer) - sizeof(int16) ) /
sizeof(OffsetNumber)];

/* end of embeddable value */

bitmapword words[FLEXIBLE_ARRAY_MEMBER];
} BlocktableEntry;

Here we can use a slot to store up to 3 offsets, no matter how big
they are. That's great because a bitmap could be mostly wasted space.
But now the caller can't know up front how many bytes it needs until
it retrieves the value and sees what's already there. If there are
already three values, the caller needs to tell the tree "alloc this
much, update this slot you just gave me with the alloc (maybe DSA)
pointer, and return the local pointer". Then copy the 3 offsets into
set bits, and set whatever else it needs to. With normal values, same
thing, but with realloc.

This is a bit complex, but I see an advantage The tree doesn't need to
care so much about the size, so the value doesn't need to contain the
size. For our case, we can use length (number of bitmapwords) without
the disadvantages I mentioned above, with length zero (or maybe -1)
meaning "no bitmapword array, the offsets are all in this small
array".

> > > Another idea is that the radix tree returns the pointer
> > > to the slot and the caller updates the value accordingly.
> >
> > I did exactly this in v43 TidStore if I understood you correctly. If I
> > misunderstood you, can you clarify?
>
> I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates
> the value as they want.

Did my sketch above get closer to that? Side note: I don't think we
can expose that directly (e.g. need to check for create or extend
upwards), but some functionality can be a thin wrapper around it.

> > However, for vacuum, we have all values that we need up front. That
> > gives me an idea: Something like this insert API could be optimized
> > for "insert-only": If we only free values when we free the whole tree
> > at the end, that's a clear use case for David Rowley's proposed "bump
> > context", which would save 8 bytes per allocation and be a bit faster.
> > [1] (RT_GET for varlen values would use an aset context, to allow
> > repalloc, and nodes would continue to use slab).
>
> Interesting idea and worth trying it. Do we need to protect the whole
> tree as insert-only for safety? It's problematic if the user uses
> mixed RT_INSERT() and RT_GET().

You're right, but I'm not sure what the policy should be.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shubham Khanna 2023-12-06 06:45:50 Re: Remove MSVC scripts from the tree
Previous Message Sutou Kouhei 2023-12-06 06:19:08 Re: Make COPY format extendable: Extract COPY TO format implementations