Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-06-14 06:23:07
Message-ID: CAFBsxsFsd_WnS7c-aRX9xEEhRnA3u9aBumH+_HmmmY7bCjOoBw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 13, 2023 at 12:47 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
>
> On Tue, Jun 6, 2023 at 2:13 PM John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
wrote:
> >

> > I'd need to make sure, but apparently just going from 6 non-empty
memory contexts to 3 (remember all values are embedded here) reduces memory
fragmentation significantly in this test. (That should also serve as a
demonstration that additional size classes have both runtime costs as well
as benefits. We need to have a balance.)
>
> Interesting. The result would probably vary if we change the slab
> block sizes. I'd like to experiment if the code is available.

I cleaned up a few things and attached v34 so you can do that if you like.
(Note: what I said about node63/n125 not making a difference in that one
test is not quite true since slab keeps a few empty blocks around. I did
some rough mental math and I think it doesn't change the conclusion any.)

0001-0007 is basically v33, but can apply on master.

0008 just adds back RT_EXTEND_DOWN. I left it out to simplify moving to
recursion.

> > Oh, the memcpy part is great, very simple. I mean the (compile-time)
"class info" table lookups are a bit awkward. I'm thinking the hard-coded
numbers like this:
> >
> > .fanout = 3,
> > .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
> >
> > ...may be better with a #defined symbol that can also be used elsewhere.
>
> FWIW, exposing these definitions would be good in terms of testing too
> since we can use them in regression tests.

I added some definitions in 0012. It kind of doesn't matter now what sizes
are the test unless it also can test that it stays within the expected
size, if that makes sense. It is helpful during debugging to force growth
to stop at a certain size.

> > > Within the size class, we just alloc a new node of lower size class
> > > and do memcpy().

Not anymore. ;-) To be technical, it didn't "just" memcpy(), since it then
fell through to find the insert position and memmove(). In some parts of
Andres' prototype, no memmove() is necessary, because it memcpy()'s around
the insert position, and puts the new child in the right place. I've done
this in 0009.

The memcpy you mention was done for 1) simplicity 2) to avoid memset'ing.
Well, it was never necessary to memset the whole node in the first place.
Only the header, slot index array, and isset arrays need to be zeroed, so
in 0011 we always do only that. That combines alloc and init functionality,
and it's simple everywhere.

In 0010 I restored iteration functionality -- it can no longer get the
shift from the node, because it's not there as of v33. I was not
particularly impressed that there were no basic iteration tests, and in
fact the test_pattern test relied on functioning iteration. I added some
basic tests. I'm not entirely pleased with testing overall, but I think
it's at least sufficient for the job. I had the idea to replace "shift"
everywhere and use "level" as a fundamental concept. This is clearer. I do
want to make sure the compiler can compute the shift efficiently where
necessary. I think that can wait until much later.

0013 standardizes (mostly) on 4/16/48/256 for naming convention, regardless
of actual size, as I started to do earlier.

0014 is part cleanup of shrinking, and part making grow-node-48 more
consistent with the rest.

> > The additional bit per slot would require per-node logic and additional
branches, which is not great. I'm now thinking a much easier way to get
there is to give up (at least for now) on promising that "run-time
embeddable values" can use the full pointer-size (unlike value types found
embeddable at compile-time). Reserving the lowest pointer bit for a tag
"value or pointer-to-leaf" would have a much smaller code footprint.
>
> Do you mean we can make sure that the value doesn't set the lowest
> bit? Or is it an optimization for TIDStore?

It will be up to the caller (the user of the template) -- if an
abbreviation is possible that fits in the upper 63 bits (with something to
guard for 32-bit platforms), the developer will be able to specify a
conversion function so that the caller only sees the full value when
searching and setting. Without such a function, the template will fall back
to the size of the value type to determine how the value is stored.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v34-ART.tar.gz application/gzip 75.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2023-06-14 06:31:02 Re: pg_get_indexdef() modification to use TxnSnapshot
Previous Message Sergey Dudoladov 2023-06-14 05:50:35 Re: Add connection active, idle time to pg_stat_activity