Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-09-28 06:18:35
Message-ID: CAFBsxsG2SDMpmON_ovVY6S+oCT-Mn=9giMm4_YquHdkZfn7XBQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:

> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> [...]

Thanks for that info -- I wasn't familiar with the details of DSA. For the
non-parallel case, I plan to at least benchmark using aset because I gather
it's the most heavily optimized. I'm thinking that will allow other problem
areas to be more prominent. I'll also want to compare total context size
compared to slab to see if possibly less fragmentation makes up for other
wastage.

Along those lines, one thing I've been thinking about is the number of size
classes. There is a tradeoff between memory efficiency and number of
branches when searching/inserting. My current thinking is there is too much
coupling between size class and data type. Each size class currently uses a
different data type and a different algorithm to search and set it, which
in turn requires another branch. We've found that a larger number of size
classes leads to poor branch prediction [1] and (I imagine) code density.

I'm thinking we can use "flexible array members" for the values/pointers,
and keep the rest of the control data in the struct the same. That way, we
never have more than 4 actual "kinds" to code and branch on. As a bonus,
when migrating a node to a larger size class of the same kind, we can
simply repalloc() to the next size. To show what I mean, consider this new
table:

node2: 5 + 6 +(5)+ 2*8 = 32 bytes
node6: 5 + 6 +(5)+ 6*8 = 64

node12: 5 + 27 + 12*8 = 128
node27: 5 + 27 + 27*8 = 248(->256)

node91: 5 + 256 + 28 +(7)+ 91*8 = 1024
node219: 5 + 256 + 28 +(7)+219*8 = 2048

node256: 5 + 32 +(3)+256*8 = 2088(->4096)

Seven size classes are grouped into the four kinds.

The common base at the front is here 5 bytes because there is a new uint8
field for "capacity", which we can ignore for node256 since we assume we
can always insert/update that node. The control data is the same in each
pair, and so the offset to the pointer/value array is the same. Thus,
migration would look something like:

case FOO_KIND:
if (unlikely(count == capacity))
{
if (capacity == XYZ) /* for smaller size class of the pair */
{
<repalloc to next size class>;
capacity = next-higher-capacity;
goto do_insert;
}
else
<migrate data to next node kind>;
}
else
{
do_insert:
<...>;
break;
}
/* FALLTHROUGH */
...

One disadvantage is that this wastes some space by reserving the full set
of control data in the smaller size class of the pair, but it's usually
small compared to array size. Somewhat unrelated, we could still implement
Andres' idea [1] to dispense with the isset array in inner nodes of the
indirect array type (now node128), since we can just test if the pointer is
null.

[1]
https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2022-09-28 06:18:36 Re: longfin and tamandua aren't too happy but I'm not sure why
Previous Message Kyotaro Horiguchi 2022-09-28 06:12:51 Re: Avoid memory leaks during base backups