Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-10-05 06:45:38
Message-ID: CAD21AoBLDXhAcN8raoDDrK=SxNF=WHUCOxiBD+cLXUDMx5k9pg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> >
> > On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > > One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in
> >
> > > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
> >
> > While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit wasteful, taking up 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to use an enum here in the struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield for the count and kind:
> >
> > uint16 -- kind and count bitfield
> > uint8 shift;
> > uint8 chunk;
> >
> > That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being count only.
>
> Good point, agreed.
>
> >
> > Here are the v6 node kinds:
> >
> > node4: 8 + 4 +(4) + 4*8 = 48 bytes
> > node16: 8 + 16 + 16*8 = 152
> > node32: 8 + 32 + 32*8 = 296
> > node128: 8 + 256 + 128/8 + 128*8 = 1304
> > node256: 8 + 256/8 + 256*8 = 2088
> >
> > And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding bytes. Even if my math has mistakes, the numbers shouldn't be too far off:
> >
> > node3: 4 + 3 +(1) + 3*8 = 32 bytes
> > node6: 4 + 6 +(6) + 6*8 = 64
> > node13: 4 + 13 +(7) + 13*8 = 128
> > node28: 4 + 28 + 28*8 = 256
> > node31: 4 + 256 + 32/8 + 31*8 = 512 (XXX not good)
> > node94: 4 + 256 + 96/8 + 94*8 = 1024
> > node220: 4 + 256 + 224/8 + 220*8 = 2048
> > node256: = 4096
> >
> > The main disadvantage is that node256 would balloon in size.
>
> Yeah, node31 and node256 are bloated. We probably could use slab for
> node256 independently. It's worth trying a benchmark to see how it
> affects the performance and the tree size.
>
> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> sizeof(dsa_area_span), 0, /* special size classes */
> 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
> 80, 96, 112, 128, /* 4 classes separated by 16 bytes */
> 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
> 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
> 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
> 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
> 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
> 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
> };
>
> node256 will be classed as 2616, which is still not good.
>
> Anyway, I'll implement DSA support for radix tree.
>

Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to its child nodes, instead of C pointers (ig, backend-local
address). I'm thinking of a straightforward approach as the first
step; inner nodes have a union of rt_node* and dsa_pointer and we
choose either one based on whether the radix tree is shared or not. We
allocate and free the shared memory for individual nodes by
dsa_allocate() and dsa_free(), respectively. Therefore we need to get
a C pointer from dsa_pointer by using dsa_get_address() while
descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be performance overhead but
I'm going to measure it anyway.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2022-10-05 07:24:43 Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
Previous Message Peter Eisentraut 2022-10-05 06:40:59 Re: [RFC] building postgres with meson - v13