Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-09-21 06:17:21
Message-ID: CAFBsxsGCW=CfgzatJOQ9iuz5UbrMaqPNpP7UZ-ON+2Pgh=zCwg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 20, 2022 at 3:19 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
>
> On Fri, Sep 16, 2022 at 4:54 PM John Naylor
> <john(dot)naylor(at)enterprisedb(dot)com> wrote:

> > Here again, I'd rather put this off and focus on getting the "large
> > details" in good enough shape so we can got towards integrating with
> > vacuum.
>
> Thank you for the comments! These above comments are addressed by
> Nathan in a newly derived thread. I'll work on the patch.

I still seem to be out-voted on when to tackle this particular
optimization, so I've extended the v6 benchmark code with a hackish
function that populates a fixed number of keys, but with different fanouts.
(diff attached as a text file)

I didn't take particular care to make this scientific, but the following
seems pretty reproducible. Note what happens to load and search performance
when node16 has 15 entries versus 16:

fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
15 | 327680 | 3776512 | 39 | 20
(1 row)
num_keys = 327680, height = 4, n4 = 1, n16 = 23408, n32 = 0, n128 = 0, n256
= 0

fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
16 | 327680 | 3514368 | 25 | 11
(1 row)
num_keys = 327680, height = 4, n4 = 0, n16 = 21846, n32 = 0, n128 = 0, n256
= 0

In trying to wrap the SIMD code behind layers of abstraction, the latest
patch (and Nathan's cleanup) threw it away in almost all cases. To explain,
we need to talk about how vectorized code deals with the "tail" that is too
small for the register:

1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the
result.

There are advantages to both depending on the situation.

Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15
elements or less, it will iterate over them one-by-one exactly like a
node4. Only when full with 16 will the vector path be taken. When another
entry is added, the elements are copied to the next bigger node, so there's
a *small* window where it's fast.

In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.

Since v6 has some good infrastructure to do low-level benchmarking, I also
want to do some experiments with memory management.

(I have further comments about the code, but I will put that off until
later)

> I'll consider how to integrate with vacuum as the next step. One
> concern for me is how to limit the memory usage to
> maintenance_work_mem. Unlike using a flat array, memory space for
> adding one TID varies depending on the situation. If we want strictly
> not to allow using memory more than maintenance_work_mem, probably we
> need to estimate the memory consumption in a conservative way.

+1

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v6addendum-bench-node16.diff.txt text/plain 6.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Guo 2022-09-21 06:28:08 Re: make additional use of optimized linear search routines
Previous Message Fujii Masao 2022-09-21 05:40:34 Re: Fix snapshot name for SET TRANSACTION documentation