Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-02-14 11:24:38
Message-ID: CAFBsxsESfmZpWwk3XxWkpdg_DjawH2vizKW91fyawQ+wVFkOiQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 13, 2023 at 2:51 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
>
> On Sat, Feb 11, 2023 at 2:33 PM John Naylor
> <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> > I didn't get any closer to radix-tree regression,
>
> Me neither. It seems that in v26, inserting chunks into node-32 is
> slow, but it needs more analysis. I'll share if I find something
> interesting.

If that were the case, then the other benchmarks I ran would likely have
slowed down as well, but they are the same or faster. There is one
microbenchmark I didn't run before: "select * from
bench_fixed_height_search(15)" (15 to reduce noise from growing size class,
and despite the name it measures load time as well). Trying this now shows
no difference: a few runs range from 19 to 21 ms in each version. That also
reinforces that update_inner is fine and that the move to the value pointer
API didn't regress.

Changing TIDS_PER_BLOCK_FOR_LOAD to 1 to stress the tree more gives (min of
5 runs; perf run separately from the timing measurements):

v15 + v26 store:

mem_allocated | load_ms
---------------+---------
98202152 | 553

19.71% postgres postgres [.] tidstore_add_tids
+ 31.47% postgres postgres [.] rt_set
= 51.18%

20.62% postgres postgres [.] rt_node_insert_leaf
6.05% postgres postgres [.] AllocSetAlloc
4.74% postgres postgres [.] AllocSetFree
4.62% postgres postgres [.] palloc
2.23% postgres postgres [.] SlabAlloc

v26:

mem_allocated | load_ms
---------------+---------
98202032 | 617

57.45% postgres postgres [.] tidstore_add_tids

20.67% postgres postgres [.] local_rt_node_insert_leaf
5.99% postgres postgres [.] AllocSetAlloc
3.55% postgres postgres [.] palloc
3.05% postgres postgres [.] AllocSetFree
2.05% postgres postgres [.] SlabAlloc

So it seems the store itself got faster when we removed shared memory paths
from the v26 store to test it against v15.

I thought to favor the local-memory case in the tidstore by controlling
inlining -- it's smaller and will be called much more often, so I tried the
following (done in 0007):

#define RT_PREFIX shared_rt
#define RT_SHMEM
-#define RT_SCOPE static
+#define RT_SCOPE static pg_noinline

That brings it down to

mem_allocated | load_ms
---------------+---------
98202032 | 590

That's better, but still not within the noise level. Perhaps some slowdown
is unavoidable, but it would be nice to understand why.

> I can imagine that something like traversing a HOT chain could visit
> offsets out of order. But fortunately, in the heap case, we prune such
> collected TIDs before heap vacuum.

Further, currently we *already* assume we populate the tid array in order
(for binary search), so we can just continue assuming that (with an assert
added since it's more public in this form). I'm not sure why such basic
common sense evaded me a few versions ago...

> > If these are acceptable, I can incorporate them into a later patchset.
>
> These are nice improvements! I agree with all changes.

Great, I've squashed these into the tidstore patch (0004). Also added 0005,
which is just a simplification.

I squashed the earlier dead code removal into the radix tree patch.

v27-0008 measures tid store iteration performance and adds a stub function
to prevent spurious warnings, so the benchmarking module can always be
built.

Getting the list of offsets from the old array for a given block is always
trivial, but tidstore_iter_extract_tids() is doing a huge amount of
unnecessary work when TIDS_PER_BLOCK_FOR_LOAD is 1, enough to exceed the
load time:

mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 915

Fortunately, it's an easy fix, done in 0009.

mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 153

I'll soon resume more cosmetic review of the tid store, but this is enough
to post.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v27-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch text/x-patch 2.9 KB
v27-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch text/x-patch 6.1 KB
v27-0005-Do-bitmap-conversion-in-one-place-rather-than-fo.patch text/x-patch 3.8 KB
v27-0003-Add-radixtree-template.patch text/x-patch 116.8 KB
v27-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch text/x-patch 35.5 KB
v27-0006-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch text/x-patch 23.9 KB
v27-0008-Measure-iteration-of-tidstore.patch text/x-patch 3.7 KB
v27-0007-Prevent-inlining-of-interface-functions-for-shme.patch text/x-patch 750 bytes
v27-0009-Speed-up-tidstore_iter_extract_tids.patch text/x-patch 1.3 KB
