Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-02-11 05:33:42
Message-ID: CAFBsxsFps1MReb3HQYEGc=M=1NXRjSu3TSaEVEXkVrCR3g-hxg@mail.gmail.com
Lists: pgsql-hackers

I didn't get any closer to the radix-tree regression, but I did find some
inefficiencies in tidstore_add_tids() that are worth talking about first.
They are addressed in rough form in the attached .txt addenda, which I can
clean up and incorporate later.

To start, I can reproduce the regression with this test as well:

select * from bench_tidstore_load(0, 10 * 1000 * 1000);

v15 + v26 store + adjustments:
 mem_allocated | load_ms
---------------+---------
      98202152 |    1676

v26 0001-0008
 mem_allocated | load_ms
---------------+---------
      98202032 |    1826

...and reverting to the alternate way to update the parent didn't help:

v26 0001-0006, 0008, insert_inner w/ null parent

 mem_allocated | load_ms
---------------+---------
      98202032 |    1825

...and I'm kind of glad that wasn't the problem, because going back to that
would be a pain for the shmem case.

Running perf doesn't show anything much different in the proportions (note
that rt_set must have been inlined when declared locally in v26, which is
why it doesn't appear in the second profile):

v15 + v26 store + adjustments:
65.88% postgres postgres [.] tidstore_add_tids
10.74% postgres postgres [.] rt_set
9.20% postgres postgres [.] palloc0
6.49% postgres postgres [.] rt_node_insert_leaf

v26 0001-0008
78.50% postgres postgres [.] tidstore_add_tids
8.88% postgres postgres [.] palloc0
6.24% postgres postgres [.] local_rt_node_insert_leaf

v2699-0001: The first thing I noticed is that palloc0 takes way more time
than it should, because the compiler doesn't know the values[] array is
small. One reason we need to zero the array is to make the algorithm
agnostic about what order the offsets come in, as I requested in a previous
review. Thinking some more, I was way too paranoid about that. As long as
access methods scan the line pointer array in the usual way, maybe we can
just assert that the keys we create are in order, and zero any unused array
entries as we find them. (I admit I can't actually think of a reason we
would ever encounter offsets out of order.) Also, we can keep track of the
last key we need to consider for insertion into the radix tree, and ignore
the rest. That might shave a few cycles during the exclusive lock when the
max offset of an LP_DEAD item on a given page is < 64, which I think would
be common in the wild. I also got rid of the special case for non-encoding,
since shifting by zero works the same way. These together led to a nice
speedup on the v26 branch (a sketch of the zeroing idea follows the
numbers):

 mem_allocated | load_ms
---------------+---------
      98202032 |    1386
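
To make the lazy-zeroing idea concrete, here is a minimal sketch; the
function name, the 64-bit word size, and the parameters are made up for
illustration and don't come from the patch:

/*
 * Build the per-page bitmap words from an ordered offset array, zeroing
 * skipped words as we go and remembering the last word actually used.
 */
static void
build_page_bitmap(const OffsetNumber *offsets, int num_offsets,
                  uint64 *words, int *last_word_used)
{
    int     highest = -1;

    for (int i = 0; i < num_offsets; i++)
    {
        int     wordnum = offsets[i] / 64;

        /* rely on access methods scanning the line pointer array in order */
        Assert(i == 0 || offsets[i] > offsets[i - 1]);

        /* zero skipped words lazily instead of palloc0'ing up front */
        while (highest < wordnum)
            words[++highest] = 0;

        words[wordnum] |= UINT64CONST(1) << (offsets[i] % 64);
    }

    /* words past this index were never written; skip them at insert time */
    *last_word_used = highest;
}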

v2699-0002: The next thing I noticed is that we form a full ItemPointer
just to pass it to tid_to_key_off(). That's bad for tidstore_add_tids(),
because ItemPointerSetBlockNumber() must split the block number into two
16-bit halves so that the struct can be SHORTALIGN'd:

static inline void
BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
{
    blockId->bi_hi = blockNumber >> 16;
    blockId->bi_lo = blockNumber & 0xffff;
}

Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must
reverse the above process:

static inline BlockNumber
BlockIdGetBlockNumber(const BlockIdData *blockId)
{
    return (((BlockNumber) blockId->bi_hi) << 16) |
           ((BlockNumber) blockId->bi_lo);
}

There is no reason to do any of this if we're not reading or writing a tid
that lives on disk. To avoid it, I created a new function encode_key_off()
[name could be better], which deals with the raw block number that we
already have. Then tid_to_key_off() becomes a wrapper around that, since we
still need the full conversion for tidstore_lookup_tid().
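
Roughly like this -- the constants and the exact split between radix-tree
key and bitmap position are assumptions for illustration, not taken from
the patch:

/* all constants assumed for illustration */
#define TID_OFFSET_NBITS    11  /* enough bits for MaxOffsetNumber */
#define BITMAP_WORD_NBITS   6   /* 64 offsets per bitmap word */

/*
 * Form a radix tree key and a position in the per-key bitmap from a raw
 * block number and offset, with no BlockIdData round-trip.
 */
static inline uint64
encode_key_off(BlockNumber block, OffsetNumber offset, uint32 *off)
{
    uint64      tid_i = ((uint64) block << TID_OFFSET_NBITS) | offset;

    *off = tid_i & ((1 << BITMAP_WORD_NBITS) - 1);
    return tid_i >> BITMAP_WORD_NBITS;
}

/* wrapper for callers that really start from an ItemPointer */
static inline uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
{
    /* BlockIdGetBlockNumber()'s shift-and-or happens only on this path */
    return encode_key_off(ItemPointerGetBlockNumber(tid),
                          ItemPointerGetOffsetNumber(tid), off);
}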

v2699-0003: Get rid of all the remaining special cases for whether to
encode or not. I'm not aware of any need to optimize the non-encoding case
or treat it differently. I haven't tested this on an installation with a
non-default block size, and didn't measure 0003 separately, but 0002+0003
together give:

 mem_allocated | load_ms
---------------+---------
      98202032 |    1259
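
The reason forcing encoding is safe is simply that shifting by zero is the
identity; a toy illustration, with a hypothetical helper that is not in the
patch:

/*
 * With shift == 0 the encoding path degenerates to the old non-encoding
 * one, so a single code path suffices for any block size.
 */
static inline uint64
key_from_block(BlockNumber block, int shift)
{
    /* when shift == 0 this is simply "return block" */
    return (uint64) block << shift;
}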

If these are acceptable, I can incorporate them into a later patchset. In
any case, speeding up tidstore_add_tids() will make any regressions in the
backing radix tree more obvious. I will take a look at that next week.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v2699-0002-Do-less-work-when-encoding-key-value.patch.txt text/plain 2.7 KB
v2699-0001-Miscellaneous-optimizations-for-tidstore_add_t.patch.txt text/plain 3.0 KB
v2699-0003-Force-all-callers-to-encode-no-matter-how-smal.patch.txt text/plain 2.8 KB
