Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-02-13 07:50:57
Message-ID: CAD21AoDqO7v5cvFUyWD7_pwceJ1XfhA81FVLTspKn0sFxtqNPA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 11, 2023 at 2:33 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> I didn't get any closer to radix-tree regression,

Me neither. It seems that in v26, inserting chunks into node-32 is
slow, but it needs more analysis. I'll share if I find anything
interesting.

> but I did find some inefficiencies in tidstore_add_tids() that are worth talking about first, addressed in a rough fashion in the attached .txt addendums that I can clean up and incorporate later.
>
> To start, I can reproduce the regression with this test as well:
>
> select * from bench_tidstore_load(0, 10 * 1000 * 1000);
>
> v15 + v26 store + adjustments:
>  mem_allocated | load_ms
> ---------------+---------
>       98202152 |    1676
>
> v26 0001-0008
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1826
>
> ...and reverting to the alternate way to update the parent didn't help:
>
> v26 0001-6, 0008, insert_inner w/ null parent
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1825
>
> ...and I'm kind of glad that wasn't the problem, because going back to that would be a pain for the shmem case.
>
> Running perf doesn't show anything much different in the proportions (note that rt_set must have been inlined when declared locally in v26):
>
> v15 + v26 store + adjustments:
> 65.88% postgres postgres [.] tidstore_add_tids
> 10.74% postgres postgres [.] rt_set
> 9.20% postgres postgres [.] palloc0
> 6.49% postgres postgres [.] rt_node_insert_leaf
>
> v26 0001-0008
> 78.50% postgres postgres [.] tidstore_add_tids
> 8.88% postgres postgres [.] palloc0
> 6.24% postgres postgres [.] local_rt_node_insert_leaf
>
> v2699-0001: The first thing I noticed is that palloc0 is taking way more time than it should, and it's because the compiler doesn't know the values[] array is small. One reason we need to zero the array is to make the algorithm agnostic about what order the offsets come in, as I requested in a previous review. Thinking some more, I was way too paranoid about that. As long as access methods scan the line pointer array in the usual way, maybe we can just assert that the keys we create are in order, and zero any unused array entries as we find them. (I admit I can't actually think of a reason we would ever encounter offsets out of order.)

I can imagine that something like traversing a HOT chain could visit
offsets out of order. But fortunately, in the heap case, we prune such
collected TIDs before heap vacuum.

> Also, we can keep track of the last key we need to consider for insertion into the radix tree, and ignore the rest. That might shave a few cycles during the exclusive lock when the max offset of an LP_DEAD item < 64 on a given page, which I think would be common in the wild. I also got rid of the special case for non-encoding, since shifting by zero should work the same way. These together led to a nice speedup on the v26 branch:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1386
>
> v2699-0002: The next thing I noticed is forming a full ItemIdPointer to pass to tid_to_key_off(). That's bad for tidstore_add_tids() because ItemPointerSetBlockNumber() must do this in order to allow the struct to be SHORTALIGN'd:
>
> static inline void
> BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
> {
>     blockId->bi_hi = blockNumber >> 16;
>     blockId->bi_lo = blockNumber & 0xffff;
> }
>
> Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse the above process:
>
> static inline BlockNumber
> BlockIdGetBlockNumber(const BlockIdData *blockId)
> {
>     return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber) blockId->bi_lo);
> }
>
> There is no reason to do any of this if we're not reading/writing directly to/from an on-disk tid etc. To avoid this, I created a new function encode_key_off() [name could be better], which deals with the raw block number that we already have. Then turn tid_to_key_off() into a wrapper around that, since we still need the full conversion for tidstore_lookup_tid().
>
> v2699-0003: Get rid of all the remaining special cases for encoding/or not. I am unaware of the need to optimize that case or treat it in any way differently. I haven't tested this on an installation with non-default blocksize and didn't measure this separately, but 0002+0003 gives:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1259
>
> If these are acceptable, I can incorporate them into a later patchset.

These are nice improvements! I agree with all changes.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
