Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-03-03 11:03:53
Message-ID: CAFBsxsFH3f6r3hSmXSX1bvsDe7EFGExa9Sf3Ysd5DKJ=C1bxgQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> > I think we're trying to solve the wrong problem here. I need to study
> > this more, but it seems that code that needs to stay within a memory
> > limit only needs to track what's been allocated in chunks within a
> > block, since writing there is what invokes a page fault.
>
> Right. I guess we've discussed what we use for calculating the *used*
> memory amount but I don't remember.
>
> I think I was confused by the fact that we use some different
> approaches to calculate the amount of used memory. Parallel hash and
> tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
> in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
> allocated block size.

That's good to know. The latter says:

 * After adding a new group to the hash table, check whether we need to enter
 * spill mode. Allocations may happen without adding new groups (for instance,
 * if the transition state size grows), so this check is imperfect.

I'm willing to claim that vacuum can be imperfect also, given the tid
store's properties: 1) on average much more efficient in used space, and 2)
no longer bound by the 1GB limit.

> > I'm not sure how this affects progress reporting, because it would be
> > nice if it didn't report dead_tuple_bytes bigger than
> > max_dead_tuple_bytes.
>
> Yes, the progress reporting could be confusing. Particularly, in
> shared tidstore cases, the dead_tuple_bytes could be much bigger than
> max_dead_tuple_bytes. What we probably need are functions for
> MemoryContext and dsa_area to get the amount of memory that has been
> allocated, without tracking every chunk's space. For example, the
> functions would be like what SlabStats() does; iterate over every
> block and calculate the total/free memory usage.

I'm not sure we need to invent new infrastructure for this. Looking at v29
in vacuumlazy.c, the order of operations for memory accounting is:

First, get the block-level space -- stop and vacuum indexes if we exceed
the limit:

/*
 * Consider if we definitely have enough space to process TIDs on page
 * already. If we are close to overrunning the available space for
 * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
 * this page.
 */
if (TidStoreIsFull(vacrel->dead_items))

...which is basically "if (TidStoreMemoryUsage(ts) > ts->control->max_bytes)".

Then, after pruning the current page, store the tids and then get the
block-level space again:

else if (prunestate.num_offsets > 0)
{
	/* Save details of the LP_DEAD items from the page in dead_items */
	TidStoreSetBlockOffsets(...);

	pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
								 TidStoreMemoryUsage(dead_items));
}

Since the block-level measurement likely overestimates quite a bit, I
propose simply reversing the order of the actions here, effectively
reporting progress for the *last* page rather than the current one: first
update progress with the current memory usage, then add the TIDs for this
page. If that allocates a new block, only a small part of it will actually
be written to. If the new block pushes usage over the limit, we will detect
that at the top of the loop. It's kind of like our earlier attempts at a
"fudge factor", but simpler and less brittle. And, as far as OS pages we
have actually written to, I think it will effectively respect the memory
limit, at least in the local-memory case. And the numbers will make sense.

Thoughts?

But now that I'm looking more closely at the details of memory accounting,
I don't like that TidStoreMemoryUsage() is called twice per page pruned
(see above). Maybe it wouldn't noticeably slow things down, but it's a bit
sloppy. It seems like we should call it once per loop and save the result
somewhere. If that's the right way to go, that possibly indicates that
TidStoreIsFull() is not a useful interface, at least in this form.
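Roughly what I have in mind, as a non-compilable sketch against the v29 names (the local variable and the open-coded limit comparison are just for illustration):

```c
/* sketch only -- one TidStoreMemoryUsage() call per loop iteration */
for (each block in the relation)
{
	size_t		dead_items_bytes = TidStoreMemoryUsage(vacrel->dead_items);

	/* replaces the TidStoreIsFull() call at the top of the loop */
	if (dead_items_bytes > max_bytes)
		/* pause and do a cycle of index vacuuming */ ;

	/* ... prune the page ... */

	/* report *last* page's usage, then store this page's TIDs */
	pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
								 dead_items_bytes);
	TidStoreSetBlockOffsets(...);
}
```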

--
John Naylor
EDB: http://www.enterprisedb.com
