Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-03-06 06:27:44
Message-ID: CAD21AoAYDcPFrYwHphnPHVXqJODKtxJRM_u8fgztRP06d31f1A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 3, 2023 at 8:04 PM John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > >
> > > I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.
> >
> > Right. I guess we've discussed what to use for calculating the *used*
> > memory amount, but I don't remember.
> >
> > I think I was confused by the fact that we use different approaches
> > to calculating the amount of used memory. Parallel hash and tidbitmap
> > use the allocated chunk size, whereas hash_agg_check_limits() in
> > nodeAgg.c uses MemoryContextMemAllocated(), which uses the allocated
> > block size.
>
> That's good to know. The latter says:
>
> * After adding a new group to the hash table, check whether we need to enter
> * spill mode. Allocations may happen without adding new groups (for instance,
> * if the transition state size grows), so this check is imperfect.
>
> I'm willing to claim that vacuum can be imperfect also, given the tid store's properties: 1) on average much more efficient in used space, and 2) no longer bound by the 1GB limit.
>
> > > I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
> >
> > Yes, the progress reporting could be confusing. In particular, in
> > the shared tidstore case, dead_tuple_bytes could be much bigger than
> > max_dead_tuple_bytes. What we probably need are functions for
> > MemoryContext and dsa_area that return the amount of allocated
> > memory without tracking every chunk. For example, the functions
> > could work like SlabStats() does: iterate over every block and
> > calculate the total/free memory usage.
>
> I'm not sure we need to invent new infrastructure for this. Looking at v29 in vacuumlazy.c, the order of operations for memory accounting is:
>
> First, get the block-level space -- stop and vacuum indexes if we exceed the limit:
>
> /*
> * Consider if we definitely have enough space to process TIDs on page
> * already. If we are close to overrunning the available space for
> * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
> * this page.
> */
> if (TidStoreIsFull(vacrel->dead_items)) --> which is basically "if (TidStoreMemoryUsage(ts) > ts->control->max_bytes)"
>
> Then, after pruning the current page, store the tids and then get the block-level space again:
>
> else if (prunestate.num_offsets > 0)
> {
> /* Save details of the LP_DEAD items from the page in dead_items */
> TidStoreSetBlockOffsets(...);
>
> pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
> TidStoreMemoryUsage(dead_items));
> }
>
> Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
>
> Thoughts?

That looks like it would work, but it still doesn't help the case
where a shared tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB from the start, so TidStoreIsFull()
returns true before we have stored anything.
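
A minimal sketch of what I mean (TidStoreIsShared() and the field
names are my guesses at the patch's internals; I'm assuming the shared
case reports dsa_get_total_size()):

size_t
TidStoreMemoryUsage(TidStore *ts)
{
	if (TidStoreIsShared(ts))
	{
		/*
		 * dsa_get_total_size() can never report less than the initial
		 * 1MB DSA segment (DSA_INITIAL_SEGMENT_SIZE), even for an
		 * empty store.
		 */
		return dsa_get_total_size(ts->area);
	}

	/* Local case: block-level usage of the store's memory context. */
	return MemoryContextMemAllocated(ts->context, true);
}

So with max_bytes = 64kB, "TidStoreMemoryUsage(ts) > max_bytes" holds
from the very first call.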

BTW I realized that since the caller can pass a dsa_area to the
tidstore (and the radix tree), if other data is allocated in the same
DSA area, TidStoreMemoryUsage() (and RT_MEMORY_USAGE()) returns memory
usage that includes not only the tidstore itself but also that other
data. It's probably better to comment that the passed dsa_area should
be dedicated to the tidstore (or radix tree).
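
For example, something like this above the creation function (the
signature below is just for illustration, written from memory):

/*
 * Create a TidStore.  If "area" is non-NULL, the TidStore is created
 * in that DSA area; otherwise it is allocated in local memory.
 *
 * Note: the caller must pass a dsa_area dedicated to this TidStore
 * (and its radix tree).  TidStoreMemoryUsage() reports the total size
 * of the DSA area, so any other data allocated in the same area would
 * be counted against max_bytes as well.
 */
TidStore *
TidStoreCreate(size_t max_bytes, int max_offset, dsa_area *area);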

>
> But now that I'm looking more closely at the details of memory accounting, I don't like that TidStoreMemoryUsage() is called twice per page pruned (see above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy. It seems like we should call it once per loop and save the result somewhere. If that's the right way to go, that possibly indicates that TidStoreIsFull() is not a useful interface, at least in this form.

Agreed.
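
Something along these lines, perhaps (just a sketch; max_dead_bytes
stands in for however the limit ends up being exposed, and the other
names follow your v29 snippets):

/* Fetch the usage once, at the top of the per-block loop ... */
size_t		dead_items_mem = TidStoreMemoryUsage(vacrel->dead_items);

if (dead_items_mem > max_dead_bytes)
{
	/* ... pause and do a cycle of index vacuuming ... */
}

/*
 * ... and reuse the saved value for progress reporting.  With your
 * reordering this reports the memory used through the previous page.
 */
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
							 dead_items_mem);

/* Only then add this page's TIDs. */
if (prunestate.num_offsets > 0)
	TidStoreSetBlockOffsets(vacrel->dead_items, blkno,
							prunestate.deadoffsets,
							prunestate.num_offsets);

That would also remove the need for TidStoreIsFull() as a separate
interface, as you suggest.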

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
