Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-03-17 07:02:53
Message-ID: CAFBsxsGiiyY+wykVLBbN9hFUMiNHqEr_Kqg9Mpc=uv4sg8eagQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
>
> On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> >
> > I wrote:
> >
> > > > > Since the block-level measurement is likely overestimating quite
> > > > > a bit, I propose to simply reverse the order of the actions here,
> > > > > effectively reporting progress for the *last page* and not the
> > > > > current one: First update progress with the current memory usage,
> > > > > then add tids for this page. If this allocated a new block, only
> > > > > a small bit of that will be written to. If this block pushes it
> > > > > over the limit, we will detect that up at the top of the loop.
> > > > > It's kind of like our earlier attempts at a "fudge factor", but
> > > > > simpler and less brittle. And, as far as OS pages we have actually
> > > > > written to, I think it'll effectively respect the memory limit, at
> > > > > least in the local mem case. And the numbers will make sense.

> > I still like my idea at the top of the page -- at least for vacuum and
> > m_w_m. It's still not completely clear if it's right but I've got
> > nothing better. It also ignores the work_mem issue, but I've given up
> > anticipating all future cases at the moment.

> IIUC you suggested measuring memory usage by tracking how much memory
> chunks are allocated within a block. If your idea at the top of the
> page follows this method, it still doesn't deal with the point Andres
> mentioned.

Right, but that idea was orthogonal to how we measure memory use, and in
fact mentions blocks specifically. The re-ordering was just to make sure
that progress reporting didn't show current-use > max-use.
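
To be concrete, this is roughly the ordering I mean in the heap-scan loop.
The identifiers are only illustrative and don't necessarily match the patch:

/* Sketch only -- names are illustrative, not the actual patch code. */
for (blkno = 0; blkno < nblocks; blkno++)
{
    /*
     * Top of the loop: if the previous page pushed us past the limit,
     * do an index vacuum cycle before adding anything else.
     */
    if (TidStoreMemoryUsage(dead_items) > (Size) maintenance_work_mem * 1024)
    {
        /* ... vacuum indexes and heap, then reset dead_items ... */
    }

    /* Report progress for the memory used up to the *last* page. */
    pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
                                 TidStoreMemoryUsage(dead_items));

    /*
     * Only now add this page's dead tids.  A block allocated here is
     * only partially written to, and gets accounted for on the next
     * iteration.
     */
    TidStoreSetBlockOffsets(dead_items, blkno, offsets, noffsets);
}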

However, the big question remains DSA, since a new segment can be as large
as the entire previous set of allocations. It seems it just wasn't designed
for things where memory growth is unpredictable.

I'm starting to wonder if we need to give DSA a bit more info at the start.
Imagine a "soft" limit given to the DSA area when it is initialized. If the
total segment usage exceeds this, it stops doubling and instead new
segments get smaller. Modifying an example we used for the fudge-factor
idea some time ago:

m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to the DSA
area.

2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so
"stairstep down" the new segment sizes:

766 + 2*(128) + 64 = 1086MB -> stop
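
In other words, something like this hypothetical logic when choosing a new
segment's size (no such soft limit exists in dsa.c today, and this glosses
over the fact that DSA creates more than one segment at each size, which is
where the 2* factors above come from):

/* Hypothetical helper, not existing DSA API. */
static size_t
choose_new_segment_size(size_t total_size, size_t prev_segment_size,
                        size_t soft_limit, size_t min_segment_size)
{
    /* Below the soft limit, behave as now: keep doubling. */
    if (total_size < soft_limit)
        return prev_segment_size * 2;

    /* Past the soft limit, stairstep down rather than up. */
    return Max(prev_segment_size / 2, min_segment_size);
}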

That's just an undeveloped idea, however, so likely v17 development, even
assuming it's not a bad idea (could be).

And sadly, unless we find some other, simpler answer soon for tracking and
limiting shared memory, the tid store is looking like v17 material.

--
John Naylor
EDB: http://www.enterprisedb.com
