Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-03-20 14:34:08
Message-ID: CAD21AoDKr=4YHphy6cRojE5eyT6E2ao8xb44E309eTrUEOC6xw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 20, 2023 at 9:34 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
>
> On Mon, Mar 20, 2023 at 12:25 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Fri, Mar 17, 2023 at 4:03 PM John Naylor
> > > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > > >
> > > > On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > > > > <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > > > > >
> > > > > > I wrote:
> > > > > >
> > > > > > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > > >
> > > > > > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
> > > >
> > > > > IIUC you suggested measuring memory usage by tracking how much memory
> > > > > chunks are allocated within a block. If your idea at the top of the
> > > > > page follows this method, it still doesn't deal with the point Andres
> > > > > mentioned.
> > > >
> > > > Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
> > >
> > > Right. I still like your re-ordering idea. It's true that the most
> > > area of the last allocated block before heap scanning stops is not
> > > actually used yet. I'm guessing we can just check if the context
> > > memory has gone over the limit. But I'm concerned it might not work
> > > well in systems where overcommit memory is disabled.
> > >
> > > >
> > > > However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
> >
> > aset.c also has a similar characteristic; allocates an 8K block upon
> > the first allocation in a context, and doubles that size for each
> > successive block request. But we can specify the initial block size
> > and max blocksize. This made me think of another idea to specify both
> > to DSA and both values are calculated based on m_w_m. For example, we
>
> That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.

I've attached a quick hack patch. It can be applied on top of the v32
patches. The changes to dsa.c are straightforward: they just make the
initial and max block sizes configurable. The patch includes a test
function, test_memory_usage(), to simulate how DSA segments grow behind
the shared radix tree. If we set the first argument to true, it
calculates both the initial and maximum block sizes based on work_mem (I
used work_mem here just because its value range is larger than m_w_m):

postgres(1:833654)=# select test_memory_usage(true);
NOTICE: memory limit 134217728
NOTICE: init 1048576 max 16777216
NOTICE: initial: 1048576
NOTICE: rt_create: 1048576
NOTICE: allocate new DSM [1] 1048576
NOTICE: allocate new DSM [2] 2097152
NOTICE: allocate new DSM [3] 2097152
NOTICE: allocate new DSM [4] 4194304
NOTICE: allocate new DSM [5] 4194304
NOTICE: allocate new DSM [6] 8388608
NOTICE: allocate new DSM [7] 8388608
NOTICE: allocate new DSM [8] 16777216
NOTICE: allocate new DSM [9] 16777216
NOTICE: allocate new DSM [10] 16777216
NOTICE: allocate new DSM [11] 16777216
NOTICE: allocate new DSM [12] 16777216
NOTICE: allocate new DSM [13] 16777216
NOTICE: allocate new DSM [14] 16777216
NOTICE: reached: 148897792 (+14680064)
NOTICE: 12718205 keys inserted: 148897792
test_memory_usage
-------------------

(1 row)

Time: 7195.664 ms (00:07.196)

Setting the first argument to false, we can specify both manually in
the second and third arguments:

postgres(1:833654)=# select test_memory_usage(false, 1024 * 1024, 1024 * 1024 * 1024 * 10::bigint);
NOTICE: memory limit 134217728
NOTICE: init 1048576 max 10737418240
NOTICE: initial: 1048576
NOTICE: rt_create: 1048576
NOTICE: allocate new DSM [1] 1048576
NOTICE: allocate new DSM [2] 2097152
NOTICE: allocate new DSM [3] 2097152
NOTICE: allocate new DSM [4] 4194304
NOTICE: allocate new DSM [5] 4194304
NOTICE: allocate new DSM [6] 8388608
NOTICE: allocate new DSM [7] 8388608
NOTICE: allocate new DSM [8] 16777216
NOTICE: allocate new DSM [9] 16777216
NOTICE: allocate new DSM [10] 33554432
NOTICE: allocate new DSM [11] 33554432
NOTICE: allocate new DSM [12] 67108864
NOTICE: reached: 199229440 (+65011712)
NOTICE: 12718205 keys inserted: 199229440
test_memory_usage
-------------------

(1 row)

Time: 7187.571 ms (00:07.188)

It seems to work fine. The difference between the above two cases is
the maximum block size (16MB vs. 10GB). We allocated two more DSA
segments in the first case, but there was no big difference in
performance in my test environment.
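
For readers who want to check the totals by hand, here is a minimal
sketch (in Python, not part of the patch) of the growth pattern visible
in the NOTICE output above. It assumes that each segment size is used
for two segments before doubling, that the size is capped at the
maximum, and that allocation stops once the running total passes the
limit. The function name simulate_segments and its parameters are
hypothetical and are not taken from dsa.c or the attached patch.

```python
def simulate_segments(limit, init_size, max_size, per_size=2):
    """Return the segment sizes allocated until their total exceeds `limit`.

    Assumptions (inferred from the NOTICE output, not read from dsa.c):
    - the first segment has `init_size` bytes,
    - every `per_size` segments the size doubles,
    - no segment is larger than `max_size`,
    - allocation stops as soon as the total exceeds `limit`.
    """
    sizes = [init_size]  # the initial segment created with the area
    while sum(sizes) <= limit:
        doublings = len(sizes) // per_size
        sizes.append(min(init_size << doublings, max_size))
    return sizes


limit = 128 * 1024 * 1024  # 134217728, the "memory limit" in the output

capped = simulate_segments(limit, 1024 * 1024, 16 * 1024 * 1024)
uncapped = simulate_segments(limit, 1024 * 1024, 10 * 1024 ** 3)

for name, sizes in (("max 16MB", capped), ("max 10GB", uncapped)):
    total = sum(sizes)
    print(f"{name}: {len(sizes) - 1} new segments, "
          f"total {total} (+{total - limit})")
```

Under these assumptions the totals match the "reached" lines in both
runs: the 16MB cap overshoots the 128MB limit by 14MB, while the
effectively uncapped run overshoots by 62MB because the last segment
alone is 64MB.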

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment: dsa_init_max_block_size.patch.txt (text/plain, 13.0 KB)
