Re: Vacuum: allow usage of more than 1GB of work mem

From: Andres Freund <andres(at)anarazel(dot)de>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Vacuum: allow usage of more than 1GB of work mem
Date: 2017-04-08 01:12:52
Message-ID: 20170408011252.36w6tzgruxw7qn5t@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-04-07 22:06:13 -0300, Claudio Freire wrote:
> On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Hi,
> >
> >
> > On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
> >> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> > Hi,
> >> >
> >> > I've *not* read the history of this thread. So I really might be
> >> > missing some context.
> >> >
> >> >
> >> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
> >> >> From: Claudio Freire <klaussfreire(at)gmail(dot)com>
> >> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
> >> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
> >> >>
> >> >> Turn the dead_tuples array into a structure composed of several
> >> >> exponentially bigger arrays, to enable usage of more than 1GB
> >> >> of work mem during vacuum and thus reduce the number of full
> >> >> index scans necessary to remove all dead tids when the memory is
> >> >> available.
> >> >
> >> >> * We are willing to use at most maintenance_work_mem (or perhaps
> >> >> * autovacuum_work_mem) memory space to keep track of dead tuples. We
> >> >> - * initially allocate an array of TIDs of that size, with an upper limit that
> >> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
> >> >> * depends on table size (this limit ensures we don't allocate a huge area
> >> >> - * uselessly for vacuuming small tables). If the array threatens to overflow,
> >> >> - * we suspend the heap scan phase and perform a pass of index cleanup and page
> >> >> - * compaction, then resume the heap scan with an empty TID array.
> >> >> + * uselessly for vacuuming small tables). Additional arrays of increasingly
> >> >> + * large sizes are allocated as they become necessary.
> >> >> + *
> >> >> + * The TID array is thus represented as a list of multiple segments of
> >> >> + * varying size, beginning with the initial size of up to 128MB, and growing
> >> >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
> >> >> + * is used up.
> >> >
> >> > When the chunk size is 128MB, I'm a bit unconvinced that using
> >> > exponential growth is worth it. The allocator overhead can't be
> >> > meaningful in comparison to collecting 128MB dead tuples, the potential
> >> > waste is pretty big, and it increases memory fragmentation.
> >>
> >> The exponential strategy is mainly to improve lookup time (ie: to
> >> avoid large segment lists).
> >
> > Well, if we were to do binary search on the segment list, that'd not be
> > necessary.
>
> True, but the initial lookup might be slower in the end, since the
> array would be bigger and cache locality worse.
>
> Why do you say exponential growth fragments memory? AFAIK, all those
> allocations are well beyond the point where malloc starts mmaping
> memory, so each of those segments should be a mmap segment,
> independently freeable.

Not all platforms have that, and even on platforms with it, frequent,
unevenly sized, very large allocations can lead to enough fragmentation
that further allocations are harder and fragment / enlarge the
pagetable.

> >> Yes, the benchmarks are upthread. The earlier runs were run on my
> >> laptop and made little sense, so I'd ignore them as inaccurate. The
> >> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU
> >> time (ie: faster) of about 20%. Anastasia did another one[2] and saw
> >> improvements as well, roughly 30%, though it's not measuring CPU time
> >> but rather elapsed time.
> >
> > I'd be more concerned about cases that'd already fit into memory, not ones
> > where we avoid doing another scan - and I think you mostly measured that?
> >
> > - Andres
>
> Well, scale 400 is pretty much as big as you can get with the old 1GB
> limit, and also suffered no significant regression. Although, true, it
> didn't significantly improve either.

Aren't more interesting cases those where not that many dead tuples are
found, but the indexes are pretty large? IIRC the index vacuum scans
still visit every leaf index tuple, no?

Greetings,

Andres Freund
