Re: Proposal: Another attempt at vacuum improvements

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 12:27:19
Message-ID: BANLkTi=hX8i3Ma4XpCCrOp-mEsi7zCpfhw@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 7:31 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Tue, May 24, 2011 at 7:58 AM, Pavan Deolasee
> <pavan(dot)deolasee(at)gmail(dot)com> wrote:
>
>> The biggest gripe today is that vacuum needs two heap scans and each scan
>> dirties the buffer.
>
> That's not that clear to me. The debate usually stalls because we
> don't have sufficient info from real world analysis of where the time
> goes.

It probably wouldn't be too hard to write a simple patch to measure
time spent during the first heap pass, time spent scanning indexes,
and time spent on the second heap pass. But it's not just about where
the time goes: as Pavan says, the second heap pass feels like a waste.
Actually, the time spent scanning the indexes kinda feels like a
waste too, if the number of tuples being removed is small. Which
brings me to your second point:
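[Editorial sketch of the per-phase measurement idea above, in Python rather than PostgreSQL's C; all names are hypothetical and this is not PostgreSQL's actual instrumentation API.]

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock time per vacuum phase (illustrative names only).
phase_times = {"heap_pass_1": 0.0, "index_scan": 0.0, "heap_pass_2": 0.0}

@contextmanager
def timed_phase(name):
    """Accumulate elapsed time for one named phase of a (mock) vacuum run."""
    start = time.monotonic()
    try:
        yield
    finally:
        phase_times[name] += time.monotonic() - start

# Usage: wrap each of the three phases being compared.
with timed_phase("heap_pass_1"):
    pass  # scan heap, prune, collect dead TIDs
with timed_phase("index_scan"):
    pass  # remove index entries pointing at the dead TIDs
with timed_phase("heap_pass_2"):
    pass  # mark the now-unreferenced dead line pointers unused
```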

>> So the idea is to separate the index vacuum (removing index pointers to dead
>> tuples) from the heap vacuum. When we do heap vacuum (either by HOT-pruning
>> or using regular vacuum), we can spool the dead line pointers somewhere.
>
> ISTM it will be complex to attempt to store the exact list of TIDs
> between VACUUMs.
>
> At the moment we scan indexes if we have > 0 rows to remove, which is
> probably wasteful. Perhaps it would be better to keep a running total
> of rows to remove, by updating pg_stats, then when we hit a certain
> threshold in total we can do the index scan. So we don't need to
> remember the TIDs, just remember how many there were and use that to
> avoid cleaning too vigorously.
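[Editorial sketch of the running-total idea quoted above: count removable rows across heap passes and only scan indexes once a threshold is crossed. Class and method names are hypothetical; the real counter would live in pg_stats, not in application code.]

```python
class DeferredIndexCleanup:
    """Defer index scans until enough dead rows have accumulated."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.pending_dead_rows = 0  # running total, as if tracked in pg_stats

    def note_heap_vacuum(self, dead_rows_found):
        # A heap pass happened; record its dead rows without touching indexes.
        self.pending_dead_rows += dead_rows_found

    def should_scan_indexes(self):
        # Scan indexes only past the threshold, not whenever the count is > 0.
        return self.pending_dead_rows >= self.threshold

    def index_scan_done(self):
        self.pending_dead_rows = 0

policy = DeferredIndexCleanup(threshold=1000)
policy.note_heap_vacuum(300)
assert not policy.should_scan_indexes()  # 300 dead rows: not worth it yet
policy.note_heap_vacuum(800)
assert policy.should_scan_indexes()      # 1100 >= 1000: now scan indexes
```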

That occurred to me, too. If we're being launched by autovacuum then
we know that a number of updates and deletes equal ~20% (or whatever
autovacuum_vacuum_scale_factor is set to) of the table size have
occurred since the last autovacuum. But it's possible that many of
those were HOT updates, in which case the number of index entries to
be cleaned up might be much less than 20% of the table size.
Alternatively, it's possible that we'd be better off vacuuming the
table more often (say, autovacuum_vacuum_scale_factor=0.10 or 0.08 or
something) but only doing the index scans every once in a while when
enough dead line pointers have accumulated. After all, it's the first
heap pass that frees up most of the space; cleaning dead line pointers
seems a bit less urgent. But not having done any real analysis of how
this would work out in practice, I'm not sure whether it's a good idea
or not.
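[Editorial toy model of the trade-off above: autovacuum's trigger counts all dead tuples, but after HOT pruning only the non-HOT dead tuples leave dead line pointers that need index cleanup, so a separate, deferred trigger for index scans could fire much less often. All numbers and the second trigger are hypothetical; only the autovacuum_due() formula mirrors the documented autovacuum trigger.]

```python
def autovacuum_due(dead_tuples, table_size, scale_factor=0.20, threshold=50):
    # Mirrors autovacuum's trigger: threshold + scale_factor * reltuples.
    return dead_tuples > threshold + scale_factor * table_size

def index_scan_due(dead_line_pointers, table_size, lp_fraction=0.20):
    # Hypothetical separate trigger for deferred index cleanup.
    return dead_line_pointers > lp_fraction * table_size

table_size = 1_000_000
# Say ~21% of the table was updated, but 90% of those were HOT updates:
# only a tenth of the dead tuples left dead line pointers behind.
dead_tuples = 210_000
dead_line_pointers = 21_000

assert autovacuum_due(dead_tuples, table_size)             # heap pass due
assert not index_scan_due(dead_line_pointers, table_size)  # index scan can wait
```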

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
