Skip site navigation (1) Skip section navigation (2)

Re: autovacuum strategy / parameters

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rick <richard(dot)branton(at)ca(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: autovacuum strategy / parameters
Date: 2010-04-28 00:55:11
Message-ID: h2o603c8f071004271755v3e56a408hbd23b5c355058aa@mail.gmail.com (view raw or flat)
Thread:
Lists: pgsql-performance
On Thu, Apr 22, 2010 at 4:42 PM, Rick <richard(dot)branton(at)ca(dot)com> wrote:
> On Apr 22, 2:55 pm, robertmh(dot)(dot)(dot)(at)gmail(dot)com (Robert Haas) wrote:
>> On Wed, Apr 21, 2010 at 11:06 AM, Rick <richard(dot)bran(dot)(dot)(dot)(at)ca(dot)com> wrote:
>> > I have a DB with small and large tables that can go up to 15G.
>> > For performance benefits, it appears that analyze has much less cost
>> > than vacuum, but the same benefits?
>>
>> Err, no.  ANALYZE gathers statistics for the query planner; VACUUM
>> clears out old, dead tuples so that space can be reused by the
>> database system.
>>
>> > I can’t find any clear recommendations for frequencies and am
>> > considering these parameters:
>>
>> > Autovacuum_vacuum_threshold = 50000
>> > Autovacuum_analyze_threshold = 10000
>> > Autovacuum_vacuum_scale_factor = 0.01
>> > Autovacuum_analyze_scale_factor = 0.005
>>
>> > This appears it will result in table analyzes occurring around 10,000
>> > to 85,000 dead tuples and vacuum occuring around 50,000 to 200,000,
>> > depending on the table sizes.
>>
>> > Can anyone comment on whether this is the right strategy and targets
>> > to use?
>>
>> I'm not that familiar with tuning these parameters but increasing the
>> default thesholds by a thousand-fold doesn't seem like a good idea.
>> Small tables will never get vacuumed or analyzed at all.
>>
>> ...Robert
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performa(dot)(dot)(dot)(at)postgresql(dot)org)
>> To make changes to your subscription:http://www.postgresql.org/mailpref/pgsql-performance
>
> The problem is with the autovacuum formula:
>
> In a loop, autovacuum checks to see if number of dead tuples >
> ((number of live tuples * autovacuum_vacuum_scale_factor) +
> autovacuum_vacuum_threshold), and if
> so, it runs VACUUM. If not, it sleeps. It works the same way for
> ANALYZE.
>
> So, in a large table, the scale_factor is the dominant term. In a
> small
> table, the threshold is the dominant term. But both are taken into
> account.
>
> The default values are set for small tables; it is not being run for
> large tables.
> The question boils down to exactly what is the max number of dead
> tuples that should be allowed to accumulate before running analyze?
> Since vacuum just recovers space, that doesn't seem to be nearly as
> critical for performance?

That doesn't really match my experience.  Without regular vacuuming,
tables and indices end up being larger than they ought to be and
contain large amounts of dead space that slows things down.  How much
of an impact that ends up having depends on how badly bloated they are
and what you're trying to do, but it can get very ugly.

My guess is that the reason we run ANALYZE more frequently than vacuum
(with the default settings) is that ANALYZE is pretty cheap.  In many
cases, if the statistical distribution of the data hasn't changed
much, then it's not really necessary, but it doesn't cost much either.
 And for certain types of usage patterns, like time series (where the
maximum value keeps increasing) it's REALLY important to analyze
frequently.

But having said that, on the systems I've worked with, I've only
rarely seen a problem caused by not analyzing frequently enough.  On
the other hand, I've seen MANY problems caused by not vacuuming
enough.  Someone runs a couple of big queries that rewrite a large
portion of a table several times over and, boom, problems.  8.3 and
higher are better about this because of an optimization called HOT,
but there can still be problems.

Other people's experiences may not match mine, but the bottom line is
that you need to do both of these things, and you need to make sure
they happen regularly.  In most cases, the CPU and I/O time they
consume will be amply repaid in improved query performance.

...Robert

In response to

Responses

pgsql-performance by date

Next:From: Robert HaasDate: 2010-04-28 00:56:56
Subject: Re: tmpfs and postgres memory
Previous:From: Robert HaasDate: 2010-04-28 00:46:36
Subject: Re: Optimization idea

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group