BUG #5759: Autovacuum cost limit trends to 1, using default config

From: "Simon Poole" <postgresql(at)simon(dot)themalago(dot)net>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #5759: Autovacuum cost limit trends to 1, using default config
Date: 2010-11-19 15:32:59
Message-ID: 201011191532.oAJFWxR2095324@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


The following bug has been logged online:

Bug reference: 5759
Logged by: Simon Poole
Email address: postgresql(at)simon(dot)themalago(dot)net
PostgreSQL version: 8.4.4
Operating system: RHEL5.3
Description: Autovacuum cost limit trends to 1, using default config
Details:

With the default autovacuum settings (3 workers, and default cost_limit of
200), all three workers start with a cost-limit of 66 each, but after each
table is completed, the cost-limit is reduced until they're all running with
a cost-limit of 1.

Here's what I'm seeing in my logs (with log_min_messages = 'debug')
# grep "vacuum\|analyz\|autovac\|scanned\|truncated"
postgresql-2010-11-19_135406.log

[2010-11-19 15:06:47.665 UTC] : DEBUG: autovac_balance_cost(pid=5080
db=305107164, rel=305109578, cost_limit=66, cost_delay=20)
[2010-11-19 15:06:47.665 UTC] : DEBUG: autovac_balance_cost(pid=6030
db=305107164, rel=305108600, cost_limit=66, cost_delay=20)
[2010-11-19 15:06:47.665 UTC] : DEBUG: autovac_balance_cost(pid=12682
db=305107164, rel=1247, cost_limit=66, cost_delay=20)
[2010-11-19 15:06:47.666 UTC] : DEBUG: vacuuming "pg_catalog.pg_type"
[2010-11-19 15:06:48.364 UTC] : DEBUG: scanned index "pg_type_oid_index" to
remove 198 row versions
[2010-11-19 15:06:48.565 UTC] : DEBUG: scanned index
"pg_type_typname_nsp_index" to remove 198 row versions
[2010-11-19 15:06:49.867 UTC] : DEBUG: "pg_type": truncated 83 to 59 pages
[2010-11-19 15:06:49.884 UTC] : DEBUG: analyzing "pg_catalog.pg_type"
[2010-11-19 15:06:50.119 UTC] : DEBUG: "pg_type": scanned 59 of 59 pages,
containing 1910 live rows and 14 dead rows; 1910 rows in sample, 1910
estimated total rows
[2010-11-19 15:06:50.196 UTC] : DEBUG: autovac_balance_cost(pid=5080
db=305107164, rel=305109578, cost_limit=28, cost_delay=20)
[2010-11-19 15:06:50.196 UTC] : DEBUG: autovac_balance_cost(pid=6030
db=305107164, rel=305108600, cost_limit=28, cost_delay=20)
[2010-11-19 15:06:50.196 UTC] : DEBUG: autovac_balance_cost(pid=12682
db=305107164, rel=2619, cost_limit=9, cost_delay=20)
[2010-11-19 15:06:50.196 UTC] : DEBUG: vacuuming "pg_catalog.pg_statistic"
[2010-11-19 15:07:23.041 UTC] : DEBUG: scanned index
"filmography_primary_id_idx" to remove 2506628 row versions
[2010-11-19 15:07:24.981 UTC] : DEBUG: scanned index
"pg_statistic_relid_att_index" to remove 105 row versions
[2010-11-19 15:07:27.190 UTC] : DEBUG: autovac_balance_cost(pid=5080
db=305107164, rel=305109578, cost_limit=4, cost_delay=20)
[2010-11-19 15:07:27.190 UTC] : DEBUG: autovac_balance_cost(pid=6030
db=305107164, rel=305108600, cost_limit=4, cost_delay=20)
[2010-11-19 15:07:27.190 UTC] : DEBUG: autovac_balance_cost(pid=12682
db=305107164, rel=1249, cost_limit=1, cost_delay=20)
[2010-11-19 15:07:27.190 UTC] : DEBUG: vacuuming "pg_catalog.pg_attribute"
[2010-11-19 15:08:14.787 UTC] : DEBUG: scanned index
"pg_attribute_relid_attnam_index" to remove 792 row versions
[2010-11-19 15:08:26.565 UTC] : DEBUG: scanned index
"pg_attribute_relid_attnum_index" to remove 792 row versions
[2010-11-19 15:08:44.995 UTC] : DEBUG: autovac_balance_cost(pid=5080
db=305107164, rel=305109578, cost_limit=1, cost_delay=20)
[2010-11-19 15:08:44.995 UTC] : DEBUG: autovac_balance_cost(pid=6030
db=305107164, rel=305108600, cost_limit=1, cost_delay=20)
[2010-11-19 15:08:44.995 UTC] : DEBUG: autovac_balance_cost(pid=12682
db=305107164, rel=1259, cost_limit=1, cost_delay=20)

If I send a SIGHUP to the launcher, it briefly jumps to 99/99/1, then
reverts to 1/1/1:
[2010-11-19 15:08:49.101 UTC] : DEBUG: autovac_balance_cost(pid=5080
db=305107164, rel=305109578, cost_limit=99, cost_delay=20)
[2010-11-19 15:08:49.101 UTC] : DEBUG: autovac_balance_cost(pid=6030
db=305107164, rel=305108600, cost_limit=99, cost_delay=20)
[2010-11-19 15:08:49.101 UTC] : DEBUG: autovac_balance_cost(pid=12682
db=305107164, rel=1259, cost_limit=1, cost_delay=20)
[2010-11-19 15:08:53.968 UTC] : DEBUG: scanned index "pg_class_oid_index"
to remove 198 row versions
[2010-11-19 15:08:58.326 UTC] : DEBUG: scanned index
"pg_class_relname_nsp_index" to remove 198 row versions
[2010-11-19 15:09:01.149 UTC] : DEBUG: analyzing "pg_catalog.pg_class"
[2010-11-19 15:09:03.981 UTC] : DEBUG: "pg_class": scanned 138 of 138
pages, containing 2781 live rows and 14 dead rows; 2781 rows in sample, 2781
estimated total rows
[2010-11-19 15:09:04.206 UTC] : DEBUG: autovac_balance_cost(pid=5080
db=305107164, rel=305109578, cost_limit=1, cost_delay=20)
[2010-11-19 15:09:04.206 UTC] : DEBUG: autovac_balance_cost(pid=6030
db=305107164, rel=305108600, cost_limit=1, cost_delay=20)
[2010-11-19 15:09:04.206 UTC] : DEBUG: autovac_balance_cost(pid=12682
db=305107164, rel=2606, cost_limit=1, cost_delay=20)

My vacuum config is all left at the defaults (i.e. commented out) except
for:
autovacuum = on
autovacuum_vacuum_scale_factor = 0.025
autovacuum_analyze_scale_factor = 0.050
vacuum_cost_delay = 50

If you start with those defaults set autovacuum_vacuum_cost_limit to 200,
then the problem goes away.

To reproduce, start everything up with autovacuum_vacuum_cost_limit=200.
You'll get three processes each with cost_limit=66. Then change that to -1
in the config, SIGHUP to reload, and then kill all the autovacuum workers.
The launcher will launch new worker processes with
autovacuum_vacuum_cost_limit=-1. After each table completes, you'll see
the cost_limit drop.

I've been reading the code to try and work out what's going on, and I think
I've found a bug.

In autovacuum.c , wi_cost_limit_base should always be initialised/refreshed
with the currently configured value. But it is derived from
tab->at_vacuum_cost_limit, which is initialised from VacuumCostLimit if
autovacuum_vacuum_cost_limit isn't explicitly set. At this point I
*think* VacuumCostLimit is normally set to the current wi_cost_limit of the
last worker in the runningworkers queue, rather than the currently
configured value, so wi_cost_limit_base is being initialised from the
current wi_cost_limit of the last worker in the queue.

Normally a worker starts with wi_cost_limit_base=200 and wi_cost_limit=66.
But after completing one table, wi_cost_limit_base gets set to 66 for that
worker, which affects the cost_limit calculation for all workers, and they
all drop. Pretty quickly they all reach 1.

I'm surprised this hasn't been spotted before, given that it happens with
the default values, but I can't find any references to it anywhere. FWIW I
spotted it because some vacuums were taking days to run, and strace showed
the workers sleeping for 80ms after each and every read.

The workaround is to explicitly set autovacuum_vacuum_cost_limit in your
config, and not rely on the default fallback to vacuum_cost_limit.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Kevin Grittner 2010-11-19 19:10:55 Re: BUG #5758: Error during backup
Previous Message Sunil Vedd 2010-11-19 13:11:19 BUG #5758: Error during backup