Re: stats collector suddenly causing lots of IO

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Kupershmidt <schmiddy(at)gmail(dot)com>, Chris <lists(at)deksai(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: stats collector suddenly causing lots of IO
Date: 2010-04-16 17:20:54
Message-ID: C449E205-8B0B-4024-B647-4CC6F2C46CA4@richrelevance.com
Lists: pgsql-performance


On Apr 16, 2010, at 9:48 AM, Tom Lane wrote:

> Josh Kupershmidt <schmiddy(at)gmail(dot)com> writes:
>> On Fri, Apr 16, 2010 at 11:41 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Wow. Well, we have a smoking gun here: for some reason, autovacuum
>>> isn't running, or isn't doing its job if it is. If it's not running
>>> at all, that would explain failure to prune the stats collector's file
>>> too.
>
>> Hrm, well autovacuum is at least trying to do work: it's currently
>> stuck on those bloated pg_catalog tables, of course. Another developer
>> killed an autovacuum of pg_attribute (or maybe it was pg_attrdef)
>> after it had been running for two weeks. See current pg_stat_activity
>> output attached, which shows the three autovacuum workers running plus
>> two manual VACUUM ANALYZEs I started yesterday.
>
> Two weeks? What have you got the autovacuum cost delays set to?
>
> Once you're up to three AV workers, no new ones can get launched until
> one of those finishes or is killed. So that would explain failure to
> prune the stats collector's tables (the tabpurge code is only run during
> AV worker launch). So what we need to figure out is why it's taking so
> obscenely long to vacuum these tables ...
>

On any large system with good I/O, I have had to significantly increase the aggressiveness of autovacuum.
Even with the settings below, it doesn't interfere with other activity (the I/O subsystem here can do roughly 2200 random IOPS and 900 MB/sec sequential).

My relevant autovacuum parameters are (from SHOW ALL):
autovacuum | on | Starts the autovacuum subprocess.
autovacuum_analyze_scale_factor | 0.1 | Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples.
autovacuum_analyze_threshold | 50 | Minimum number of tuple inserts, updates or deletes prior to analyze.
autovacuum_freeze_max_age | 200000000 | Age at which to autovacuum a table to prevent transaction ID wraparound.
autovacuum_max_workers | 3 | Sets the maximum number of simultaneously running autovacuum worker processes.
autovacuum_naptime | 1min | Time to sleep between autovacuum runs.
autovacuum_vacuum_cost_delay | 20ms | Vacuum cost delay in milliseconds, for autovacuum.
autovacuum_vacuum_cost_limit | 2000 | Vacuum cost amount available before napping, for autovacuum.
autovacuum_vacuum_scale_factor | 0.2 | Number of tuple updates or deletes prior to vacuum as a fraction of reltuples.
autovacuum_vacuum_threshold | 50 | Minimum number of tuple updates or deletes prior to vacuum.
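
For comparison: most of those values are the stock 8.4 defaults; the one substantive change is the cost limit. A minimal postgresql.conf sketch of that change (a starting point, not a universal recommendation):

    # autovacuum_vacuum_cost_limit defaults to -1, which inherits
    # vacuum_cost_limit (200). Raising it to 2000 gives each autovacuum
    # worker ~10x the I/O budget per cost-delay sleep.
    autovacuum_vacuum_cost_limit = 2000
    # These are the 8.4 defaults, shown for completeness:
    autovacuum_vacuum_cost_delay = 20ms
    autovacuum_max_workers = 3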

For what it is worth, I just went onto one of my systems -- one with lots of partition tables and temp table creation/destruction -- and looked at the system tables in question there.

Postgres 8.4, using \dt+ (result below trimmed to the interesting tables)

Schema | Name | Type | Owner | Size | Description
------------+-------------------------+-------+----------+------------+-------------
pg_catalog | pg_attrdef | table | postgres | 195 MB |
pg_catalog | pg_attribute | table | postgres | 1447 MB |
pg_catalog | pg_class | table | postgres | 1694 MB |
pg_catalog | pg_constraint | table | postgres | 118 MB |
pg_catalog | pg_depend | table | postgres | 195 MB |
pg_catalog | pg_statistic | table | postgres | 2300 MB |
pg_catalog | pg_type | table | postgres | 181 MB |

So, on each of these I ran VACUUM FULL, REINDEX TABLE, and ANALYZE in sequence. I wish I could just CLUSTER them, but the above works.
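
Spelled out for one of the tables (the same three statements were run against each):

    VACUUM FULL pg_catalog.pg_attribute;
    REINDEX TABLE pg_catalog.pg_attribute;
    ANALYZE pg_catalog.pg_attribute;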

Now the tables are:
Schema | Name | Type | Owner | Size | Description
------------+-------------------------+-------+----------+------------+-------------
pg_catalog | pg_attrdef | table | postgres | 44 MB |
pg_catalog | pg_attribute | table | postgres | 364 MB |
pg_catalog | pg_class | table | postgres | 1694 MB |
pg_catalog | pg_constraint | table | postgres | 118 MB |
pg_catalog | pg_depend | table | postgres | 195 MB |
pg_catalog | pg_statistic | table | postgres | 656 MB |
pg_catalog | pg_type | table | postgres | 45 MB |

I've learned to accept about 50% bloat (a table at twice its compacted size) in Postgres as just the way it usually is on a busy table, but the 3x to 4x bloat of pg_statistic, pg_attrdef, and pg_attribute has me wondering.
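
If you want to keep an eye on this without manually running \dt+, a quick query sketch (standard size functions, works on 8.4) that lists the largest catalog tables:

    -- Ten largest pg_catalog tables, counting indexes and TOAST.
    SELECT c.relname,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE n.nspname = 'pg_catalog'
       AND c.relkind = 'r'
     ORDER BY pg_total_relation_size(c.oid) DESC
     LIMIT 10;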

I have had some 'idle in transaction' connections hanging around from time to time that have caused issues on this machine, and they could explain the perma-bloat above. That is one thing that could affect the case reported here as well. The worst part is that you can't force-kill those connections from within Postgres (pg_cancel_backend doesn't work on them, and killing them via the OS bounces Postgres), so you have to hunt down the offending client.
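
To hunt them down, a query sketch using 8.4's pg_stat_activity column names (procpid and current_query were renamed in later releases):

    -- Connections sitting idle inside an open transaction, oldest first.
    SELECT procpid, usename, client_addr, xact_start
      FROM pg_stat_activity
     WHERE current_query = '<IDLE> in transaction'
     ORDER BY xact_start;

From there, client_addr usually points you at the offending application.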

> regards, tom lane
