autovacuum stress-testing our system

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: autovacuum stress-testing our system
Date: 2012-09-26 12:43:30
Message-ID: 1718942738eb65c8407fcd864883f4c8@fuzzy.cz
Lists: pgsql-hackers

Hi,

I've been struggling with autovacuum generating a lot of I/O and CPU on some of
our systems - after a night spent analyzing this behavior, I believe the current
autovacuum accidentally behaves a bit like a stress-test in some corner cases
(but I may be seriously wrong, after all it was a long night).

First - our system really is not a "common" one - we have ~1000 databases of
various sizes, each containing up to several thousand tables (a few user-defined
tables, the rest serving as caches for a reporting application - yes, it's a bit
of a weird design, but that's life). This all leads to a pgstat.stat
significantly larger than 60 MB.

Now, the two main pieces of information from pgstat.c are the timer definitions

---------------------------------- pgstat.c : 80 ----------------------------------

#define PGSTAT_STAT_INTERVAL    500     /* Minimum time between stats file
                                         * updates; in milliseconds. */

#define PGSTAT_RETRY_DELAY      10      /* How long to wait between checks for
                                         * a new file; in milliseconds. */

#define PGSTAT_MAX_WAIT_TIME    10000   /* Maximum time to wait for a stats
                                         * file update; in milliseconds. */

#define PGSTAT_INQ_INTERVAL     640     /* How often to ping the collector for
                                         * a new file; in milliseconds. */

#define PGSTAT_RESTART_INTERVAL 60      /* How often to attempt to restart a
                                         * failed statistics collector; in
                                         * seconds. */

#define PGSTAT_POLL_LOOP_COUNT  (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
#define PGSTAT_INQ_LOOP_COUNT   (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)

-----------------------------------------------------------------------------------

and then this loop (the current HEAD does this a bit differently, but the 9.2
code is a bit more readable and suffers from the same issue):

---------------------------------- pgstat.c : 3560 --------------------------------

    /*
     * Loop until fresh enough stats file is available or we ran out of time.
     * The stats inquiry message is sent repeatedly in case collector drops
     * it; but not every single time, as that just swamps the collector.
     */
    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
    {
        TimestampTz file_ts = 0;

        CHECK_FOR_INTERRUPTS();

        if (pgstat_read_statsfile_timestamp(false, &file_ts) &&
            file_ts >= min_ts)
            break;

        /* Not there or too old, so kick the collector and wait a bit */
        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
            pgstat_send_inquiry(min_ts);

        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
    }

    if (count >= PGSTAT_POLL_LOOP_COUNT)
        elog(WARNING, "pgstat wait timeout");

    /* Autovacuum launcher wants stats about all databases */
    if (IsAutoVacuumLauncherProcess())
        pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
    else
        pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);

-----------------------------------------------------------------------------------

What this code does is check the statfile, and if it's not stale (for an
autovacuum worker, the timestamp of the write start must not be older than
PGSTAT_RETRY_DELAY milliseconds), the loop terminates and the file is read.
Otherwise it sleeps and retries, up to PGSTAT_POLL_LOOP_COUNT = 10000/10 = 1000
times, re-sending the inquiry on every 64th (PGSTAT_INQ_LOOP_COUNT) iteration.
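
For completeness, min_ts is set just before this loop - I'm paraphrasing from
memory here, so the exact 9.2 code may differ slightly, but the gist is that
autovacuum processes demand a much fresher file than regular backends:

-----------------------------------------------------------------------------------

    /* autovacuum workers accept only a ~10 ms old file, backends ~500 ms */
    if (IsAutoVacuumWorkerProcess())
        min_ts = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                             -PGSTAT_RETRY_DELAY);
    else
        min_ts = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                             -PGSTAT_STAT_INTERVAL);

-----------------------------------------------------------------------------------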

Now, let's suppose the write takes more than 10 ms, i.e. longer than
PGSTAT_RETRY_DELAY. With our current pgstat.stat file size / number of relations
this is quite common - the typical write time in our case is actually ~100 ms,
even if we move the file to tmpfs. That means that almost every call to
backend_read_statsfile (which happens in all the pgstat_fetch_stat_*entry calls)
results in a continuous stream of inquiries from the autovacuum workers and
continuous writing/reading of the file. With a ~100 ms write and a 10 ms retry,
each such call polls about ten times before a fresh enough file appears, only
for it to be stale again by the next call.

We're not getting 'pgstat wait timeout' though, because the file always gets
written well within PGSTAT_MAX_WAIT_TIME.

By moving the file to tmpfs we've minimized the I/O impact, but now the
collector and autovacuum launcher consume ~75% of a CPU (i.e. roughly one core)
and do nothing except burn power, because the database is almost read-only. Not
a good thing in the "green computing" era, I guess.

First, I'm interested in feedback - did I get all the details right, or am I
missing something important?

Next, I'm thinking about ways to solve this:

1) turning off autovacuum and doing regular VACUUM ANALYZE from cron - certainly
   an option, but it's more a workaround than a solution, and I'm not very fond
   of it. Moreover, it fixes only one side of the problem - triggering the
   statfile writes over and over. The file will still be written anyway,
   although not as frequently.

2) tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY and so
   on, so that data several seconds old counts as fresh enough - it would be
   nice to have these as GUC variables (a sketch of what that might look like
   follows this list), although we can do another private patch on our own.
   But more knobs is not always better.

3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the
   time it takes to write the file (e.g. 10x the write time or something).

4) keeping some sort of "dirty flag" in the stat entries, and then writing only
   info about objects that were modified enough to be eligible for
   vacuum/analyze (e.g. an increasing number of index scans can't trigger
   autovacuum, while inserting rows can). Also, I'm not worried about getting a
   slightly stale number of index scans, so 'clean' records might be written
   less frequently than 'dirty' ones.

5) splitting the single stat file into multiple pieces - e.g. per database,
   written separately - so that the autovacuum workers don't need to read all
   the data even for databases that don't need to be vacuumed. This might be
   combined with (4); see the sketch at the end of this mail.
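
Regarding (2), I imagine something along these lines - the name and limits are
made up and it's completely untested, but it shows the scale of the change (a
variable instead of the #define, plus an entry in guc.c):

-----------------------------------------------------------------------------------

    /* pgstat.c - variable replacing the PGSTAT_RETRY_DELAY #define */
    int         pgstat_retry_delay = 10;

    /* guc.c - hypothetical entry in ConfigureNamesInt[] */
    {
        {"pgstat_retry_delay", PGC_SIGHUP, STATS_COLLECTOR,
            gettext_noop("Sets the delay between checks for a fresh statistics file."),
            NULL,
            GUC_UNIT_MS
        },
        &pgstat_retry_delay,
        10, 1, 10000,
        NULL, NULL, NULL
    },

-----------------------------------------------------------------------------------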

Ideas? Objections? Preferred options?

I kinda like (4+5), although that'd be a pretty big patch and I'm not entirely
sure it can be done without breaking other things.
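
To make (4+5) a bit more concrete, here is a rough standalone sketch of what I
mean - all the names (StatEntry, write_db_statsfile, the db_<oid>.stat path) are
made up for illustration and don't match the real pgstat.c structures:

-----------------------------------------------------------------------------------

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;

/* simplified per-table stats entry with a "dirty" flag - idea (4) */
typedef struct StatEntry
{
    Oid         tableid;
    int64_t     tuples_inserted;    /* counters that may trigger autovacuum */
    int64_t     tuples_updated;
    int64_t     tuples_deleted;
    int64_t     index_scans;        /* can't trigger autovacuum by itself */
    bool        dirty;              /* a vacuum-relevant counter changed */
} StatEntry;

/*
 * Write the stats for one database into its own file - idea (5) - skipping
 * entries that haven't changed in a vacuum-relevant way since the last write.
 */
static void
write_db_statsfile(Oid dboid, StatEntry *entries, int nentries)
{
    char        path[64];
    FILE       *fp;
    int         i;

    snprintf(path, sizeof(path), "pg_stat_tmp/db_%u.stat", dboid);

    fp = fopen(path, "wb");
    if (fp == NULL)
        return;                 /* real code would report the error */

    for (i = 0; i < nentries; i++)
    {
        if (!entries[i].dirty)
            continue;           /* clean entries written less frequently */

        fwrite(&entries[i], sizeof(StatEntry), 1, fp);
        entries[i].dirty = false;
    }

    fclose(fp);
}

-----------------------------------------------------------------------------------

An autovacuum worker for database X would then read just db_X.stat instead of
the whole 60 MB file, and the 'clean' entries could be flushed on a much longer
interval (the real thing would of course have to merge with the previously
written data rather than drop clean entries).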

regards
Tomas
