Re: autovacuum stress-testing our system

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: autovacuum stress-testing our system
Date: 2012-09-26 15:25:58
Message-ID: 1c9a6358f18915d74e375746b0f6e311@fuzzy.cz
Lists: pgsql-hackers

Dne 26.09.2012 16:51, Jeff Janes napsal:
> On Wed, Sep 26, 2012 at 5:43 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>
>> First - our system really is not a "common" one - we do have ~1000
>> databases of various sizes, each containing up to several thousands of
>> tables (several user-defined tables, the rest serve as caches for a
>> reporting application - yes, it's a bit weird design but that's life).
>> This all leads to pgstat.stat significantly larger than 60 MB.
>>
> ...
>> Now, let's suppose the write takes >10 ms, which is the
>> PGSTAT_RETRY_DELAY value. With our current pgstat.stat file size /
>> number of relations, this is quite common. Actually the common write
>> time in our case is ~100 ms, even if we move the file into tmpfs. That
>> means that almost all the calls to backend_read_statsfile (which happen
>> in all pgstat_fetch_stat_*entry calls) result in a continuous stream of
>> inquiries from the autovacuum workers, and writing/reading of the file.
>
> I don't think it actually does. What you are missing is the same
> thing I was missing a few weeks ago when I also looked into something
> like this.
>
> 3962:
>
> * We don't recompute min_ts after sleeping, except in the
> * unlikely case that cur_ts went backwards.
>
> That means the file must have been written within 10 ms of when we
> *first* asked for it.

Yeah, right - I've missed the first "if (pgStatDBHash)" check right at
the beginning.
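
For anyone following the thread, roughly how I now read the control flow
(a simplified paraphrase, not the actual pgstat.c code):

    #include <stddef.h>

    typedef struct HTAB HTAB;         /* opaque stand-in for the real hash type */

    static HTAB *pgStatDBHash = NULL; /* non-NULL once this backend has a snapshot */

    /* Simplified sketch: the function bails out as soon as a snapshot
     * exists, so PGSTAT_RETRY_DELAY only governs the very first read in a
     * process, and min_ts is computed once (not recomputed after sleeping,
     * per the comment quoted above). */
    static void
    backend_read_statsfile(void)
    {
        if (pgStatDBHash)
            return;                   /* already read it in this process */

        /*
         * Otherwise: compute min_ts (now minus PGSTAT_RETRY_DELAY for an
         * autovacuum worker), keep sending inquiries to the collector, and
         * wait until the file's timestamp is >= min_ts, then load it into
         * pgStatDBHash.
         */
    }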

> What is generating the endless stream you are seeing is that you have
> 1000 databases so if naptime is one minute you are vacuuming 16 per
> second. Since every database gets a new process, that process needs
> to read the file as it doesn't inherit one.

Right. But that makes the 10 ms timeout even more strange, because the
worker is then using the data for a very long time (even minutes).

>
> ...
>>
>> First, I'm interested in feedback - did I get all the details right,
>> or am I missing something important?
>>
>> Next, I'm thinking about ways to solve this:
>>
>> 1) turning off autovacuum, doing regular VACUUM ANALYZE from cron
>
> Increasing autovacuum_naptime seems like a far better way to do
> effectively the same thing.

Agreed. One of my colleagues turned autovacuum off a few years back, and
that was a nice lesson in how not to solve this kind of issue.

>> 2) tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY
>> and so on, to consider several seconds to be fresh enough - it would be
>> nice to have this as a GUC variable, although we can do another private
>> patch on our own. But more knobs is not always better.
>
> I think forking it off to another value would be better. If you
> are an autovacuum worker which is just starting up and so getting its
> initial stats, you can tolerate a stats file up to "autovacuum_naptime
> / 5.0" stale. If you are already started up and are just about to
> vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it
> currently is, so as not to redundantly vacuum a table.

I always thought there's a "no more than one worker per database" limit,
and that the file is always reloaded when switching to another database.
So I'm not sure how a worker could see such stale table info? Or are the
workers keeping the stats across multiple databases?
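
Either way, if we went down that path, I'd imagine something roughly like
this (just a sketch - the helper name and the startup flag are made up):

    #define PGSTAT_RETRY_DELAY          10   /* ms, as today */

    static int autovacuum_naptime = 60;      /* seconds (the GUC) */

    /* Relaxed staleness bound for a worker that is just starting up and
     * loading its initial stats; keep PGSTAT_RETRY_DELAY only for the
     * pre-vacuum recheck of a single table. */
    static int
    acceptable_staleness_ms(int is_autovacuum_startup)
    {
        if (is_autovacuum_startup)
            return (autovacuum_naptime * 1000) / 5;   /* naptime / 5.0 */
        return PGSTAT_RETRY_DELAY;
    }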

>> 3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly
>> on the time it takes to write the file (e.g. 10x the write time or
>> something).
>
> This is already in place.

Really? Where?

I've checked the current master, and the only thing I see in
pgstat_write_statsfile is this (line 3558):

    last_statwrite = globalStats.stats_timestamp;

https://github.com/postgres/postgres/blob/master/src/backend/postmaster/pgstat.c#L3558

I don't think that's doing what I meant. That really doesn't scale the
timeout according to write time. What happens right now is that when the
stats file is written at time 0 (the write starts at zero and finishes at
100 ms), and a worker asks for the file at 99 ms (i.e. 1 ms before the
write finishes), it will set the time of the inquiry to last_statrequest
and then do this:

    if (last_statwrite < last_statrequest)
        pgstat_write_statsfile(false);

i.e. comparing it to the start of the write. So another write will start
right after the file is written out. And over and over.
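
To spell out the timeline from that example:

    t =   0 ms   collector starts writing pgstat.stat
                 (last_statwrite = stats_timestamp = 0, the start of the write)
    t =  99 ms   worker sends an inquiry  ->  last_statrequest = 99
    t = 100 ms   write finishes, the file is renamed into place
    t > 100 ms   collector sees last_statwrite (0) < last_statrequest (99)
                 and immediately starts writing the whole file again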

Moreover there's the 'rename' step making the new file invisible to the
worker processes, which makes the thing a bit more complicated.

What I'm suggesting is that there should be some sort of tracking of the
write time, and then deciding whether the file is fresh enough using 10x
that value. So when a file is written in 100 ms, it'd be considered OK for
the next 900 ms, i.e. 1 second in total. Sure, we could use 5x or some
other coefficient, that doesn't really matter.
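
In code the idea might look something like this (purely illustrative - the
variable and function names below don't exist in pgstat.c):

    static long last_write_duration_ms = 0;   /* measured during the last write */

    /* Remember how long the last pgstat.stat write took. */
    static void
    remember_write_duration(long duration_ms)
    {
        last_write_duration_ms = duration_ms;
    }

    /* Consider the file fresh for 10x the time it took to write it, so a
     * 100 ms write makes the file acceptable for a full second. */
    static int
    statsfile_is_fresh_enough(long age_ms)
    {
        return age_ms <= 10 * last_write_duration_ms;
    }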

>> 5) splitting the single stat file into multiple pieces - e.g. per
>> database, written separately, so that the autovacuum workers don't need
>> to read all the data even for databases that don't need to be vacuumed.
>> This might be combined with (4).
>
> I think this needs to happen eventually.

Yes, a nice patch idea ;-)
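
Something along these lines, perhaps (directory and file naming invented
just to illustrate the split):

    #include <stdio.h>

    typedef unsigned int Oid;   /* stand-in for PostgreSQL's Oid type */

    /* One stats file per database, so an autovacuum worker only reads the
     * (much smaller) file for the database it is about to process. */
    static void
    get_dbstat_filename(Oid databaseid, char *path, size_t len)
    {
        snprintf(path, len, "pg_stat_tmp/db_%u.stat", databaseid);
    }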

thanks for the feedback

Tomas
