Re: autovacuum stress-testing our system

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: autovacuum stress-testing our system
Date: 2012-09-26 23:53:50
Message-ID: 5063958E.1080606@fuzzy.cz
Lists: pgsql-hackers

On 26.9.2012 18:14, Jeff Janes wrote:
> On Wed, Sep 26, 2012 at 8:25 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> Dne 26.09.2012 16:51, Jeff Janes napsal:
>>
>>
>>> What is generating the endless stream you are seeing is that you have
>>> 1000 databases, so if naptime is one minute you are vacuuming 16 per
>>> second. Since every database gets a new process, that process needs
>>> to read the file as it doesn't inherit one.
>>
>>
>> Right. But that makes the 10ms timeout even more strange, because the
>> worker is then using the data for a very long time (even minutes).
>
> On average that can't happen, or else your vacuuming would fall way
> behind. But I agree, there is no reason to have very fresh statistics
> to start with. naptime/5 seems like a good cutoff to me for the
> start-up reading. If a table only becomes eligible for vacuuming in
> the last 20% of the naptime, I see no reason that it can't wait
> another round. But that just means the statistics collector needs to
> write the file less often; the workers still need to read it once per
> database, since each one only vacuums one database and doesn't inherit
> the data from the launcher.

So what happens if there are two workers vacuuming the same database?
Wouldn't that make it more likely that a worker runs into tables that
were already vacuumed by the other worker?

See the comment at the beginning of autovacuum.c, where it also states
that the statfile is reloaded before each table (probably because of the
calls to autovac_refresh_stats, which in turn calls clear_snapshot).
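
To make that concrete, here's a rough standalone model of the per-table
recheck (plain C; the helper names are made up and this is not the
actual autovacuum.c code):

#include <stdio.h>

/* stand-ins for the real stats machinery */
static void
refresh_stats(void)
{
    puts("discard snapshot, reread statfile");
}

static int
still_needs_vacuum(int table)
{
    return table % 2;
}

int
main(void)
{
    /*
     * The worker rechecks each table right before vacuuming it, so a
     * second worker in the same database skips tables the first one
     * already handled - this is why the per-table read wants fresh data.
     */
    for (int table = 0; table < 4; table++)
    {
        refresh_stats();            /* models autovac_refresh_stats() */
        if (!still_needs_vacuum(table))
            continue;               /* someone else got there first */
        printf("vacuuming table %d\n", table);
    }
    return 0;
}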

>>> I think forking it off to another value would be better. If you
>>> are an autovacuum worker which is just starting up and so getting its
>>> initial stats, you can tolerate a stats file up to "autovacuum_naptime
>>> / 5.0" stale. If you are already started up and are just about to
>>> vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it
>>> currently is, so as not to redundantly vacuum a table.
>>
>>
>> I always thought there's a "no more than one worker per database" limit,
>> and that the file is always reloaded when switching to another database.
>> So I'm not sure how a worker could see such stale table info. Or are
>> the workers keeping the stats across multiple databases?
>
> If you only have one "active" database, then all the workers will be
> in it. I don't know how likely it is that they will leap frog each
> other and collide. But anyway, if you have 1000s of databases, then
> each one will generally require zero vacuums per naptime (as you say,
> it is mostly read only), so it is the reads upon start-up, not the
> reads per table that needs vacuuming, which generate most of the
> traffic. Once you separate those two parameters out, playing around
> with the PGSTAT_RETRY_DELAY one seems like a needless risk.

OK, right. My fault.

Yes, our databases are mostly read-only - more precisely, whenever we
load data, we immediately do VACUUM ANALYZE on the tables, so autovacuum
never kicks in on them. The only things it works on are the system
catalogs and similar objects.
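
To sanity-check the idea of separating those two parameters, here's a
tiny standalone model of the two cutoffs (plain C; the constants and
names are only illustrative - naptime/5 is Jeff's proposed start-up
cutoff):

#include <stdbool.h>
#include <stdio.h>

#define PGSTAT_RETRY_DELAY_MS  10       /* current per-table cutoff */
#define AUTOVACUUM_NAPTIME_MS  60000    /* one minute, the default */

/* would we accept a stats file written file_age_ms ago? */
static bool
stats_fresh_enough(long file_age_ms, bool worker_startup)
{
    long max_age = worker_startup
        ? AUTOVACUUM_NAPTIME_MS / 5     /* relaxed start-up cutoff */
        : PGSTAT_RETRY_DELAY_MS;        /* tight check kept per table */

    return file_age_ms <= max_age;
}

int
main(void)
{
    /* a 5s old file is fine at start-up, too stale for a per-table check */
    printf("start-up:  %d\n", stats_fresh_enough(5000, true));
    printf("per-table: %d\n", stats_fresh_enough(5000, false));
    return 0;
}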

>>>> 3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly
>>>> on the time it takes to write the file (e.g. 10x the write time or
>>>> something).
>>>
>>>
>>> This is already in place.
>>
>>
>> Really? Where?
>
> I had thought that this part was effectively the same thing:
>
> * We don't recompute min_ts after sleeping, except in the
> * unlikely case that cur_ts went backwards.
>
> But I think I did not understand your proposal.
>
>>
>> I've checked the current master, and the only thing I see in
>> pgstat_write_statsfile is this (line 3558):
>>
>>     last_statwrite = globalStats.stats_timestamp;
>>
>> https://github.com/postgres/postgres/blob/master/src/backend/postmaster/pgstat.c#L3558
>>
>>
>> I don't think that's doing what I meant. That really doesn't scale
>> the timeout according to write time. What happens right now is that
>> when the stats file is written at time 0 (starts at zero, the write
>> finishes at 100 ms), and a worker asks for the file at 99 ms (i.e.
>> 1 ms before the write finishes), it will set the time of the inquiry
>> to last_statrequest and then do this
>>
>>     if (last_statwrite < last_statrequest)
>>         pgstat_write_statsfile(false);
>>
>> i.e. comparing it to the start of the write. So another write will
>> start right after the file is written out. And over and over.
>
> Ah. I had wondered about this before too, and wondered if it would be
> a good idea to have it go back to the beginning of the stats file, and
> overwrite the timestamp with the current time (rather than the time it
> started writing it), as the last action it does before the rename. I
> think that would automatically make it adaptive to the time it takes
> to write out the file, in a fairly simple way.

Yeah, I was thinking about that too.
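
To make the race and the fix concrete, a rough standalone model (plain C
with fake millisecond timestamps; this is just the comparison logic, not
pgstat.c itself):

#include <stdbool.h>
#include <stdio.h>

static long last_statwrite;    /* timestamp recorded in the stats file */
static long last_statrequest;  /* newest inquiry from a worker */

static bool
need_rewrite(void)
{
    return last_statwrite < last_statrequest;
}

int
main(void)
{
    /* write starts at t=0, takes 100 ms, inquiry arrives at t=99 */
    last_statwrite = 0;        /* current behavior: stamp write start */
    last_statrequest = 99;
    printf("stamp start:  rewrite needed = %d\n", need_rewrite());

    /*
     * Jeff's suggestion: stamp the file just before the rename, i.e.
     * with the write *completion* time, so the in-progress inquiry no
     * longer looks newer than the file.
     */
    last_statwrite = 100;
    printf("stamp finish: rewrite needed = %d\n", need_rewrite());
    return 0;
}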

>> Moreover there's the 'rename' step making the new file invisible for
>> the worker processes, which makes the thing a bit more complicated.
>
> I think renames are assumed to be atomic. Either it sees the old one
> or the new one, but never neither.

I'm not quite sure what I meant, but not this - I know renames are
atomic. I probably hadn't noticed that inquiries are using min_ts, so I
thought that an inquiry sent right after the write starts (with min_ts
before the write) would trigger another write, but that's not the case.
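
For anyone reading along, the write-then-rename pattern is roughly this
(a minimal standalone sketch; the file names are placeholders, not the
actual paths pgstat.c uses):

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    const char *tmpname  = "pgstat.tmp";
    const char *statname = "pgstat.stat";

    FILE *fp = fopen(tmpname, "w");
    if (fp == NULL)
        return EXIT_FAILURE;

    fputs("...stats payload...\n", fp);  /* write the whole file first */
    fclose(fp);

    /*
     * rename() is atomic on POSIX filesystems: readers see either the
     * complete old file or the complete new one, never a partial write.
     */
    if (rename(tmpname, statname) != 0)
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}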

regards
Tomas
