Re: autovacuum stress-testing our system

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: autovacuum stress-testing our system
Date: 2012-11-18 22:49:19
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 26.9.2012 19:18, Jeff Janes wrote:
> On Wed, Sep 26, 2012 at 9:29 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
>>> Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
>>>> On 26-09-2012 09:43, Tomas Vondra wrote:
>>>>> 5) splitting the single stat file into multiple pieces - e.g. per database,
>>>>> written separately, so that the autovacuum workers don't need to read all
>>>>> the data even for databases that don't need to be vacuumed. This might be
>>>>> combined with (4).
>>>> IMHO that's the definitive solution. It would be one file per database plus a
>>>> global one. That way, the check would only read the global.stat and process
>>>> those database that were modified. Also, an in-memory map could store that
>>>> information to speed up the checks.
>>> +1
>> That would help for the case of hundreds of databases, but how much
>> does it help for lots of tables in a single database?
> It doesn't help that case, but that case doesn't need much help. If
> you have N statistics-kept objects in total spread over M databases,
> of which T objects need vacuuming per naptime, the stats file traffic
> is proportional to N*(M+T). If T is low, then there is generally is
> no problem if M is also low. Or at least, the problem is much smaller
> than when M is high for a fixed value of N.

I've done some initial hacking on splitting the stat file into multiple
smaller pieces over the weekend, and it seems promising (at least with
respect to the issues we're having).

See the patch attached, but be aware that this is a very early WIP (or
rather a proof of concept), so it has many rough edges (read "sloppy
coding"). I haven't even added it to the commitfest yet ...

The two main changes are these:

(1) The stats file is split into a common "db" file, containing all the
DB Entries, and per-database files with tables/functions. The common
file is still called "pgstat.stat", the per-db files have the
database OID appended, so for example "pgstat.stat.12345" etc.

This was a trivial hack pgstat_read_statsfile/pgstat_write_statsfile
functions, introducing two new functions:


that do the trick of reading/writing stat file for one database.

(2) The pgstat_read_statsfile has an additional parameter "onlydbs" that
says that you don't need table/func stats - just the list of db
entries. This is used for autovacuum launcher, which does not need
to read the table/stats (if I'm reading the code in autovacuum.c
correctly - it seems to be working as expected).

So what are the benefits?

(a) When a launcher asks for info about databases, something like this
is called in the end:

pgstat_read_db_statsfile(InvalidOid, false, true)

which means all databases (InvalidOid) and only db info (true). So
it reads only the one common file with db entries, not the
table/func stats.

(b) When a worker asks for stats for a given DB, something like this is
called in the end:

pgstat_read_db_statsfile(MyDatabaseId, false, false)

which reads only the common stats file (with db entries) and only
one file for the one database.

The current implementation (with the single pgstat.stat file), all
the data had to be read (and skipped silently) in both cases.
That's a lot of CPU time, and we're seeing ~60% of CPU spent on
doing just this (writing/reading huge statsfile).

So with a lot of databases/objects, this "pgstat.stat split" saves
us a lot of CPU ...

(c) This should lower the space requirements too - with a single file,
you actually need at least 2x the disk space (or RAM, if you're
using tmpfs as we are), because you need to keep two versions of
the file at the same time (pgstat.stat and pgstat.tmp).

Thanks to this split you only need additional space for a copy of
the largest piece (with some reasonable safety reserve).

Well, it's very early patch, so there are rough edges too

(a) It does not solve the "many-schema" scenario at all - that'll need
a completely new approach I guess :-(

(b) It does not solve the writing part at all - the current code uses a
single timestamp (last_statwrite) to decide if a new file needs to
be written.

That clearly is not enough for multiple files - there should be one
timestamp for each database/file. I'm thinking about how to solve
this and how to integrate it with pgstat_send_inquiry etc.

One way might be adding the timestamp(s) into PgStat_StatDBEntry
and the other one is using an array of inquiries for each database.

And yet another one I'm thinking about is using a fixed-length
array of timestamps (e.g. 256), indexed by mod(dboid,256). That
would mean stats for all databases with the same mod(oid,256) would
be written at the same time. Seems like an over-engineering though.

(c) I'm a bit worried about the number of files - right now there's one
for each database and I'm thinking about splitting them by type
(one for tables, one for functions) which might make it even faster
for some apps with a lot of stored procedures etc.

But is the large number of files actually a problem? After all,
we're using one file per relation fork in the "base" directory, so
this seems like a minor issue.

And if really an issue, this might be solved by the mod(oid,256) to
combine multiple files into one (which would work neatly with the
fixed-length array of timestamps).

kind regards

Attachment Content-Type Size
stats-split.patch text/plain 11.3 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Marko Tiikkaja 2012-11-18 23:05:26 Re: pg_dump --split patch
Previous Message Alexander Korotkov 2012-11-18 21:54:53 WIP: store additional info in GIN index