Re: autovacuum stress-testing our system

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: autovacuum stress-testing our system
Date: 2012-12-03 02:07:43
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 21.11.2012 19:02, Robert Haas wrote:
> On Sun, Nov 18, 2012 at 5:49 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> The two main changes are these:
>> (1) The stats file is split into a common "db" file, containing all the
>> DB Entries, and per-database files with tables/functions. The common
>> file is still called "pgstat.stat", the per-db files have the
>> database OID appended, so for example "pgstat.stat.12345" etc.
>> This was a trivial hack pgstat_read_statsfile/pgstat_write_statsfile
>> functions, introducing two new functions:
>> pgstat_read_db_statsfile
>> pgstat_write_db_statsfile
>> that do the trick of reading/writing stat file for one database.
>> (2) The pgstat_read_statsfile has an additional parameter "onlydbs" that
>> says that you don't need table/func stats - just the list of db
>> entries. This is used for autovacuum launcher, which does not need
>> to read the table/stats (if I'm reading the code in autovacuum.c
>> correctly - it seems to be working as expected).
> I'm not an expert on the stats system, but this seems like a promising
> approach to me.
>> (a) It does not solve the "many-schema" scenario at all - that'll need
>> a completely new approach I guess :-(
> We don't need to solve every problem in the first patch. I've got no
> problem kicking this one down the road.
>> (b) It does not solve the writing part at all - the current code uses a
>> single timestamp (last_statwrite) to decide if a new file needs to
>> be written.
>> That clearly is not enough for multiple files - there should be one
>> timestamp for each database/file. I'm thinking about how to solve
>> this and how to integrate it with pgstat_send_inquiry etc.
> Presumably you need a last_statwrite for each file, in a hash table or
> something, and requests need to specify which file is needed.
>> And yet another one I'm thinking about is using a fixed-length
>> array of timestamps (e.g. 256), indexed by mod(dboid,256). That
>> would mean stats for all databases with the same mod(oid,256) would
>> be written at the same time. Seems like an over-engineering though.
> That seems like an unnecessary kludge.
>> (c) I'm a bit worried about the number of files - right now there's one
>> for each database and I'm thinking about splitting them by type
>> (one for tables, one for functions) which might make it even faster
>> for some apps with a lot of stored procedures etc.
>> But is the large number of files actually a problem? After all,
>> we're using one file per relation fork in the "base" directory, so
>> this seems like a minor issue.
> I don't see why one file per database would be a problem. After all,
> we already have on directory per database inside base/. If the user
> has so many databases that dirent lookups in a directory of that size
> are a problem, they're already hosed, and this will probably still
> work out to a net win.

Attached is a v2 of the patch, fixing some of the issues and unclear
points from the initial version.

The main improvement is that it implements writing only stats for the
requested database (set when sending inquiry). There's a dynamic array
of request - for each DB only the last request is kept.

I've done a number of changes - most importantly:

- added a stats_timestamp field to PgStat_StatDBEntry, keeping the last
write of the database (i.e. a per-database last_statwrite), which is
used to decide whether the file is stale or not

- handling of the 'permanent' flag correctly (used when starting or
stopping the cluster) for per-db files

- added a very simple header to the per-db files (basically just a
format ID and a timestamp) - this is needed for checking of the
timestamp of the last write from workers (although maybe we could
just read the pgstat.stat, which is now rather small)

- a 'force' parameter (true - write all databases, even if they weren't
specifically requested)

So with the exception of 'multi-schema' case (which was not the aim of
this effort), it should solve all the issues of the initial version.

There are two blocks of code dealing with clock glitches. I haven't
fixed those yet, but that can wait I guess. I've also left there some
logging I've used during development (printing inquiries and which file
is written and when).

The main unsolved problem I'm struggling with is what to do when a
database is dropped? Right now, the statfile remains in pg_stat_tmp
forewer (or until the restart) - is there a good way to remove the
file? I'm thinking about adding a message to be sent to the collector
from the code that handles DROP TABLE.

I've done some very simple performance testing - I've created 1000
databases with 1000 tables each, done ANALYZE on all of them. With only
autovacum running, I've seen this:

Without the patch

18 3.0 0:10.10 postgres: autovacuum launcher process
17 2.6 0:11.44 postgres: stats collector process

The I/O was seriously bogged down, doing ~150 MB/s (basically what the
drive can handle) - with less dbs, or when the statfiles are placed on
tmpfs filesystem, we usually see ~70% of one core doing just this.

With the patch

Then, the typical "top" for PostgreSQL processes looked like this:

2 0.3 1:16.57 postgres: autovacuum launcher process
2 3.1 0:25.34 postgres: stats collector process

and the average write speed from the stats collector was ~3.5MB/s
(measured using iotop), and even when running the ANALYZE etc. I was
getting rather light IO usage (like ~15 MB/s or something).

With both cases, the total size was ~150MB, but without the space
requirements are actually 2x that (because of writing a copy and then

I'd like to put this into 2013-01 commit fest, but if we can do some
prior testing / comments, that'd be great.


Attachment Content-Type Size
stats-split-v2.patch text/plain 26.8 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Sabino Mullane 2012-12-03 02:38:20 Re: Tablespaces in the data directory
Previous Message Andrew Dunstan 2012-12-03 01:55:28 Re: Tablespaces in the data directory