PATCH: Split stats file per database WAS: autovacuum stress-testing our system

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Date: 2013-01-02 23:15:45
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


attached is a version of the patch that I believe is ready for the
commitfest. As the patch was discussed over a large number of messages,
I've prepared a brief summary for those who did not follow the thread.


The patch aims to improve the situation in deployments with many tables
in many databases (think for example 1000 tables in 1000 databases).
Currently all the stats for all the objects (dbs, tables and functions)
are written in a single file (pgstat.stat), which may get quite large
and that consequently leads to various issues:

1) I/O load - the file is frequently written / read, which may use a
significant part of the I/O bandwidth. For example we'have to deal with
cases when the pgstat.stat size is >150MB, and it's written (and read)
continuously (once it's written, a new write starts) and utilizes 100%
bandwidth on that device.

2) CPU load - a common solution to the previous issue is moving the file
into RAM, using a tmpfs filesystem. That "fixes" the I/O bottleneck but
causes high CPU load because the system is serializing and deserializing
large amounts of data. We often see ~1 CPU core "lost" due to this (and
causing higher power consumption, but that's Amazon's problem ;-)

3) disk space utilization - the pgstat.stat file is updated in two
steps, i.e. a new version is written to another file (pgstat.tmp) and
then it's renamed to pgstat.stat, which means the device (amount of RAM,
if using tmpfs device) needs to be >2x the actual size of the file.
(Actually more, because there may be descriptors open to multiple
versions of the file.)

This patch does not attempt to fix a "single DB with multiple schemas"
scenario, although it should not have a negative impact on it.

What the patch does

1) split into global and per-db files

The patch "splits" the huge pgstat.stat file into smaller pieces - one
"global" one (global.stat) with database stats, and one file for each of
the databases (oid.stat) with table.

This makes it possible to write/read much smaller amounts of data, because

a) autovacuum launcher does not need to read the whole file - it needs
just the list of databases (and not the table/func stats)

b) autovacuum workers do request a fresh copy of a single database, so
the stats collector may write just the global.stat + one of the per-db files

and that consequently leads to much lower I/O and CPU load. During our
tests we've seen the I/O to drop from ~150MB/s to less than 4MB/s, and
much lower CPU utilization.

2) a new global/stat directory

The pgstat.stat file was originally saved into the "global" directory,
but with so many files that would get rather messy so I've created a new
global/stat directory and all the files are stored there.

This also means we can do a simple "delete files in the dir" when
pgstat_reset_all is called.

3) pgstat_(read|write)_statsfile split

These two functions were moved into a global and per-db functions, so
now there's

pgstat_write_statsfile -- global.stat
pgstat_write_db_statsfile -- oid.stat

pgstat_read_statsfile -- global.stat
pgstat_read_db_statsfile -- oid.stat

There's a logic to read/write only those files that are actually needed.

4) list of (OID, timestamp) inquiries, last db-write

Originally there was a single pair of request/write timestamps for the
whole file, updated whenever a worker requested a fresh file or when the
file was written.

With the split, this had to be replaced by two lists - a timestamp of
the last inquiry (per DB), and a timestamp when each database file was
written for the last time.

The timestamp of the last DB write was added to the PgStat_StatDBEntry
and the list of inquiries is kept in last_statrequests. The fields are
used at several places, so it's probably best to see the code.

Handling the timestamps is a rather complex stuff because of the clock
skews. One of those checks is not needed as the list of inquiries is
freed right after writing all the databases. But I wouldn't be surprised
if there was something I missed, as splitting the file into multiple
pieces made this part more complex.

So please, if you're going to review this patch this is one of the
tricky places.

5) dummy file

A special handling is necessary when an inquiry arrives for a database
without a PgStat_StatDBEntry - this happens for example right after
initdb, when there are no stats for template0 and template1, yet the
autovacuum workers do send inqiries for them.

The backend_read_statsfile now uses the timestamp stored in the header
of the per-db file (not in the global one), and the easies way to handle
this for new databases is writing an empty 'dummy file' (just a header
with timestamp). Without this, this would result in 'pgstat wait
timeout' errors.

This is what pgstat_write_db_dummyfile (used in pgstat_write_statsfile)
is for.

6) format ID

I've bumped PGSTAT_FILE_FORMAT_ID to a new random value, although the
filenames changed to so we could live with the old ID just fine.

We've done a fair amount of testing so far, and if everything goes fine
we plan to deploy a back-ported version of this patch (to 9.1) on a
production in ~2 weeks.

Then I'll be able to provide some numbers from a real-world workload
(although our deployment and workload is not quite usual I guess).


Attachment Content-Type Size
stats-split-v4.patch text/plain 37.3 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Janes 2013-01-02 23:25:48 Re: dynamic SQL - possible performance regression in 9.2
Previous Message Robert Haas 2013-01-02 22:51:55 Re: proposal: ANSI SQL 2011 syntax for named parameters