From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: increasing the default WAL segment size
Date: 2016-08-25 01:31:35
Message-ID: CA+TgmoZctR8Sqvgxp2-_fncsgvQSCaYZJ7e+oF8XnNLnJwOQ8Q@mail.gmail.com

Hi,

I'd like to propose that we increase the default WAL segment size,
which is currently 16MB. It was first set to that value in commit
47937403676d913c0e740eec6b85113865c6c8ab in October of 1999; prior to
that, it was 64MB. Between 1999 and now, there have been three
significant changes that make me think it might be time to rethink
this value:

1. Transaction rates are vastly higher these days. In 1999, I think
we were still limited to ~2^32 transactions during the entire lifetime
of the server; transaction ID wraparound hadn't been invented yet.[1]
Today, some installations do that many write transactions in under a
week. The practical consequence of this is that WAL files fill up in
extremely short periods of time. Some users generate multiple
terabytes of WAL per day, which means they are generating - and very
likely archiving - WAL files at a rate of greater than 1 per second!
That poses multiple problems. For example, if your archive command
happens to involve ssh, you might run into trouble because of this
sort of thing:

[rhaas pgsql]$ /usr/bin/time ssh hydra true
1.57 real 0.00 user 0.00 sys
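
To make that rate concrete, here is the back-of-the-envelope arithmetic,
assuming (hypothetically) 2TB of WAL per day split into 16MB segments:

$ echo $(( 2 * 1024 * 1024 / 16 )) segments per day
131072 segments per day
$ echo "scale=2; 131072 / 86400" | bc    # segments per second
1.51

At that rate a new segment is ready to archive roughly every two-thirds
of a second, which is less than the ssh round trip shown above.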

Also, your operating system's implementation of directories and the
commands to work with them (like ls) don't necessarily scale well to
tens or hundreds of thousands of archived files.

Furthermore, there is an enforced, synchronous fsync at the end of
every segment, which actually does hurt performance on write-heavy
workloads.[2] Of course, if that were the only reason to consider
increasing the segment size, it would probably make more sense to just
try to push that extra fsync into the background, but that's not
really the case. From what I hear, the gigantic number of files is a
bigger pain point.

2. Disks are a bit larger these days. In the worst case, we waste
just under twice as much space as whatever the segment size is: you
might need 1 byte from the oldest segment you're keeping and 1 byte
from the newest segment that you are keeping, but not the remaining
contents of either file. In 1999, trying to limit disk wastage to
<32MB probably seemed reasonable, but today that's very little disk
space. I think at that time typical hard drive sizes were around 10
GB, whereas today they are around 1 TB.[3] I'm not sure whether the
size of the sorts of high-performance storage that is likely to be
used for pg_xlog has grown as fast as hard drives generally, but even
so it seems pretty clear to me that trying to limit disk wastage to
32MB is excessively conservative on modern hardware.
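
To put some rough numbers on that (assuming a 1TB volume): even with
64MB segments, the worst-case wastage of just under 128MB is around
0.01% of the disk:

$ echo "scale=4; 128 * 100 / (1024 * 1024)" | bc    # 128MB as a % of 1TB
.0122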

3. archive_timeout is no longer a frequently used option. Obviously,
if you are frequently archiving partial segments, you don't want the
segment size to be too large, because if it is, each forced segment
switch potentially wastes a large amount of space (and bandwidth).
But given streaming replication and pg_receivexlog, the use case for
archiving partial segments is, at least according to my understanding,
a lot narrower than it used to be. So, I think we don't have to worry
as much about keeping forced segment switches cheap as we did during
the 8.x series.

Given those three factors, I think we should consider pushing the
default value somewhat higher for v10. Reverting to the 64MB
size that we had prior to 47937403676d913c0e740eec6b85113865c6c8ab
sounds pretty reasonable. Users with really high transaction rates
might even prefer a higher value (e.g. 256MB, 1GB) but that's hardly
practical for small installs given our default of max_wal_size = 1GB.
Possibly it would make sense for this to be configurable at initdb
time instead of requiring a recompile; we probably don't save any
significant number of cycles by compiling this into the server.
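
For comparison, here is roughly what changing the segment size looks
like today via the build-time option, versus what an initdb-time knob
might look like (the initdb flag below is purely hypothetical, just to
illustrate the idea):

$ # today: the segment size is baked in at build time
$ ./configure --with-wal-segsize=64    # size in MB, a power of 2
$ make && make install

$ # proposed: choose it per-cluster at initdb time, no recompile
$ initdb -D $PGDATA --wal-segsize=64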

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] I believe at that time we consumed an XID even for a read-only
transaction, too; today, we can do 2^32 read transactions in a few
hours.
[2] Amit did some benchmarking on this, I believe, but I don't have
the numbers handy.
[3] https://commons.wikimedia.org/wiki/File:Hard_drive_capacity_over_time.png
