BUG #7494: WAL replay speed depends heavily on the shared_buffers size

From: valgog(at)gmail(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #7494: WAL replay speed depends heavily on the shared_buffers size
Date: 2012-08-15 10:10:42
Message-ID: E1T1aYk-0007Jk-3h@wrigleys.postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 7494
Logged by: Valentine Gogichashvili
Email address: valgog(at)gmail(dot)com
PostgreSQL version: 9.0.7
Operating system: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
Description:

We are experiencing strange(?) behavior on the replication slave machines.
The master machine has a very heavy update load, with many processes
updating lots of data. It generates up to 30GB of WAL files per hour.
Normally it is not a problem for the slave machines to replay this amount of
WAL on time and keep up with the master. But at some moments, the slaves
“hang” with 100% CPU usage on the WAL replay process and 3% IOWait, needing
up to 30 seconds to process one WAL file. Once this tipping point is
reached, a huge WAL replication lag builds up quite fast. That also
overfills the XLOG directory on the slave machines, as the WAL receiver puts
the WAL files it gets via streaming replication into the XLOG directory
(which in many cases is a separate disk partition of quite limited size).
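
To put the numbers above in perspective, here is a rough back-of-the-envelope
sketch. It assumes the default 16 MiB WAL segment size and reads "30GB" as
30 GiB; neither is stated in the report, and the function names are only
illustrative.

```python
# Assumption: default 16 MiB WAL segment size (not stated in the report).
WAL_SEGMENT_BYTES = 16 * 1024**2


def replay_budget_seconds(wal_bytes_per_hour, segment_bytes=WAL_SEGMENT_BYTES):
    """Seconds available per segment if replay is to keep up with the master."""
    segments_per_hour = wal_bytes_per_hour / segment_bytes
    return 3600.0 / segments_per_hour


def lag_growth_bytes_per_hour(wal_bytes_per_hour, seconds_per_segment,
                              segment_bytes=WAL_SEGMENT_BYTES):
    """Bytes of WAL per hour that arrive but cannot be replayed in time."""
    replayed_per_hour = (3600.0 / seconds_per_segment) * segment_bytes
    return max(0.0, wal_bytes_per_hour - replayed_per_hour)


if __name__ == "__main__":
    rate = 30 * 1024**3  # assumption: "30GB" taken as 30 GiB per hour
    print("budget per segment (s):", replay_budget_seconds(rate))
    print("lag growth at 30 s/segment (GiB/h):",
          lag_growth_bytes_per_hour(rate, 30.0) / 1024**3)
```

Under these assumptions the slave has just under two seconds per WAL file, so
at 30 seconds per file almost all incoming WAL accumulates as lag.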

We also noticed that reducing the shared_buffers parameter on the slave
machines from our normal 20-32 GB to 2 GB increases the speed of WAL replay
dramatically. After restarting a slave machine with a much lower
shared_buffers value, replay becomes 10-20 times faster.

The attached graph shows a typical WAL replication delay for one of the
slaves.

In that graph, the small (up to 6GB) replication delay peaks during the night
are caused by long-running transactions that stop WAL replay on this slave
to prevent replication conflicts. The last, big peaks sometimes also start
with waiting for a long-running transaction on the slave, but then they keep
growing as described above.
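
A replication delay in bytes, like the one plotted, can be derived from two
WAL positions. The sketch below is one possible way to do it, not necessarily
how the graph was produced. It assumes the positions are strings such as those
returned on a 9.0 standby by pg_last_xlog_receive_location() and
pg_last_xlog_replay_location(), and that each log id covers 0xFF000000 bytes,
as in releases before 9.3. The helper names are hypothetical.

```python
# Assumption: pre-9.3 WAL layout, where each log id spans 0xFF000000 bytes
# (255 segments of 16 MiB).
BYTES_PER_LOG_ID = 0xFF000000


def xlog_location_to_bytes(location):
    """Convert a WAL position such as '1/FE000000' to an absolute byte count."""
    log_id, offset = location.split("/")
    return int(log_id, 16) * BYTES_PER_LOG_ID + int(offset, 16)


def replay_lag_bytes(received, replayed):
    """Bytes of WAL received by the standby but not yet replayed."""
    return xlog_location_to_bytes(received) - xlog_location_to_bytes(replayed)


if __name__ == "__main__":
    # Hypothetical positions, for illustration only.
    print(replay_lag_bytes("2/0", "1/FE000000"))
```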

I know that there is only one process replaying the data generated by many
threads on the master machine. But why does the replay performance depend so
much on the shared_buffers parameter, and can it be optimized?

With best regards,

Valentine Gogichashvili
