BUG #4801: Performance failure: 600 MB written to each WAL log

From: "Peter Much" <pmc(at)citylink(dot)dinoex(dot)sub(dot)org>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #4801: Performance failure: 600 MB written to each WAL log
Date: 2009-05-08 12:26:51
Message-ID: 200905081226.n48CQpeb004919@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


The following bug has been logged online:

Bug reference: 4801
Logged by: Peter Much
Email address: pmc(at)citylink(dot)dinoex(dot)sub(dot)org
PostgreSQL version: 8.2.13
Operating system: FreeBSD 7.2
Description: Performance failure: 600 MB written to each WAL log
Details:

Server becomes very slow while tremendous amounts of data are written
to WAL logs - disk activity log shows 600 MB of effective write requests
before each 16 MB WAL log is archived and recycled.

Problem appears occasionally all of a sudden with no apparent reason.
I perveived it with 8.2.5(?), 8.2.11 and now 8.2.13, with FreeBSD Rel. 6.3
and 7.2.

I perceived it only during "vaccum full" yet, but then a long-running
"vacuum full" on a large table is the only significant activity on the
server. (I know that "vacuum full" may not be really necessary, but
I think it should practically function.)

When starting the "vacuum full", the activity looks what I think normal:
there is heavy activity on the database table files, accompanied by
a certain amount of accesses to the WAL logs. The amount of disk writes
to WAL then figures (more or less) to the WAL switching frequency.
And it may stay so and the vacuum may complete successful.

But sometimes (same "vacuum full", same table) the server suddenly
changes into a very slow state of proceeding: then there is only few
activity on the database table disk, while the WAL disk is busy writing
at 100%. But the WAL logs are filled very slow, too.
I can watch the proceeding of the vacuum by listing the open file
handle offsets on the database table files (with lsof). And the ratio
between these offsets proceeding and the number of archived WAL logs
is the same in both states.

Also, the server functions correctly by all means when in this slow
state, and all requests are processed. And there are no error messages.
The only difference is that in normal state of operation the amount of
data written to one WAL log is just some more than the 16 MB it contains,
while in the slow state it is about 600 MB - and this slows down
operation. (The vacuum job must be killed then, because it would take
days instead of hours to complete.)

I have not really figured out the conditions that do bring the server
into this slow mode of operation. It seems to happen when a certain
amount of other activity (on other databases within the same server)
occurs. That is, as long as there is not much other activity on the
system, the vacuum will usually complete with normal speed.

But after the server has fallen into the slow state, it will not get
out of it again. It has to be stopped and restarted to get back to
normal speed; killing and restarting the vacuum process will not
help.

The WAL handling configuration in the config file is as default, only
I do WAL archiving. I have tried to increae wal_buffers, this had no
effect.
I have moved the whole installation from FreeBSD-UFS diskspace to ZFS,
and have switched off full_page_writes. This does significantly reduce
the absolute amount of produced WALs, and therefore the slow state of
operation becomes faster than before, but besides this the behaviour
did not change.

The command "show wal_sync_method" yields "fsync".

Over all, I do not think that the problem relates to WAL storage. It
seems to be mainly the question: what is this tremendous amount of
-obviousely redundant- data that is written to WAL, and how do I get
the server out of this mode again without disrupting operation?

One other thing should be remarked here: The server is a very small
system. It has not much memory and will do a real lot of paging; the
disks are also not fast.
I occasionally experience that under such conditions software shows
bugs that would never become visible on the usual oversized systems.

Configuration of the server is as follows (as changed from defaults):
> max_connections = 40
> shared_buffers = 28MB
> max_fsm_pages = 179200
> effective_cache_size = 12MB
> autovacuum = off
(autovacuum is off to reduce noise - disks will spindown when
databases are not used.)

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Rafael Martinez 2009-05-08 12:59:04 Re: BUG #4801: Performance failure: 600 MB written to each WAL log
Previous Message Alex 2009-05-08 10:36:18 BUG #4800: constraint_exclusion could be smarter with bool conversion