Re: Maximum number of WAL files in the pg_xlog directory

From: Guillaume Lelarge <guillaume(at)lelarge(dot)info>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Maximum number of WAL files in the pg_xlog directory
Date: 2014-10-15 20:25:16
Message-ID: CAECtzeWtviowR9w7wTD63uCF6NPNiVm05K0MvAF=d_jNH==wGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2014-10-15 22:11 GMT+02:00 Jeff Janes <jeff(dot)janes(at)gmail(dot)com>:

> On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge <guillaume(at)lelarge(dot)info
> > wrote:
>
>> Hi,
>>
>> As part of our monitoring work for our customers, we stumbled upon an
>> issue with our customers' servers who have a wal_keep_segments setting
>> higher than 0.
>>
>> We have a monitoring script that checks the number of WAL files in the
>> pg_xlog directory, according to the setting of three parameters
>> (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments).
>> We usually add a percentage to the usual formula:
>>
>> greatest(
>> (2 + checkpoint_completion_target) * checkpoint_segments + 1,
>> checkpoint_segments + wal_keep_segments + 1
>> )
>>
>
> I think the first bug is even having this formula in the documentation to
> start with, and in trying to use it.
>
>
I agree. But we have customers asking how to compute the right size for
their WAL file system partitions. Right size is usually a euphemism for
smallest size, and they usually tend to get it wrong, leading to huge
issues. And I'm not even speaking of monitoring, and alerting.

A way to avoid this issue is probably to erase the formula from the
documentation, and find a new way to explain them how to size their
partitions for WALs.

Monitoring is another matter, and I don't really think a monitoring
solution should count the WAL files. What actually really matters is the
database availability, and that is covered with having enough disk space in
the WALs partition.

"and will normally not be more than..."
>
> This may be "normal" for a toy system. I think that the normal state for
> any system worth monitoring is that it has had load spikes at some point in
> the past.
>
>
Agreed.

> So it is the next part of the doc, which describes how many segments it
> climbs back down to upon recovering from a spike, which is the important
> one. And that doesn't mention wal_keep_segments at all, which surely
> cannot be correct.
>
>
Agreed too.

> I will try to independently derive the correct formula from the code, as
> you did, without looking too much at your derivation first, and see if we
> get the same answer.
>
>
Thanks. I look forward reading what you found.

What seems clear to me right now is that no one has a sane explanation of
the formula. Though yours definitely made sense, it didn't seem to be what
the code does.

--
Guillaume.
http://blog.guillaume.lelarge.info
http://www.dalibo.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2014-10-15 21:09:44 Re: jsonb generator functions
Previous Message Andres Freund 2014-10-15 20:23:39 Re: WIP: dynahash replacement for buffer table