Re: Maximum number of WAL files in the pg_xlog directory

From: Guillaume Lelarge <guillaume(at)lelarge(dot)info>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Maximum number of WAL files in the pg_xlog directory
Date: 2014-11-03 07:00:28
Message-ID: CAECtzeWXY_v8-eBuC+mZRLs7y94z0ppLSHN2+2t3sJDkhyhb6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Le 15 oct. 2014 22:25, "Guillaume Lelarge" <guillaume(at)lelarge(dot)info> a écrit
:
>
> 2014-10-15 22:11 GMT+02:00 Jeff Janes <jeff(dot)janes(at)gmail(dot)com>:
>>
>> On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge <
guillaume(at)lelarge(dot)info> wrote:
>>>
>>> Hi,
>>>
>>> As part of our monitoring work for our customers, we stumbled upon an
issue with our customers' servers who have a wal_keep_segments setting
higher than 0.
>>>
>>> We have a monitoring script that checks the number of WAL files in the
pg_xlog directory, according to the setting of three parameters
(checkpoint_completion_target, checkpoint_segments, and wal_keep_segments).
We usually add a percentage to the usual formula:
>>>
>>> greatest(
>>> (2 + checkpoint_completion_target) * checkpoint_segments + 1,
>>> checkpoint_segments + wal_keep_segments + 1
>>> )
>>
>>
>> I think the first bug is even having this formula in the documentation
to start with, and in trying to use it.
>>
>
> I agree. But we have customers asking how to compute the right size for
their WAL file system partitions. Right size is usually a euphemism for
smallest size, and they usually tend to get it wrong, leading to huge
issues. And I'm not even speaking of monitoring, and alerting.
>
> A way to avoid this issue is probably to erase the formula from the
documentation, and find a new way to explain them how to size their
partitions for WALs.
>
> Monitoring is another matter, and I don't really think a monitoring
solution should count the WAL files. What actually really matters is the
database availability, and that is covered with having enough disk space in
the WALs partition.
>
>> "and will normally not be more than..."
>>
>> This may be "normal" for a toy system. I think that the normal state
for any system worth monitoring is that it has had load spikes at some
point in the past.
>>
>
> Agreed.
>
>>
>> So it is the next part of the doc, which describes how many segments it
climbs back down to upon recovering from a spike, which is the important
one. And that doesn't mention wal_keep_segments at all, which surely
cannot be correct.
>>
>
> Agreed too.
>
>>
>> I will try to independently derive the correct formula from the code, as
you did, without looking too much at your derivation first, and see if we
get the same answer.
>>
>
> Thanks. I look forward reading what you found.
>
> What seems clear to me right now is that no one has a sane explanation of
the formula. Though yours definitely made sense, it didn't seem to be what
the code does.
>

Did you find time to work on this? Any news?

Thanks.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexey Vasiliev 2014-11-03 11:04:00 Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Previous Message Rushabh Lathia 2014-11-03 05:12:45 Re: CINE in CREATE TABLE AS ... and CREATE MATERIALIZED VIEW ...