From: | Guillaume Lelarge <guillaume(at)lelarge(dot)info> |
---|---|
To: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
Cc: | PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Maximum number of WAL files in the pg_xlog directory |
Date: | 2014-11-03 07:00:28 |
Message-ID: | CAECtzeWXY_v8-eBuC+mZRLs7y94z0ppLSHN2+2t3sJDkhyhb6g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
Le 15 oct. 2014 22:25, "Guillaume Lelarge" <guillaume(at)lelarge(dot)info> a écrit
:
>
> 2014-10-15 22:11 GMT+02:00 Jeff Janes <jeff(dot)janes(at)gmail(dot)com>:
>>
>> On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge <
guillaume(at)lelarge(dot)info> wrote:
>>>
>>> Hi,
>>>
>>> As part of our monitoring work for our customers, we stumbled upon an
issue with our customers' servers who have a wal_keep_segments setting
higher than 0.
>>>
>>> We have a monitoring script that checks the number of WAL files in the
pg_xlog directory, according to the setting of three parameters
(checkpoint_completion_target, checkpoint_segments, and wal_keep_segments).
We usually add a percentage to the usual formula:
>>>
>>> greatest(
>>> (2 + checkpoint_completion_target) * checkpoint_segments + 1,
>>> checkpoint_segments + wal_keep_segments + 1
>>> )
>>
>>
>> I think the first bug is even having this formula in the documentation
to start with, and in trying to use it.
>>
>
> I agree. But we have customers asking how to compute the right size for
their WAL file system partitions. Right size is usually a euphemism for
smallest size, and they usually tend to get it wrong, leading to huge
issues. And I'm not even speaking of monitoring, and alerting.
>
> A way to avoid this issue is probably to erase the formula from the
documentation, and find a new way to explain them how to size their
partitions for WALs.
>
> Monitoring is another matter, and I don't really think a monitoring
solution should count the WAL files. What actually really matters is the
database availability, and that is covered with having enough disk space in
the WALs partition.
>
>> "and will normally not be more than..."
>>
>> This may be "normal" for a toy system. I think that the normal state
for any system worth monitoring is that it has had load spikes at some
point in the past.
>>
>
> Agreed.
>
>>
>> So it is the next part of the doc, which describes how many segments it
climbs back down to upon recovering from a spike, which is the important
one. And that doesn't mention wal_keep_segments at all, which surely
cannot be correct.
>>
>
> Agreed too.
>
>>
>> I will try to independently derive the correct formula from the code, as
you did, without looking too much at your derivation first, and see if we
get the same answer.
>>
>
> Thanks. I look forward reading what you found.
>
> What seems clear to me right now is that no one has a sane explanation of
the formula. Though yours definitely made sense, it didn't seem to be what
the code does.
>
Did you find time to work on this? Any news?
Thanks.
From | Date | Subject | |
---|---|---|---|
Next Message | Alexey Vasiliev | 2014-11-03 11:04:00 | Patch: add recovery_timeout option to control timeout of restore_command nonzero status code |
Previous Message | Rushabh Lathia | 2014-11-03 05:12:45 | Re: CINE in CREATE TABLE AS ... and CREATE MATERIALIZED VIEW ... |