Re: WAL Re-Writes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Jan Wieck <jan(at)wi3ck(dot)info>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL Re-Writes
Date: 2016-02-03 12:28:05
Message-ID: CAA4eK1KG4pO5x5Z_Sum0u2FG66xowajz1qcoe=6+mY5_Q1x0+w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 3, 2016 at 11:12 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>
> On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
wrote:
>>
>> On 1/31/16 3:26 PM, Jan Wieck wrote:
>>>
>>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>>>
>>>> operation. Now the reason the OS couldn't find the corresponding
>>>> block in memory is that, while closing the WAL file, we use
>>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which
>>>> leads to this problem. So with this experiment, the conclusion is
>>>> that though we can avoid re-writing WAL data by doing exact writes,
>>>> it could lead to a significant reduction in TPS.
>>>
>>>
>>> POSIX_FADV_DONTNEED isn't the only way those blocks can vanish
>>> from the OS buffers. If I am not mistaken, we recycle WAL segments in
>>> a round-robin fashion. In a properly configured system, where the
>>> reason for a checkpoint is usually "time" rather than "xlog", a
>>> recycled WAL file being written to has been closed and untouched for
>>> about a complete checkpoint_timeout or longer. You must have a really
>>> big amount of spare RAM in the machine to still find those blocks in
>>> memory. Basically, we are talking about the active portion of your
>>> database, shared buffers, the sum of all process-local memory, and
>>> the complete pg_xlog directory content fitting into RAM.
>
> I think that could only be a problem if reads were happening at write
> or fsync calls, but that is not the case here. Further investigation on
> this point reveals that the reads are not for the fsync operation;
> rather, they happen when we call posix_fadvise(,,POSIX_FADV_DONTNEED).
> Although this behaviour (writing in non-OS-page-cache-size chunks can
> lead to reads if followed by a call to posix_fadvise
> (,,POSIX_FADV_DONTNEED)) is not very clearly documented, the reason
> for it is that the fadvise() call maps the specified data range (which
> in our case is the whole file) onto a list of pages and then
> invalidates them, removing them from the OS cache. Any write misaligned
> with respect to the OS page size, done while writing or fsyncing the
> file, can then cause additional reads, because not everything we write
> falls on an OS page boundary.
>
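
For concreteness, the pattern under discussion boils down to something
like the following standalone sketch (the 512-byte write size and file
name are made up for illustration; this is not the server code, and the
kernel-level reads are only visible at the block layer, e.g. via iostat):

    /*
     * Minimal sketch of the write-then-fadvise pattern described above:
     * a write smaller than the OS page dirties a partial page, and the
     * later POSIX_FADV_DONTNEED on the whole file invalidates the pages
     * covering it, which can trigger reads for the misaligned portion.
     */
    #define _XOPEN_SOURCE 600       /* for posix_fadvise() on glibc */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        char    buf[512];
        int     fd;

        memset(buf, 'x', sizeof(buf));

        fd = open("fadvise_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return 1;

        /* misaligned write: 512 bytes, smaller than the (typically 4K) OS page */
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            return 1;
        if (fsync(fd) != 0)
            return 1;

        /* drop the whole file from the OS cache, as done at WAL file close */
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

        return close(fd);
    }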

On further testing, it has been observed that misaligned writes can
cause reads even when the blocks related to the file are not in memory,
so I think what Jan is describing is right. The only case with
absolutely zero chance of reads is when we write on an OS page
boundary, which is generally 4K. However, I still think it is okay to
provide an option to write WAL in smaller chunks (512 bytes, 1024
bytes, etc.) for the cases where these are beneficial, such as when
wal_level is greater than or equal to 'archive', and to keep the
default at the OS page size when it is smaller than 8K.
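
In code terms, the sizing could look roughly like this (a sketch under
the assumptions just stated; wal_write_chunk is a hypothetical name for
the option, not an existing GUC):

    /*
     * Sketch of chunked WAL write sizing.  XLOG_BLCKSZ is the existing
     * 8K WAL block size; wal_write_chunk would be the new option (512,
     * 1024, ... up to the block size, default = OS page size).
     */
    #include <stddef.h>

    #define XLOG_BLCKSZ 8192

    static size_t
    wal_write_len(size_t dirty_bytes, size_t wal_write_chunk)
    {
        /* round the dirty part of the block up to the next chunk boundary */
        size_t  len = ((dirty_bytes + wal_write_chunk - 1) / wal_write_chunk)
                      * wal_write_chunk;

        return (len > XLOG_BLCKSZ) ? XLOG_BLCKSZ : len;
    }

For example, with 300 dirty bytes a 4096-byte chunk writes 4096 bytes
instead of the full 8192-byte block, while a 512-byte chunk writes only
512 bytes, at the risk of the fadvise-induced reads discussed above.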

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
