Re: WAL Re-Writes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Jan Wieck <jan(at)wi3ck(dot)info>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL Re-Writes
Date: 2016-02-03 05:42:36
Message-ID: CAA4eK1Ko-jaPa_0ug5S+a2WCOb33mWpAniQrfRKWpb6Hb_8jog@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:

> On 1/31/16 3:26 PM, Jan Wieck wrote:
>
>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>
>>> operation. Now why OS couldn't find the corresponding block in
>>> memory is that, while closing the WAL file, we use
>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive' which
>>> lead to this problem. So with this experiment, the conclusion is that
>>> though we can avoid re-write of WAL data by doing exact writes, but
>>> it could lead to significant reduction in TPS.
>>>
>>
>> POSIX_FADV_DONTNEED isn't the only way how those blocks would vanish
>> from OS buffers. If I am not mistaken we recycle WAL segments in a round
>> robin fashion. In a properly configured system, where the reason for a
>> checkpoint is usually "time" rather than "xlog", a recycled WAL file
>> written to had been closed and not touched for about a complete
>> checkpoint_timeout or longer. You must have a really big amount of spare
>> RAM in the machine to still find those blocks in memory. Basically we
>> are talking about the active portion of your database, shared buffers,
>> the sum of all process local memory and the complete pg_xlog directory
>> content fitting into RAM.
>>
>

I think that could only be a problem if the reads were happening at the
write or fsync call, but that is not the case here. Further investigation
on this point reveals that the reads are not for the fsync operation;
rather, they happen when we call posix_fadvise(,,POSIX_FADV_DONTNEED).
This behaviour (writing in chunks that are not a multiple of the OS page
size can lead to reads if followed by a call to posix_fadvise
(,,POSIX_FADV_DONTNEED)) is not very clearly documented. The reason
for it is that the fadvise() call maps the specified data range (which in
our case is the whole file) onto a list of pages and then invalidates them,
which removes them from the OS cache; any writes that are misaligned
with respect to the OS page size can then cause additional reads, because
not everything we write ends on an OS page boundary. This theory is
based on the fadvise code [1] and some googling [2] which suggests that
misaligned writes followed by POSIX_FADV_DONTNEED can cause a similar
problem. A colleague of mine, Dilip Kumar, has verified it as well by
writing a simple program that does open/write/fsync/fadvise/close.
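
For reference, a minimal sketch of such a test sequence (not Dilip's
actual program; the file name, buffer size, and Linux-specific behaviour
of posix_fadvise are assumptions for illustration) could look like:

    /*
     * Sketch of the open/write/fsync/fadvise/close sequence described
     * above. The write length is deliberately not a multiple of the OS
     * page size; per the observation above, extra reads show up around
     * the fadvise/close step when the writes are misaligned.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        char    buf[3000];      /* not a multiple of the OS page size */
        int     fd;

        memset(buf, 'x', sizeof(buf));

        fd = open("testfile", O_WRONLY | O_CREAT, 0600);
        if (fd < 0)
            return 1;

        /* misaligned write: ends in the middle of an OS page */
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            return 1;

        if (fsync(fd) != 0)
            return 1;

        /* drop the whole file from the OS cache */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

        close(fd);
        return 0;
    }

Running this while watching iostat/vmstat should show the extra read
traffic when the write length is misaligned, and none when it is a
multiple of the page size.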

>
> But that's only going to matter when the segment is newly recycled. My
> impression from Amit's email is that the OS was repeatedly reading even in
> the same segment?
>
>
As explained above, the reads only happen during file close.

> Either way, I would think it wouldn't be hard to work around this by
> spewing out a bunch of zeros to the OS in advance of where we actually need
> to write, preventing the need for reading back from disk.
>
>
I think we can simply prohibit setting wal_chunk_size to a value other
than the OS page size or XLOG_BLCKSZ (whichever is smaller) when
wal_level is less than 'archive'. That avoids the problem of extra
reads for misaligned writes, as we won't call fadvise().

We could even choose to always write on an OS-page or XLOG_BLCKSZ
boundary (whichever is smaller); in many cases the OS page size is 4K,
which can still save significant re-writes.
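
A hypothetical illustration (not from the actual patch; the helper name
and the hard-coded XLOG_BLCKSZ are made up for this sketch) of rounding
a write up to the smaller of the OS page size and XLOG_BLCKSZ, so that
every write ends on a page boundary:

    /*
     * Round a WAL write length up to the lesser of the OS page size and
     * XLOG_BLCKSZ, so that writes always end on a page boundary and the
     * fadvise-related read-backs described above are avoided.
     */
    #include <unistd.h>

    #define XLOG_BLCKSZ 8192    /* PostgreSQL's default WAL block size */

    static size_t
    AlignedWriteLen(size_t nbytes)
    {
        size_t  pagesz = (size_t) sysconf(_SC_PAGESIZE);
        size_t  chunk = (pagesz < XLOG_BLCKSZ) ? pagesz : XLOG_BLCKSZ;

        /* round nbytes up to the next multiple of chunk */
        return ((nbytes + chunk - 1) / chunk) * chunk;
    }

On a typical Linux system this rounds to 4K rather than 8K, which is
where the saving in re-written bytes would come from.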

> Amit, did you do performance testing with archiving enabled and a no-op
> archive_command?
>

No, but what kind of advantage are you expecting from such
tests?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
