Re: WAL Re-Writes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Jan Wieck <jan(at)wi3ck(dot)info>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL Re-Writes
Date: 2016-02-09 03:42:55
Message-ID: CAA4eK1+qaXzWD7b=+XQy51QWdfPmJvpofbvSbHSofcma+ezb8Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 8, 2016 at 8:11 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Feb 8, 2016 at 12:08 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > I think deciding it automatically without user require to configure it,
> > certainly has merits, but what about some cases where user can get
> > benefits by configuring themselves like the cases where we use
> > PG_O_DIRECT flag for WAL (with o_direct, it will by bypass OS
> > buffers and won't cause misaligned writes even for smaller chunk sizes
> > like 512 bytes or so). Some googling [1] reveals that other databases
> > also provides user with option to configure wal block/chunk size (as
> > BLOCKSIZE), although they seem to decide chunk size based on
> > disk-sector size.
>
> Well, if you can prove that we need that flexibility, then we should
> have a GUC. Where's the benchmarking data to support that conclusion?
>

It is not posted as some more work is needed to complete the
benchmarks results when PG_O_DIRECT is used (mainly with
open_sync and open_datasync). I will do so. But, I think main thing
which needs to be taken care is that as smaller-chunk sized writes are
useful only in some cases, we need to ensure that users should not
get baffled by the same. So there are multiple ways to provide the same,

a) at the startup, we ensure that if the user has set smaller chunk-size
(other than 4KB which will be default as decided based on the way
described by you upthread at configure time) and it can use PG_O_DIRECT
as we decide in get_sync_bit(), then allow it, otherwise either return an
error or just set it to default which is 4KB.

b) mention in docs that it better not to tinker with wal_chunk_size guc
unless you have other relevant settings (like wal_sync_method =
open_sync or open_datasync) and wal_level as default.

c) there is yet another option which is, let us do with 4KB sized
chunks for now as the benefit for not doing so only in sub-set of
cases we can support.

The reason why I think it is beneficial to provide an option of writing in
smaller chunks is that it could lead to reduce the amount of re-writes
by higher percentage where they can be used. For example at 4KB,
there is ~35% reduction, similarly at smaller chunks it could gives us
saving unto 50% or 70% depending on the chunk_size.

>
> > An additional thought, which is not necessarily related to this patch
is,
> > if user chooses and or we decide to write in 512 bytes sized chunks,
> > which is usually a disk sector size, then can't we think of avoiding
> > CRC for each record for such cases, because each WAL write in
> > it-self will be atomic. While reading, if we process in wal-chunk-sized
> > units, then I think it should be possible to detect end-of-wal based
> > on data read.
>
> Gosh, taking CRCs off of WAL records sounds like a terrible idea. I'm
> not sure why you think that writing in sector-sized chunks would make
> that any more safe, because to me it seems like it wouldn't. But even
> if it does, it's hard to believe that we don't derive some reliability
> from CRCs that we would lose without them.
>

I think here the point is not about more-safety, rather it is about whether
writing in disk-sector sizes gives equal reliability as CRC's, because
if it does, then not doing crc calculation for each record both while
writing and during replay can save CPU and should intern lead to better
performance. Now, the reason why I thought it could give equal-reliability
is that as disk-sector writes are atomic, so it should buy us that
reliability.
I admit that much more analysis/research is required before doing that
and we can do that later if it proves to be any valuable in terms of
performance and reliability. Here, I mentioned to say that writing in
smaller chunks have other potential benefits.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-02-09 03:55:24 Re: Tracing down buildfarm "postmaster does not shut down" failures
Previous Message Noah Misch 2016-02-09 03:39:44 Re: Tracing down buildfarm "postmaster does not shut down" failures