Re: patch to allow disable of WAL recycling

From: Jerry Jelinek <jerry(dot)jelinek(at)joyent(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Pacheco <dap(at)joyent(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch to allow disable of WAL recycling
Date: 2018-07-13 13:09:21
Message-ID: CACPQ5FruwJx+x_oLt-vVjJoKvBVepYqJW++CJ9-aywBAbPrhFg@mail.gmail.com
Lists: pgsql-hackers

Thanks to everyone who has taken the time to look at this patch and provide
feedback.

I'm going to wait another day to see if there are any more comments. If
not, then first thing next week, I will send out a revised patch with
improvements to the man page change as requested. If anyone has specific
things they want to be sure are covered, please just let me know.

Thanks again,
Jerry

On Thu, Jul 12, 2018 at 7:06 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> On Thu, Jul 12, 2018 at 10:52 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > I don't follow Alvaro's reasoning, TBH. There's a couple of things that
> > confuse me ...
> >
> > I don't quite see how reusing WAL segments actually protects against full
> > filesystem? On "traditional" filesystems I would not expect any difference
> > between "unlink+create" and reusing an existing file. On CoW filesystems
> > (like ZFS or btrfs) the space management works very differently and reusing
> > an existing file is unlikely to save anything.
>
> Yeah, I had the same thoughts.
>
> > But even if it reduces the likelihood of ENOSPC, it does not eliminate it
> > entirely. max_wal_size is not a hard limit, and the disk may be filled by
> > something else (when WAL is not on a separate device, when there is thin
> > provisioning, etc.). So it's not a protection against data corruption we
> > could rely on. (And as was discussed in the recent fsync thread, ENOSPC is a
> > likely source of past data corruption issues on NFS and possibly other
> > filesystems.)
>
> Right. That ENOSPC discussion was about checkpointing though, not
> WAL. IIUC the hypothesis was that there may be stacks (possibly
> involving NFS or thin provisioning, or perhaps historical versions of
> certain local filesystems that had reservation accounting bugs, on a
> certain kernel) that could let you write() a buffer, and then later
> when the checkpointer calls fsync() the filesystem says ENOSPC, the
> kernel reports that and throws away the dirty page, and then at next
> checkpoint fsync() succeeds but the checkpoint is a lie and the data
> is smoke.
>
> We already PANIC on any errno except EINTR in XLogWriteLog(), as seen
> in Jerry's nearby stack trace, so that failure mode seems to be
> covered already for WAL, no?
>
> > AFAICS the original reason for reusing WAL segments was the belief that
> > overwriting an existing file is faster than writing a new file. That might
> > have been true in the past, but the question is if it's still true on
> > current filesystems. The results posted here suggest it's not true on ZFS,
> > at least.
>
> Yeah.
>
> The wal_recycle=on|off patch seems reasonable to me (modulo Andres's
> comments about the documentation; we should make sure that the 'off'
> setting isn't accidentally recommended to the wrong audience) and I
> vote we take it.
>
> Just by the way, if I'm not mistaken ZFS does avoid faulting when
> overwriting whole blocks, just like other filesystems:
>
> https://github.com/freebsd/freebsd/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L1034
>
> So then where are those faults coming from? Perhaps the tree page
> that holds the block pointer, of which there must be many when the
> recordsize is small?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
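
To make the "unlink+create" versus "reuse an existing file" distinction in the quoted discussion concrete, here is a minimal sketch in Python. It is a toy model, not PostgreSQL code: the helper names (create_segment, retire_segment, demo), the tiny segment size, and the use of a temporary directory are all assumptions made for illustration. The only behavior taken from the thread is the contrast itself: with recycling, an old segment file is kept under a future segment's name and overwritten later; without it, the old file is removed and a fresh file is created when needed.

```python
import os
import tempfile

SEGMENT_SIZE = 16 * 1024  # toy size for the demo; real segments default to 16 MB


def create_segment(path, size=SEGMENT_SIZE):
    """Create a new zero-filled segment file and flush it to disk."""
    with open(path, "wb") as f:
        f.write(b"\0" * size)
        f.flush()
        os.fsync(f.fileno())


def retire_segment(old_path, future_path, recycle):
    """Dispose of a segment that is no longer needed (hypothetical helper).

    recycle=True:  rename the old file so it can be overwritten in place later.
    recycle=False: unlink it; a fresh file must be created when it is needed.
    """
    if recycle:
        os.rename(old_path, future_path)
    else:
        os.unlink(old_path)


def demo(recycle):
    """Retire one segment, then make sure the next segment exists."""
    with tempfile.TemporaryDirectory() as d:
        old = os.path.join(d, "000000010000000000000001")
        future = os.path.join(d, "000000010000000000000002")
        create_segment(old)
        retire_segment(old, future, recycle)
        # Without recycling, the next segment has to be allocated from scratch.
        created_new = not os.path.exists(future)
        if created_new:
            create_segment(future)
        return {
            "old_exists": os.path.exists(old),
            "created_new": created_new,
            "size": os.path.getsize(future),
        }
```

Calling demo(True) and demo(False) ends with a segment of the same size on disk either way; the difference is only whether a new file had to be allocated. On a copy-on-write filesystem, overwriting the renamed file allocates new blocks anyway, which is the point Tomas raises above.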

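The rule Thomas mentions for WAL writes (retry on EINTR, PANIC on any other errno) can be sketched as a small write loop. This is an illustration under stated assumptions, not the server's actual code: WalPanic and write_all are hypothetical names, and the writer is passed in as a callable so the retry path is visible, since Python's own os.write already retries EINTR internally (PEP 475).

```python
import errno


class WalPanic(Exception):
    """Stands in for a PANIC: the failure is treated as unrecoverable."""


def write_all(write_fn, data):
    """Write all of `data` using write_fn(chunk) -> number of bytes written.

    An interrupted call (EINTR) is retried; any other OS error is escalated
    to WalPanic, so a failure such as ENOSPC cannot be silently ignored.
    """
    view = memoryview(data)
    written = 0
    while written < len(view):
        try:
            n = write_fn(view[written:])
        except InterruptedError:
            # EINTR: nothing was written, so try the same chunk again.
            continue
        except OSError as exc:
            raise WalPanic(f"could not write to log file: {exc}") from exc
        if n <= 0:
            # Guard for this sketch: avoid spinning forever on a zero-length write.
            raise WalPanic("write made no progress")
        written += n
    return written


# Example: a writer that is interrupted once, then accepts 3 bytes per call.
sink = bytearray()
state = {"calls": 0}


def flaky_writer(chunk):
    state["calls"] += 1
    if state["calls"] == 1:
        raise InterruptedError(errno.EINTR, "Interrupted system call")
    part = bytes(chunk[:3])
    sink.extend(part)
    return len(part)


total = write_all(flaky_writer, b"abcdefgh")
```

A writer that raises OSError(errno.ENOSPC, ...) makes write_all raise WalPanic on the first call. That is the property being discussed: the error surfaces at write time instead of being lost before a later fsync().
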