Re: patch to allow disable of WAL recycling

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: David Pacheco <dap(at)joyent(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Jerry Jelinek <jerry(dot)jelinek(at)joyent(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch to allow disable of WAL recycling
Date: 2018-07-13 01:06:07
Message-ID: CAEepm=17dBEvzV-je57S8_GDZJEg_O=HmRECvacACAxPQBCg8w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 12, 2018 at 10:52 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> I don't follow Alvaro's reasoning, TBH. There's a couple of things that
> confuse me ...
>
> I don't quite see how reusing WAL segments actually protects against a full
> filesystem? On "traditional" filesystems I would not expect any difference
> between "unlink+create" and reusing an existing file. On CoW filesystems
> (like ZFS or btrfs) the space management works very differently and reusing
> an existing file is unlikely to save anything.

Yeah, I had the same thoughts.

> But even if it reduces the likelihood of ENOSPC, it does not eliminate it
> entirely. max_wal_size is not a hard limit, and the disk may be filled by
> something else (when WAL is not on a separate device, when there is thin
> provisioning, etc.). So it's not a protection against data corruption we
> could rely on. (And as was discussed in the recent fsync thread, ENOSPC is a
> likely source of past data corruption issues on NFS and possibly other
> filesystems.)

Right. That ENOSPC discussion was about checkpointing though, not
WAL. IIUC the hypothesis was that there may be stacks (possibly
involving NFS or thin provisioning, or perhaps historical versions of
certain local filesystems that had reservation accounting bugs, on a
certain kernel) that could let you write() a buffer, and then later
when the checkpointer calls fsync() the filesystem says ENOSPC, the
kernel reports that and throws away the dirty page, and then at next
checkpoint fsync() succeeds but the checkpoint is a lie and the data
is smoke.
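
To make that hypothesized sequence concrete, here is a toy model in
Python. It is not PostgreSQL or kernel code: FakeFile and
naive_checkpoint are names invented for illustration, and the
"dirty pages are dropped after a failed fsync" behaviour is simply
the assumption stated above, hard-coded.

```python
import errno


class FakeFile:
    """Toy model of a file whose dirty pages are dropped after a failed fsync."""

    def __init__(self, fail_next_fsync=False):
        self.dirty = []      # data accepted by write() but not yet durable
        self.durable = []    # data that actually reached storage
        self.fail_next_fsync = fail_next_fsync

    def write(self, data):
        # write() succeeds because it only touches the cache.
        self.dirty.append(data)
        return len(data)

    def fsync(self):
        if self.fail_next_fsync:
            # The error is reported once and the dirty pages are discarded.
            self.fail_next_fsync = False
            self.dirty.clear()
            raise OSError(errno.ENOSPC, "No space left on device")
        self.durable.extend(self.dirty)
        self.dirty.clear()


def naive_checkpoint(f):
    """Call fsync and trust its result, retrying on the next cycle if it fails."""
    try:
        f.fsync()
        return True
    except OSError:
        return False
```

With f = FakeFile(fail_next_fsync=True) and one f.write(b"page"), the
first naive_checkpoint(f) returns False and the second returns True,
yet f.durable is still empty: the checkpoint "succeeded" without the
page ever reaching storage.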

We already PANIC on any errno except EINTR in XLogWrite(), as seen
in Jerry's nearby stack trace, so that failure mode seems to be
covered already for WAL, no?
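
As a rough sketch of that policy (retry on EINTR, treat every other
errno as fatal), something like the following. This is Python for
brevity, not the actual C code; write_fully and WalWriteFailure are
made-up names, and the exception merely stands in for PANIC. Note
that Python 3.5 and later already retry EINTR inside os.write(), so
the explicit branch is here only to mirror the logic.

```python
import os


class WalWriteFailure(Exception):
    """Stand-in for a PANIC: the caller must not carry on after this."""


def write_fully(fd, data, write=os.write):
    """Write all of data to fd; retry on EINTR, fail hard on anything else."""
    view = memoryview(data)
    written = 0
    while written < len(view):
        try:
            n = write(fd, view[written:])
        except InterruptedError:
            # EINTR: the call was interrupted before writing, so retry.
            continue
        except OSError as exc:
            # ENOSPC, EIO, ...: no retry, no partial success.
            raise WalWriteFailure(f"write failed: {exc.strerror}") from exc
        written += n  # handle short writes by looping
    return written
```

The write parameter exists only so that a failing writer can be
injected when experimenting; by default it is os.write.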

> AFAICS the original reason for reusing WAL segments was the belief that
> overwriting an existing file is faster than writing a new file. That might
> have been true in the past, but the question is if it's still true on
> current filesystems. The results posted here suggest it's not true on ZFS,
> at least.

Yeah.

The wal_recycle=on|off patch seems reasonable to me (modulo Andres's
comments about the documentation; we should make sure that the 'off'
setting isn't accidentally recommended to the wrong audience) and I
vote we take it.
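
For anyone skimming the thread, the two behaviours the setting
chooses between look roughly like this. It is a simplified sketch in
Python, not the patch itself: the function names and paths are
hypothetical, and the real code has more steps (locking, durable
rename, choosing the future segment name).

```python
import os

SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size


def recycle_segment(old_path, new_path):
    """Recycling: keep the old file's blocks and rename it for future use."""
    os.rename(old_path, new_path)


def replace_segment(old_path, new_path, size=SEGMENT_SIZE):
    """No recycling: remove the old file and create a fresh zero-filled one."""
    os.unlink(old_path)
    with open(new_path, "wb") as f:
        f.write(b"\0" * size)
        f.flush()
        os.fsync(f.fileno())
```

On an overwrite-in-place filesystem the first variant later rewrites
blocks that are already allocated; on a CoW filesystem every write
allocates new blocks either way, which is the crux of the argument
for making this configurable.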

Just by the way, if I'm not mistaken ZFS does avoid faulting when
overwriting whole blocks, just like other filesystems:

https://github.com/freebsd/freebsd/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L1034

So then where are those faults coming from? Perhaps the tree page
that holds the block pointer, of which there must be many when the
recordsize is small?

--
Thomas Munro
http://www.enterprisedb.com
