Re: Refactoring the checkpointer's fsync request queue

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Shawn Debnath <sdn(at)amazon(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Subject: Re: Refactoring the checkpointer's fsync request queue
Date: 2018-11-12 02:58:41
Message-ID: CAEepm=01Qx5XGyE41aME5w7MXAdidfw-+AssfR6A6OT_z=ba8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

(Adding Dmitry to CC list.)

On Tue, Oct 16, 2018 at 12:02 AM Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> There is one major fly in the ointment: fsyncgate[1]. Originally I
> planned to propose a patch on top of that one, but it's difficult --
> both patches move a lot of the same stuff around. Personally, I don't
> think it would be a very good idea to back-patch that anyway. It'd be
> riskier than the problem it aims to solve, in terms of bugs and
> hard-to-foresee portability problems IMHO. I think we should consider
> back-patching some variant of Craig Ringer's PANIC patch, and consider
> this redesigned approach for future releases.
>
> So, please find attached the WIP patch that I would like to propose
> for PostgreSQL 12, under a separate Commitfest entry. It incorporates
> the fsyncgate work by Andres Freund (original file descriptor transfer
> POC) and me (many bug fixes and improvements), and the refactoring
> work as described above.

Here is a rebased version of the patch, post pread()/pwrite(). I have
also rewritten the commit message to try to explain the rationale
concisely, instead of requiring the reader to consult multiple
discussions that jump between lengthy email threads to understand the
key points.

There is one major problem with this patch: BufferSync(), run in the
checkpointer, can deadlock against a backend that holds a buffer lock
and is blocked in SendFsyncRequest(). To break this deadlock, we need
way out of it on either the sending or receiving side. Here are three
ideas:

1. Go back to the current pressure-valve strategy: make the sending
side perform the fsync(), if it can't immediately write to the pipe.

2. Offload the BufferSync() work to bgwriter, so the checkpointer can
keep draining the pipe. Communication between checkpointer and
bgwriter can be fairly easily multiplexed with the pipe draining work.

3. Multiplex the checkpointer's work: Use LWLockConditionalAcquire()
when locking buffers, and if that fails, try to drain the pipe, and
then fall back to a LWLockTimedAcquire(), drain pipe, repeat loop. I
can hear you groan already; that doesn't seem particularly elegant,
and there are portability problems implementing LWLockTimedAcquire():
semtimedop() and sem_timedwait() are not available on all platforms
(eg macOS). Maybe pthread_timed_condwait() could help (!).

I'm not actually sure if idea 1 is correct, and I also don't like that
behaviour under pressure, and I think pressure is more likely than in
current master (since we gave up sender-side queue compaction, and it
seems quite easy to fill up the pipe's buffer). Number 2 appeals to
me the most right now, but I haven't looked into the details or tried
it yet. Number 3 is a straw man that perhaps helps illustrate the
problem but involves taking on unnecessary new problems and seems like
a non-starter. So, is there any reason the bgwriter process shouldn't
do that work on the checkpointer's behalf? Is there another way?

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
0001-Refactor-the-checkpointer-s-data-sync-request-que-v2.patch application/x-patch 114.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2018-11-12 03:07:52 Re: zheap: a new storage format for PostgreSQL
Previous Message Amit Langote 2018-11-12 02:47:27 Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation