Re: Refactoring the checkpointer's fsync request queue

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Shawn Debnath <sdn(at)amazon(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Subject: Re: Refactoring the checkpointer's fsync request queue
Date: 2018-11-16 22:38:03
Message-ID: CAEepm=3c6ybA7zUCQEuFees7o6JGhNGdS3V7ay3EdCU6avDaaA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 17, 2018 at 4:05 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Nov 14, 2018 at 4:49 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
> > > But how do you make reading that counter atomic with the open() itself?
> >
> > I don't see why it has to be. As long as the "fd generation" assignment
> > happens before fsync (and writes secondarily), there ought not to be any
> > further need for synchronizity?
>
> If the goal is to have the FD that is opened first end up in the
> checkpointer's table, grabbing a counter backwards does not achieve
> it, because there's a race.
>
> S1: open FD
> S2: open FD
> S2: local_counter = shared_counter++
> S1: local_counter = shared_counter++
>
> Now S1 was opened first but has a higher shared counter value than S2
> which was opened later. Does that matter? Beats me! I just work
> here...

It's not important for the sequence numbers to match the opening order
exactly (that'd work too but be expensive to orchestrate). It's
important for the sequence numbers to be assigned before each backend
does its first pwrite(). That gives us the following interleavings to
worry about:

S1: local_counter = shared_counter++
S2: local_counter = shared_counter++
S1: pwrite()
S2: pwrite()

S1: local_counter = shared_counter++
S2: local_counter = shared_counter++
S2: pwrite()
S1: pwrite()

S1: local_counter = shared_counter++
S1: pwrite()
S2: local_counter = shared_counter++
S2: pwrite()

... plus the same interleavings with S1 and S2 labels swapped. In all
6 orderings, the fd that has the lowest sequence number was numbered
before either pwrite() call, so it can see errors relating to
write-back of kernel buffers dirtied by both calls.

Or to put it another way, you can't be given a lower sequence number
than another process that has already written, because that other
process must have been given a sequence number before it wrote.
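
To make the counting argument concrete, here is a small sketch that
enumerates the schedules above and checks the claim mechanically. It
models only the ordering of steps, not real file descriptors or kernel
write-back; the names STEPS, interleavings() and
lowest_number_assigned_before_first_write() are made up for
illustration and do not come from any patch in this thread.

```python
from itertools import permutations

# Each backend performs two steps in program order: take a sequence
# number ("seq"), then do its first write ("pwrite").
STEPS = [("S1", "seq"), ("S1", "pwrite"), ("S2", "seq"), ("S2", "pwrite")]


def interleavings():
    """Yield every schedule in which each backend is numbered before it writes."""
    for order in permutations(STEPS):
        if all(order.index((b, "seq")) < order.index((b, "pwrite"))
               for b in ("S1", "S2")):
            yield order


def lowest_number_assigned_before_first_write(order):
    """Return True if the lowest sequence number was handed out before any pwrite."""
    shared_counter = 0
    local_counter = {}     # backend -> sequence number it received
    assigned_at = {}       # backend -> position in the schedule of its "seq" step
    first_write_at = None  # position in the schedule of the earliest pwrite
    for position, (backend, step) in enumerate(order):
        if step == "seq":
            # Models: local_counter = shared_counter++
            local_counter[backend] = shared_counter
            shared_counter += 1
            assigned_at[backend] = position
        elif first_write_at is None:
            first_write_at = position
    winner = min(local_counter, key=local_counter.get)
    return assigned_at[winner] < first_write_at


schedules = list(interleavings())
results = [lowest_number_assigned_before_first_write(s) for s in schedules]
```

Under this model, schedules holds the three interleavings listed above
plus their S1/S2 mirror images, and every entry in results is True.
Whether an fd that was numbered before a write really observes that
write's write-back error is a property of the kernel, which the sketch
takes as given rather than demonstrates.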

--
Thomas Munro
http://www.enterprisedb.com
