Re: New sync commit mode remote_write

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New sync commit mode remote_write
Date: 2012-04-24 16:00:16
Message-ID: CA+TgmobS0R0c6236nJXJMCrisCqZEHHhq2-S+G22tHPi6TjBvQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 20, 2012 at 3:58 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Sat, Apr 21, 2012 at 12:20 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On Thu, Apr 19, 2012 at 7:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On 4/19/12, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>>> The work around would be for the master to refuse to automatically
>>>> restart after a crash, insisting on a fail-over instead (or a manual
>>>> forcing of recovery)?
>>>
>>> I suppose that would work, but I think Simon's idea is better: don't
>>> let the slave replay the WAL until either (a) it's promoted or (b) the
>>> master finishes the fsync.   That boils down to adding some more
>>> handshaking to the replication protocol, I think.
>>
>> It would be 8 bytes on every data message sent to the standby.
>
> There seems to be another problem to solve. In current design of streaming
> replication, we cannot send any WAL records before writing them locally.
> Which would mess up the mode which makes a transaction wait for remote
> write but not local one. We should change walsender so that it can send
> WAL records before they are written, e.g., send from wal_buffers?

In theory, writing WAL should be quick, since it only goes out to the
OS cache, and flushing it should be the slow part, since that involves
waiting for the actual disk write to complete. Some instrumentation I
shoved in here reveals that the write can in fact take a long time in
some cases: when Linux starts to get worried about the amount of dirty
data in cache, it punishes anyone who tries to write anything. I'm not
sure whether that's common enough to warrant a change here, though.
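
The write/flush distinction above can be measured roughly with a sketch
like the following. This is not the instrumentation mentioned in this
message, only an illustration: the function name time_write_and_flush
and the payload size are assumptions, and the numbers depend entirely on
the OS, filesystem, and how much dirty data is already in cache.

```python
import os
import tempfile
import time


def time_write_and_flush(payload: bytes) -> tuple[float, float]:
    """Return (write_seconds, fsync_seconds) for one payload on a temp file.

    write() normally only copies data into the OS page cache;
    fsync() waits until the data has reached stable storage.
    """
    fd, path = tempfile.mkstemp()
    try:
        t0 = time.perf_counter()
        written = 0
        # os.write may perform a short write, so loop until all bytes are out.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        t1 = time.perf_counter()
        os.fsync(fd)
        t2 = time.perf_counter()
    finally:
        os.close(fd)
        os.unlink(path)
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    # 128 pages of 8 kB each, an arbitrary size chosen for illustration.
    w, f = time_write_and_flush(b"\0" * (8192 * 128))
    print(f"write: {w * 1e3:.3f} ms, fsync: {f * 1e3:.3f} ms")
```

On an idle machine the write is usually the much cheaper step; under
heavy dirty-page pressure the write time itself can grow, which is the
case described above.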

One thing that does seem to be a problem is using WALWriteLock to
cover both the WAL write and the WAL flush. Suppose that we're
writing WAL very quickly, so that wal_buffers fills up. We can't
continue writing WAL until some of what's in the buffer has been
*written*, but the WAL writer process will grab WALWriteLock, write
*and flush* a chunk of WAL, and everybody who wants to insert WAL has
to wait for both the write and the flush. It's probably possible to
do better here. Streaming directly from wal_buffers would allow sync
rep to dodge this problem altogether, but it's a general performance
problem as well, so it would be nice to have a general solution that
improves latency and throughput across the board, if such a solution
is possible.
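
As a toy model of the locking problem described above (not PostgreSQL
code), the sketch below compares how long an inserter waits when one
lock covers both the write and the flush versus when the flush happens
under a separate lock. The names inserter_wait, WRITE_SECS, and
FLUSH_SECS are hypothetical, and the sleeps merely stand in for I/O
with assumed costs.

```python
import threading
import time

WRITE_SECS = 0.01  # assumed cost of writing WAL to the OS cache
FLUSH_SECS = 0.10  # assumed cost of flushing WAL to disk


def inserter_wait(split_locks: bool) -> float:
    """Return how long an inserter blocks waiting for the write lock."""
    write_lock = threading.Lock()
    flush_lock = threading.Lock()
    writer_has_lock = threading.Event()

    def wal_writer() -> None:
        if split_locks:
            # Write under one lock, then flush under a different one,
            # so inserters are only blocked for the write.
            with write_lock:
                writer_has_lock.set()
                time.sleep(WRITE_SECS)
            with flush_lock:
                time.sleep(FLUSH_SECS)
        else:
            # One lock covers both the write and the flush.
            with write_lock:
                writer_has_lock.set()
                time.sleep(WRITE_SECS)
                time.sleep(FLUSH_SECS)

    writer = threading.Thread(target=wal_writer)
    writer.start()
    writer_has_lock.wait()

    # The inserter needs buffer space, which only requires that older
    # WAL has been written, but it must still queue behind the lock.
    start = time.perf_counter()
    with write_lock:
        waited = time.perf_counter() - start
    writer.join()
    return waited


if __name__ == "__main__":
    print(f"single lock: {inserter_wait(False) * 1e3:.1f} ms")
    print(f"split locks: {inserter_wait(True) * 1e3:.1f} ms")
```

With the single lock the inserter waits for roughly the write plus the
flush; with split locks it waits for roughly the write alone. A real
fix would also have to track separately how far WAL has been written
and how far it has been flushed, which this model ignores.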

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
