Re: Inconsistent DB data in Streaming Replication

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Sameer Thakur <samthakur74(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, sthomas(at)optionshouse(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Samrat Revagade <revagade(dot)samrat(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: Inconsistent DB data in Streaming Replication
Date: 2013-04-11 17:35:08
Message-ID: CAHGQGwHbQLXmt3Ci0bA_cxG=VvOhSE1HSUSVu_QofZ2fFHb-_Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 12, 2013 at 12:09 AM, Ants Aasma <ants(at)cybertec(dot)at> wrote:
> On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com> wrote:
>> On 04/11/2013 03:52 PM, Ants Aasma wrote:
>>>
>>> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com>
>>> wrote:
>>>>
>>>> The proposed fix - halting all writes of data pages to disk and
>>>> to WAL files while waiting for an ACK from the standby - will
>>>> tremendously slow down all parallel work on the master.
>>>
>>> This is not what is being proposed. The proposed fix halts writes
>>> only of data pages that were modified within the window of WAL that
>>> is not yet ACKed by the slave. This means pages that were recently
>>> modified and that the clocksweep or checkpoint has decided to evict.
>>> This only affects the checkpointer, bgwriter, and backends doing
>>> allocation. Furthermore, for the backend clocksweep case it would be
>>> reasonable to just pick another buffer to evict. The slowdown in
>>> most real cases will be negligible.
>>
>> You also need to hold back all WAL writes, including the ones by
>> parallel async and locally-synced transactions, which means you
>> have to make all locally-synced transactions wait on the syncrep
>> transactions committed before them.
>> After getting the ACK from the slave you then have a backlog of
>> stuff to write locally, which then also needs to be sent to the
>> slave. Basically this turns a nice smooth WAL write-and-stream
>> pipeline into a chunky wait-and-write-and-wait-and-stream-and-wait :P
>> This may not be a problem under light write load, which is probably
>> the most common use case for postgres, but it will harm top
>> performance and also force people to get much better (and more
>> expensive) hardware than would otherwise be needed.
>
> Why would you need to hold back WAL writes? WAL is written on the
> master first and then streamed to the slave, as it is done now. You
> would only need to hold back evictions of dirty pages whose LSN is
> recent enough not to have been replicated yet. This holding back is
> already done to wait for local WAL flushes, see bufmgr.c:1976 and
> bufmgr.c:669. When a page gets dirtied its usage count gets bumped,
> so it will not be considered for eviction for at least one clocksweep
> cycle. In normal circumstances that will be enough time to get an ACK
> from the slave. When WAL is generated at a higher rate than can be
> replicated this will not hold. In that case backends that need to
> bring in new pages will have to wait for WAL to be replicated before
> they can continue. That will hopefully include the backends doing the
> dirtying, throttling the WAL generation rate. This would definitely
> be optional behavior, not something turned on by default.
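
If I follow, the eviction-time check would look roughly like the sketch
below. The helper names are invented for illustration only, not existing
backend functions; the real hook would presumably sit next to the
existing XLogFlush() wait in FlushBuffer():

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t XLogRecPtr;        /* WAL position, as in the backend */

    /* Hypothetical: latest WAL position ACKed by the sync standby,
     * maintained by the walsender in shared memory. */
    static XLogRecPtr standby_ack_lsn = 0;

    static XLogRecPtr
    GetStandbyAckLSN(void)
    {
        return standby_ack_lsn;
    }

    /* Hypothetical: sleep on a latch until the standby has ACKed
     * everything up to 'lsn'; stubbed out here. */
    static void
    WaitForStandbyAck(XLogRecPtr lsn)
    {
        (void) lsn;
    }

    /*
     * Called just before a dirty buffer is written out.  Today the
     * backend only waits for the local WAL flush; the proposal adds a
     * second wait against the standby ACK horizon.  Only the data-page
     * write is delayed; WAL keeps being written and streamed as usual.
     */
    static void
    HoldBackUntilReplicated(XLogRecPtr page_lsn, bool hold_back_enabled)
    {
        if (!hold_back_enabled)
            return;                     /* optional, off by default */

        if (page_lsn > GetStandbyAckLSN())
            WaitForStandbyAck(page_lsn);
    }
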
>
>>>
>>>> And it just turns the "master is ahead of slave" problem around
>>>> into a "slave is ahead of master" problem :)
>>>
>>> The issue is not being ahead or behind. The issue is ensuring WAL
>>> durability in the face of failovers before modifying data pages.
>>> This is sufficient to guarantee no forks in the WAL stream from the
>>> point of view of the data files, and with that the ability to always
>>> recover by replaying WAL.
>>
>> How would this handle the case Tom pointed out, namely a short
>> power-cycle of the master?
>>
>> Instead of just continuing after booting up again, the master now
>> has to figure out whether it had any slaves and then try to query
>> them (for how long?) to see whether they replayed any WAL the
>> master does not know of.
>
> If the master is restarted and there is no failover to the slave,
> nothing strange happens: the master does recovery, comes up, and
> starts streaming to the slave again. If there is a failover, then
> whatever is managing the failover needs to ensure that the master
> does not come up again on its own before it is reconfigured as a
> slave. This is what HA cluster managers do.
>
>> Suddenly the mere existence of streaming replica slaves has become
>> a problem for the master!
>>
>> This will especially complicate the case of multiple slaves, each
>> having received WAL up to a slightly different LSN. And you do want
>> to have at least 2 slaves if you want both durability and
>> availability with syncrep.
>>
>> What if one of the slaves disconnects? How should the master react
>> to this?
>
> Again, WAL replication will be the same as it is now. Availability
> considerations, including what to do when slaves go away, are the
> same as for current sync replication. The only required change is
> that we can configure the master to hold off writing any data pages
> that contain changes that might go missing in the case of a failover.
>
> Whether the additional complexity is worth the feature is a matter of
> opinion. As we have no patch yet I can't say that I know what all the
> implications are, but at first glance the complexity seems rather
> compartmentalized. This would only amend what the concept of a WAL
> flush considers safely flushed.

I really share the same view with you!
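
Just to spell out, for my own understanding, what "safely flushed" would
mean under this proposal: roughly, the flush horizon for data-page
writes becomes the minimum of the local flush position and the standby
ACK position. The getter names below are again made up for
illustration; GetFlushRecPtr() is the real function that reports the
local flush position today:

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;                /* as in the sketch above */

    extern XLogRecPtr GetLocalFlushLSN(void);   /* ~ GetFlushRecPtr() */
    extern XLogRecPtr GetStandbyAckLSN(void);   /* hypothetical, see above */

    /* A data-page write may proceed only up to this position: both the
     * local WAL flush and the standby ACK must have passed the page LSN. */
    static XLogRecPtr
    GetSafeFlushHorizon(void)
    {
        XLogRecPtr local = GetLocalFlushLSN();
        XLogRecPtr acked = GetStandbyAckLSN();

        return (local < acked) ? local : acked;
    }
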

Regards,

--
Fujii Masao
