Re: Inconsistent DB data in Streaming Replication

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Sameer Thakur <samthakur74(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, sthomas(at)optionshouse(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Samrat Revagade <revagade(dot)samrat(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: Inconsistent DB data in Streaming Replication
Date: 2013-04-12 05:48:01
Message-ID: CABOikdPvCfbdkd+jexwgqUMyKO=aquXkTy=b2pJuEiyjwY-gxw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 11, 2013 at 8:39 PM, Ants Aasma <ants(at)cybertec(dot)at> wrote:

> On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com>
> wrote:
> > On 04/11/2013 03:52 PM, Ants Aasma wrote:
> >>
> >> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu(at)2ndquadrant(dot)com>
> >> wrote:
> >>>
> >>> The proposed fix - halting all writes of data pages to disk and
> >> to WAL files while waiting for an ACK from the standby - will tremendously
> >> slow down all parallel work on the master.
> >>
> >> This is not what is being proposed. The proposed fix halts writes of
> >> only data pages that are modified within the window of WAL that is not
> >> yet ACKed by the slave. This means pages that were recently modified
> >> and that the clocksweep or checkpoint has decided to evict. This
> >> only affects the checkpointer, bgwriter and backends doing allocation.
> >> Furthermore, for the backend clocksweep case it would be reasonable to
> >> just pick another buffer to evict. The slowdown for most actual cases
> >> will be negligible.
> >
> > You also need to hold back all WAL writes, including the ones by
> > parallel async and locally-synced transactions. Which means that
> > you have to make all locally synced transactions wait on the
> > syncrep transactions committed before them.
> > After getting the ACK from slave you then have a backlog of stuff
> > to write locally, which then also needs to be sent to slave. Basically
> > this turns a nice smooth WAL write-and-stream pipeline into a
> > chunky wait-and-write-and-wait-and-stream-and-wait :P
> > This may not be a problem in light write load cases, which is
> > probably the most common use case for postgres, but it
> > will harm top performance and also force people to get much
> > better (and more expensive) hardware than would otherwise
> > be needed.
>
> Why would you need to hold back WAL writes? WAL is written on master
> first and then streamed to the slave as it is done now. You would only need
> to hold back dirty page evictions having a recent enough LSN to not yet
> be replicated. This holding back is already done to wait for local WAL
> flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
> its usage count gets bumped, so it will not be considered for
> eviction for at least one clocksweep cycle. In normal circumstances
> that will be enough time to get an ACK from the slave. When WAL is
> generated at a higher rate than can be replicated this will not be
> true. In that case backends that need to bring in new pages will have
> to wait for WAL to be replicated before they can continue. That will
> hopefully include the backends that are doing the dirtying, throttling
> the WAL generation rate. This would definitely be optional behavior,
> not something turned on by default.
>
>
I agree. I don't think the proposed change would cause much of a performance
bottleneck, since the proposal is to hold back writing of dirty pages until
the WAL is replicated successfully to the standby. The heap pages are mostly
written by the background processes, often much later than the WAL for the
change is written, so in all likelihood there will be no wait involved. Of
course, this will not be true for very frequently updated pages that must be
written at a checkpoint.
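
To illustrate where that wait would sit, here is a rough sketch, simplified
from what FlushBuffer() in bufmgr.c does today. The GUC
standby_write_barrier_enabled and the helper WaitForStandbyAck() do not
exist; they are only placeholders for whatever mechanism ends up tracking
the standby's ACKed position.

#include "postgres.h"

#include "access/xlog.h"
#include "storage/bufpage.h"
#include "storage/smgr.h"

/* Placeholders -- neither of these exists today. */
extern bool standby_write_barrier_enabled;      /* hypothetical GUC */
extern void WaitForStandbyAck(XLogRecPtr lsn);  /* hypothetical helper */

/*
 * Simplified from the tail end of FlushBuffer(): the page LSN already
 * forces a local WAL flush before the data page is written out; the
 * proposal only adds one more (optional) wait after that flush.
 */
static void
FlushBufferSketch(Page page, SMgrRelation reln,
                  ForkNumber forkNum, BlockNumber blockNum)
{
    XLogRecPtr  recptr = PageGetLSN(page);

    /* Existing rule: WAL must be flushed locally before the data page. */
    XLogFlush(recptr);

    /* Proposed rule: WAL up to recptr must also be ACKed by the standby. */
    if (standby_write_barrier_enabled)
        WaitForStandbyAck(recptr);

    smgrwrite(reln, forkNum, blockNum, (char *) page, false);
}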

But I wonder if the problem is really limited to the heap pages? Even for
something like a CLOG page, we will need to ensure that the WAL records are
replayed before the page is written to disk. The same is true for relation
truncation. In fact, all places where the master needs to call XLogFlush()
probably need to be examined to decide whether the subsequent action could
leave the database corrupt, and to ensure that the WAL is replicated before
proceeding with the change.
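
To keep that audit manageable, the call sites in question could perhaps all
go through one common wrapper, so that each of them gets the same treatment.
Again, the GUC and the wait helper below are the same placeholders as in the
sketch above, not existing functions.

#include "postgres.h"

#include "access/xlog.h"

extern bool standby_write_barrier_enabled;      /* hypothetical GUC */
extern void WaitForStandbyAck(XLogRecPtr lsn);  /* hypothetical helper */

/*
 * Sketch of a common wrapper for the call sites in question (heap page
 * writes, SLRU/CLOG page writes, relation truncation, ...).  Each site
 * would call this instead of a bare XLogFlush() before performing the
 * local action that cannot be taken back.
 */
void
XLogFlushAndWaitForStandby(XLogRecPtr lsn)
{
    /* Unchanged behaviour: make the WAL durable locally. */
    XLogFlush(lsn);

    /* New, optional behaviour: wait until the standby has it as well. */
    if (standby_write_barrier_enabled)
        WaitForStandbyAck(lsn);
}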

Tom has a very valid concern from the additional code complexity point of
view, though I disagree that it's always a good idea to start with a fresh
rsync. If we can avoid that with the right checks, I don't see why we should
not reduce the downtime for the master. It's very likely that the standby is
not as good a server as the master, and the user would want to switch back
to the master quickly for performance reasons. To reduce complexity, can we
do this as some sort of plugin for XLogFlush() which gets to know that
XLogFlush has been done up to the given LSN and the event that caused the
function to be called? We can then leave the handling of the event to the
implementer. This will also avoid any penalty for those who are happy with
the current mechanism and do not want any complex HA setups.
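
Something along these lines is what I have in mind, loosely modelled on the
existing hook convention (ExecutorRun_hook and friends). None of the names
below exist in the tree; they are only meant to show the shape of the
interface.

#include "postgres.h"

#include "access/xlog.h"

/* Hypothetical: why XLogFlush() was called. */
typedef enum XLogFlushEvent
{
    XLOGFLUSH_COMMIT,           /* transaction commit */
    XLOGFLUSH_BUFFER_WRITE,     /* dirty heap page about to be written */
    XLOGFLUSH_SLRU_WRITE,       /* CLOG/SLRU page about to be written */
    XLOGFLUSH_TRUNCATE          /* relation truncation */
} XLogFlushEvent;

/* Hypothetical hook, installed by a module that wants to wait. */
typedef void (*XLogFlushed_hook_type) (XLogRecPtr flushedUpto,
                                       XLogFlushEvent event);

XLogFlushed_hook_type XLogFlushed_hook = NULL;

/*
 * XLogFlush() (or a thin wrapper around it) would call this once the WAL
 * is known to be flushed locally.  A module that wants "wait for the
 * standby" behaviour installs the hook and blocks in it; everyone else
 * only pays for a NULL test.
 */
static void
CallXLogFlushedHook(XLogRecPtr flushedUpto, XLogFlushEvent event)
{
    if (XLogFlushed_hook != NULL)
        XLogFlushed_hook(flushedUpto, event);
}

With something like that in place, the wait-for-standby logic and its
configuration could live entirely outside core.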

Thanks,
Pavan

http://www.linkedin.com/in/pavandeolasee
