Re: Patch for fail-back without fresh backup

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Samrat Revagade <revagade(dot)samrat(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch for fail-back without fresh backup
Date: 2013-06-17 08:03:01
Message-ID: CABOikdNnfgy-1Px8=_AeWjwnP+rSQ+fYNJYVn_iqKkXZF_EiOA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>
>
> My perspective is that if the master crashed, assuming that you know
> everything about that and suddenly jumping back on seems like a recipe
> for disaster. Attempting that is currently blocked by the technical
> obstacles you've identified, but that doesn't mean they are the only
> ones - we don't yet understand what all the problems lurking might be.
> Personally, I won't be following you onto that minefield anytime soon.
>
>
Would it be fair to say that a user will be willing to trust her crashed
master in all scenarios where she would have done so in a single-instance
setup? IOW, without a replication setup, AFAIU users have traditionally
trusted WAL recovery to bring a failed instance back. That would cover
common failures such as power outages and hardware failures, but may not
cover others such as on-disk corruption.

> So I strongly object to calling this patch anything to do with
> "failback safe". You simply don't have enough data to make such a bold
> claim. (Which is why we call it synchronous replication and not "zero
> data loss", for example).
>
>
I agree. We should probably find a better name for this. Any suggestions?

> But that's not the whole story. I can see some utility in a patch that
> makes all WAL transfer synchronous, rather than just commits. Some
> name like synchronous_transfer might be appropriate. e.g.
> synchronous_transfer = all | commit (default).
>
>
It's an interesting idea, but I think there is a difference here. The
proposed feature makes a backend wait at points other than commit, but not
at commit itself. Since commits are foreground in nature, and this feature
does not require waiting during such common foreground activities, we want
a configuration where the master waits for synchronous transfer everywhere
except at commit. Maybe we can solve that by giving the said parameter more
granular values?
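
To be concrete, something along these lines, where every value other than
Simon's original two is purely hypothetical:

    synchronous_transfer = commit        # wait at commit only (default,
                                         # i.e. today's sync replication)
    synchronous_transfer = data_flush    # wait before writing data pages,
                                         # but not at commit
    synchronous_transfer = all           # wait at both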

> The idea of another slew of parameters that are very similar to
> synchronous replication but yet somehow different seems weird. I can't
> see a reason why we'd want a second lot of parameters. Why not just
> use the existing ones for sync rep? (I'm surprised the Parameter
> Police haven't visited you in the night...) Sure, we might want to
> expand the design for how we specify multi-node sync rep, but that is
> a different patch.
>

How would we then distinguish between a synchronous standby and the new
kind of standby? I am told one of the most popular DR setups is one local
sync standby plus one async standby (possibly cascaded from the local sync
one). Since this new feature is most useful for DR, where taking a fresh
backup over a slower link is even more challenging, IMHO we should support
such setups.
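
To illustrate with today's parameters, such a setup is configured on the
master with something like the following (the standby name is made up; the
remote DR standby is simply not listed, so it stays asynchronous):

    synchronous_standby_names = 'local_sync'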

>
> I'm worried to see that adding this feature and yet turning it off
> causes a measurable drop in performance. I don't think we want that
> at all. That clearly needs more work and thought.
>
>
I agree. We need to repeat those tests. I don't trust that merely building
the feature in, with it turned off, causes a 1-2% drop. In fact, in one of
the tests, turning the feature on shows better numbers than having it
turned off. That is clearly noise, or else we need a concrete argument to
explain why it could be real.

> I also think your performance results are somewhat bogus. Fast
> transaction workloads were already mostly commit waits -

But not in the case of an async standby, right?

> measurements
> of what happens to large loads, index builds etc would likely reveal
> something quite different.
>
>
I agree. I also feel we need tests where FlushBuffer gets called more often
by normal backends, i.e. workloads whose working set does not fit in
shared_buffers so that backends must evict and write out dirty pages
themselves, to see how much the added wait in that code path hurts
performance. Another important thing to test would be how it behaves over
slower/high-latency links.
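
For context, the code path in question looks roughly like this; a
much-simplified sketch of bufmgr.c's FlushBuffer, where WaitForLSN is the
wait the patch would add (its name, placement and signature here are my
assumptions):

    static void
    FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
    {
        XLogRecPtr  recptr;

        /* WAL must reach disk before the data page does (WAL-before-data) */
        recptr = BufferGetLSN(buf);
        XLogFlush(recptr);

        /*
         * Hypothetical wait: don't write the page until the standby has
         * received WAL up to recptr. This is the extra blocking whose
         * cost we need to measure.
         */
        WaitForLSN(recptr);

        /* ... write the page out via smgrwrite() as today ... */
    }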

> I'm tempted by the thought that we should put the WaitForLSN inside
> XLogFlush, rather than scatter additional calls everywhere and then
> have us inevitably miss one.
>
>
That indeed seems cleaner.
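
Something along these lines, I suppose (just a sketch; the guard variable
and WaitForLSN's signature are assumptions):

    void
    XLogFlush(XLogRecPtr record)
    {
        /* ... existing logic: flush local WAL up to 'record' ... */

        /*
         * Do the standby wait here, once, so that every caller of
         * XLogFlush (commit, FlushBuffer, etc.) gets it automatically
         * and no call site can be missed.
         */
        if (synchronous_transfer == SYNCHRONOUS_TRANSFER_ALL)
            WaitForLSN(record);
    }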

Thanks,
Pavan
