Re: Re: Hot Standby query cancellation and Streaming Replication integration

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Hot Standby query cancellation and Streaming Replication integration
Date: 2010-03-02 01:34:53
Message-ID: 4B8C6B3D.1060107@2ndquadrant.com
Lists: pgsql-hackers

Josh Berkus wrote:
> However, this leaves aside Greg's point about snapshot age and
> successive queries; does anyone dispute his analysis? Simon?
>

There's already a note on the Hot Standby TODO about unexpectedly bad
max_standby_delay behavior being possible on an idle system, with no
suggested resolution for it besides better SR integration. The issue
Greg Stark has noted is another variation on that theme. It's already
on my list of theorized pathological, but as yet undemonstrated, concerns
that Simon and I identified, the list I'm working through by creating test
cases to prove/disprove each one. I'm past "it's possible..." talk at this
point though, so as not to spook anyone unnecessarily, and am only raising
things I can show concrete examples of in action. White box testing at
some point does require pausing one's investigation of what's in the box
and getting on with the actual testing instead.

The only real spot where my opinion diverges here is that I have yet to
find any situation where 'max_standby_delay=-1' makes any sense to me.
When I try running my test cases with that setting, the whole system
just reacts far too strangely. My first patch here is probably going to
be adding more visibility into the situation when queries are blocking
replication forever, because I think the times I find myself at "why is
the system hung right now?" are when that happens and it's not obvious
as an admin what's going on.

Also, the idea that a long running query on the standby could cause an
unbounded delay in replication is so foreign to my sensibilities that I
don't ever include it in the list of useful solutions to the problems
I'm worried about. The option is there, and I'm not disputing that it
makes sense for some people, since there seems to be some demand for it;
I just can't see how it fits into any of the use-cases I'm concerned about.

I haven't said anything about query retry mainly because I can't imagine
any way it's possible to build it in time for this release, so whether
it's eventually feasible or not doesn't enter into what I'm worried
about right now. In any case, I would prioritize that behind work on
preventing the most common situations that cause cancellations in the
first place, until those are handled so well that retry is the most
effective improvement left to consider.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us
