
Hot Standby query cancellation and Streaming Replication integration

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Hot Standby query cancellation and Streaming Replication integration
Date: 2010-02-26 08:33:35
Message-ID:
Lists: pgsql-hackers
I'm happy to see we've crossed the point where the worst of the Hot 
Standby and Streaming Replication issues are sorted out.  A look at the 
to-do list shows no Must-fix items and 5 Serious Issues left for Hot 
Standby; there are 9 Streaming Replication items there, which aren't as 
clearly prioritized yet.  Correct me if I'm wrong here, but those read 
to me like tweaks and polishing rather than major architecture issues at 
this point, so I believe that code is in the same position as HS:  some 
serious issues, but no really terrible parts.

The job Simon asked me to take a look at starting last week is which of 
the listed HS "Serious Issues" might be promoted into must-fix items 
after seeing how easy they were to encounter.  There are a number of HS 
tunables that interact with one another, and depending on your 
priorities there are a few ways you can try to optimize the 
configuration for what I expect to be common use cases for this 
feature.  I've written a blog entry that tries to explain all that 
background clearly, and to relate the implementation details to how I 
expect DBAs will perceive them.  That was a bit much to also cover here, 
and it has a broader audience that might appreciate it than just this 
list.

Attached is a tar file with some test case demo scripts that demonstrate 
the worst of the problems here IMHO.  A README in there outlines how to 
set the problem demo up (presuming you've already gotten a HS pair 
going).  What this does is execute the following sequence continuously 
on the master:

UPDATE pgbench_tellers SET tbalance = tbalance + <delta>
    WHERE tid = <tid>;  (several times)
VACUUM pgbench_tellers;

Meanwhile, on the standby, the following long query runs on a few 
sessions at once, again looping constantly:

SELECT sum(abalance) FROM pgbench_accounts;

It took a bit of testing to get the database scale and iteration times 
to the point where the issue is easy to encounter on my system; I hope 
it shows up easily enough for others with the values used.  (I have a 
similar work-in-progress demo that tries to trigger the b-tree deletion 
problem too; I'll follow up once the storm of messages about this topic 
dies down, as I think this one is a prerequisite for it anyway.)

I'm not sure what you might be expecting from the above combination, but 
what actually happens is that many of the SELECT statements on the table 
*that isn't even being updated* are canceled.  You see this in the logs:

LOG:  restored log file "0000000100000000000000A5" from archive
ERROR:  canceling statement due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be 
removed.
STATEMENT:  SELECT sum(abalance) FROM pgbench_accounts;

Basically, every time a WAL segment appears that wipes out a tuple the 
SELECT expects should still be visible (because the dead row left behind 
by the update has been vacuumed away), the query is canceled.  This 
happens constantly the way I've set this up, and I don't feel this is a 
contrived demo.  Having a long-running query on the standby while things 
get updated and then periodically autovacuumed on the primary is going 
to be extremely common in the sorts of production systems I expect to 
want HS the most.
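As a simplified model of what's happening (illustrative Python, not the 
actual backend code; the function name and the one-line rule are mine): 
each cleanup record carries the newest xid whose dead rows it removed, 
and a standby query conflicts when its snapshot could still see those 
rows:

```python
# Simplified model of the Hot Standby conflict check.  A VACUUM cleanup
# record replayed on the standby carries the newest xid whose dead rows
# it removed; any running query whose snapshot might still see one of
# those rows is in conflict.  Names here are illustrative only.

def cleanup_conflicts(snapshot_xmin, latest_removed_xid):
    """A standby snapshot conflicts if the rows removed by the cleanup
    record (all deleted by xids <= latest_removed_xid) might still be
    visible to it, i.e. its xmin has not advanced past them."""
    return snapshot_xmin <= latest_removed_xid

# The long SELECT took its snapshot at xmin=100; a later VACUUM on the
# master removed rows dead as of xid 150, so replaying that record
# forces a cancel once any grace period expires:
print(cleanup_conflicts(100, 150))  # True: query gets canceled
print(cleanup_conflicts(200, 150))  # False: snapshot is newer, no conflict
```

Note that the check has nothing to do with which table the query reads, 
which is why the SELECT on pgbench_accounts is canceled by cleanup of 
pgbench_tellers.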

Now, as explained on the blog entry and in the documentation, there are 
all sorts of ways you can work around this issue by tweaking parameters 
or doing fun tricks with dblink.  You can prioritize any two of keeping 
the standby current, letting long-running queries execute on the 
standby, and keeping xid advances on the master moving forward as fast 
as possible.  But you can't get all three at once.  The choices 
available are really about the best you can do given a system that's 
basically the old warm-standby approach, improved by adding just Hot 
Standby to the mix.  Sure, you might make the conflict resolution a bit 
smarter or make the UI for setting the parameters more friendly, and 
there's already been plenty of argument and patching over all of that.  
I don't want to belittle that work because it's been important to make 
HS a useful standalone feature, but I feel like that's all triage rather 
than looking for the most robust fix possible.
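For concreteness, the knobs being discussed look like this in 
postgresql.conf (parameter names, defaults, and units are as I read the 
current 9.0 development sources and may still change before release):

```
# Standby side: how long WAL replay will wait for conflicting queries
# before canceling them.  -1 means wait forever, letting the standby
# fall arbitrarily far behind the master.
max_standby_delay = 30s

# Master side: defer vacuum cleanup by this many transactions, giving
# standby queries a head start at the cost of extra bloat on the master.
vacuum_defer_cleanup_age = 10000
```

Raising either one trades standby currency or master bloat, 
respectively, for fewer cancellations; neither eliminates the underlying 
conflict, which is the point of this message.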

If you're running a system that also is using Streaming Replication, 
there is a much better approach possible.  This idea has been floating 
around for a while and I am not taking credit for inventing it (too busy 
tonight to dig into the archives to figure out exactly when this popped 
up initially and who deserves credit for it).  I'm just pointing out 
that now is the time where it's actually possible to implement.  The HS 
TODO already includes the following action item, to resolve a serious 
issue you can run into (that itself would be great to eliminate):

"Requires keep-alives with timestamps to be added to sync rep feature"

If those keep-alives flowed in both directions, and included both 
timestamps *and* xid visibility information, the master could easily be 
configured to hold open xid snapshots needed for long-running queries on 
the standby when that was necessary.  I might be missing an 
implementation detail here, but from a high level it seems like you 
could have the walreceiver report back how far the standby has advanced, 
and have the master publish that as a bit of ProcArray xmin data.  The 
master would then never advance its cleanup horizon past the point the 
standby reports it still needs visibility to.
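A hypothetical sketch of the horizon computation that feedback would 
enable (function and variable names are mine, not from the sources): the 
master would fold each standby's reported xmin into the same oldest-xmin 
calculation it already does for local backends:

```python
# Hypothetical sketch of the proposed feedback, not actual PostgreSQL
# code: the master computes its vacuum cleanup horizon as the oldest
# xmin among local backends *and* the xmin each connected standby
# reports via keep-alives.

def cleanup_horizon(local_xmins, standby_xmins):
    """Oldest xid any query, local or on a standby, might still need.
    VACUUM may only remove rows whose deleting xid precedes this."""
    candidates = list(local_xmins) + list(standby_xmins)
    return min(candidates) if candidates else None

# Without feedback the master vacuums up to its local horizon (200),
# dooming the standby query still holding a snapshot at xmin 150; with
# feedback the horizon holds back at 150 and no cancellation is needed:
print(cleanup_horizon([200, 250], []))     # 200
print(cleanup_horizon([200, 250], [150]))  # 150
```

This is exactly the behavior a long-running query already gets on the 
primary, just extended across the replication connection.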

This is a much more elegant solution than any of the hacks available so 
far.  It would turn Hot Standby + Streaming Replication into a system 
that stepped out of the way of the worst of the technical limitations of 
HS alone.  The master would easily and automatically avoid advancing 
past where the queries running on the standby needed visibility back to, 
essentially the same way cleanup is blocked during a long-running query 
on the primary--except with the actual main query work offloaded to the 
standby, the idea all along.

I don't know how difficult the keepalive feature was expected to be, and 
there are certainly plenty of potential landmines in this whole xid 
export idea.  One concern is how to handle situations where the standby 
goes away for a while, such as during a network outage, so that it 
doesn't indefinitely block the master from cleaning up dead tuples.  I 
wouldn't expect that to be too serious a blocker, given that if the 
standby isn't talking it probably isn't running queries you need to 
worry about canceling either.  Not sure where else this can fall down, 
and unfortunately I 
don't know nearly enough about the SR code to help myself with 
implementing this feature.  (I think Simon is in a similar 
position--it's just not what we've been staring at the last few months).

But I do know that the current Hot Standby implementation is going to be 
frustrating to configure correctly for people.  If it's possible to make 
most of that go away with some final integration between it and 
Streaming Replication that simply wasn't practical to accomplish until 
now, I think it's worth considering how to make that happen before the 
final 9.0 release.

I really hope this discussion can stay focused on whether and how it's 
possible to improve this area, with the goal being to deliver a product 
everyone can be proud of with the full feature set that makes this next 
release a killer one. The features that have managed to all get into 
this release already are fantastic, everyone who contributed should be 
proud of that progress, and it's encouraging that the alpha4 date was 
nailed.  It would be easy to descend into finger-pointing for why 
exactly this particular problem is only getting more visibility now, or 
into schedule-oriented commentary suggesting it must be ignored because 
it's too late to do anything about it.  I hope everyone appreciates that 
wandering that way will not help make PostgreSQL 9.0 a better release.  
This issue is so easy to encounter, and looks so bad when it happens, 
that I feel it could easily lead to an embarrassing situation for the 
community if something isn't done about it before release.

Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support

Attachment: hs-demo.tar.gz
Description: application/x-gzip (1.5 KB)

