From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Hot Standby query cancellation and Streaming Replication integration
Date: 2010-02-26 08:33:35
Message-ID: 4B87875F.2010601@2ndquadrant.com
Lists: pgsql-hackers

I'm happy to see we've crossed the point where the worst of the Hot
Standby and Streaming Replication issues are sorted out. A look at the
to-do lists at http://wiki.postgresql.org/wiki/Hot_Standby_TODO and
http://wiki.postgresql.org/wiki/Streaming_Replication shows no Must-fix
items left and 5 Serious Issues remaining for Hot Standby; there are 9
Streaming Replication items there, which aren't as clearly prioritized
yet. Correct me if I'm wrong here, but those read to me like tweaks and
polishing rather than major architecture issues at this point, so I
believe that code is in the same position as HS: some serious issues,
but no really terrible parts.

The job Simon asked me to take on starting last week is to figure out
which of the listed HS "Serious Issues" should be promoted to must-fix
items, based on how easy they are to encounter. There are a number of
HS tunables that interact with one another, and depending on your
priorities there are a few ways you can try to optimize the
configuration for what I expect to be common use cases for this
feature. I've written a blog entry at
http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html
that tries to explain all that background clearly and to relate the
implementation details to how I expect DBAs will perceive them. That
was a bit much to also cover here, and it has a broader potential
audience than just this list.

Attached is a tar file with some test case demo scripts that
demonstrate what is, IMHO, the worst of the problems here. A README in
there outlines how to set the demo up (presuming you've already got an
HS pair going). What it does is execute the following sequence
continuously on the master:

UPDATE pgbench_tellers SET tbalance = tbalance + <delta>
    WHERE tid = <tid>;    (repeated several times)
VACUUM pgbench_tellers;

Meanwhile, on the standby, the following long-running query executes in
a few sessions at once, again looping constantly:

SELECT sum(abalance) FROM pgbench_accounts;

It took a bit of testing to find a database scale and iteration timing
that make the issue easy to encounter on my system; I hope it shows up
readily enough for others with the values used. (I have a similar
work-in-progress demo that tries to trigger the b-tree deletion problem
too; I'll follow up on that once the storm of messages about this topic
dies down, as I think this demo is a prerequisite for it anyway.)

I'm not sure what you might be expecting from the above combination, but
what actually happens is that many of the SELECT statements on the table
*that isn't even being updated* are canceled. You see this in the logs:

LOG: restored log file "0000000100000000000000A5" from archive
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be
removed.
STATEMENT: SELECT sum(abalance) FROM pgbench_accounts;

Basically, every time WAL arrives that removes a row version the
running SELECT's snapshot might still need to see, because the dead row
left behind by an update has since been vacuumed away on the master,
the query is canceled. This happens constantly the way I've set this
up, and I don't feel this is a contrived demo. Having a long-running
query on the standby while things get updated and then periodically
autovacuumed on the primary is going to be extremely common in the
sorts of production systems I expect to want HS the most.

Now, as explained in the blog entry and in the documentation, there are
all sorts of ways you can work around this issue by tweaking parameters
or doing fun tricks with dblink (the main knobs involved are shown
below). You can prioritize any two of: keeping the standby current,
letting long-running queries execute on the standby, and keeping the
cleanup horizon on the master advancing as fast as possible. But you
can't get all three at once. The choices available are really the best
you can do given a system that's basically the old warm-standby
approach, improved by adding just Hot Standby to the mix. Sure, you
might make the conflict resolution a bit smarter or make the UI for
setting the parameters more friendly, and there's already been plenty
of argument and patching over all of that. I don't want to belittle
that work, because it's been important to make HS a useful standalone
feature, but I feel like it's all triage rather than a search for the
most robust fix possible.
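
For reference, here is roughly what turning those knobs looks like
today. Parameter names and semantics are as I understand the current
9.0 development tree and could still change; the values are purely
illustrative, not recommendations:

# On the standby, in postgresql.conf: how long recovery will wait for
# conflicting queries to finish before canceling them.  Raising it (or
# setting -1 to wait forever) protects queries at the cost of letting
# the standby fall behind on applying WAL.
max_standby_delay = 30s

# On the master, in postgresql.conf: defer cleanup of dead rows by this
# many transactions, making standby conflicts less likely at the cost
# of extra bloat on the master.
vacuum_defer_cleanup_age = 10000

Neither knob eliminates the conflict; each just decides which of the
three goals above loses.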

If you're running a system that is also using Streaming Replication,
there is a much better approach possible. This idea has been floating
around for a while and I am not taking credit for inventing it (too
busy tonight to dig into the archives to figure out exactly when it
first popped up and who deserves credit for it). I'm just pointing out
that now is the time when it's actually possible to implement. The HS
TODO already includes the following action item, to resolve a serious
issue you can run into (one that itself would be great to eliminate):

"Requires keep-alives with timestamps to be added to sync rep feature"

If those keep-alives flowed in both directions, and included both
timestamps *and* xid visibility information, the master could easily be
made to hold back cleanup of the rows still needed by long-running
queries on the standby whenever that was necessary. I might be missing
an implementation detail here, but from a high level it seems like the
walsender on the master could publish the oldest xmin the standby
reports it still needs as a bit of ProcArray data. The master's
cleanup would then only advance past a given point once the standby
says it no longer needs visibility behind it.
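
To make that concrete, here is a rough sketch of the master-side half.
To be clear, this is only my guess at its shape, not code from any
existing patch: PublishStandbyXmin is a made-up name, it assumes the
keepalive reply carries the oldest xmin still needed by queries on the
standby, and a real version would also have to worry about xid epoch
wraparound. The core of it is simply publishing that xmin where
GetOldestXmin() will see it, so VACUUM's cleanup horizon stops short of
what the standby still needs:

#include "postgres.h"
#include "access/transam.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procarray.h"

/*
 * Hypothetical: called by the walsender when a keepalive reply arrives
 * carrying the oldest xmin the standby's queries still require.
 * Publishing it in this process's ProcArray entry means GetOldestXmin(),
 * and therefore VACUUM and HOT pruning, will not remove rows the standby
 * can still see.  A real implementation would also need to drop this
 * xmin if the standby stops reporting (network outage, standby gone), so
 * a dead standby can't bloat the master forever.
 */
static void
PublishStandbyXmin(TransactionId standbyXmin)
{
    if (!TransactionIdIsValid(standbyXmin))
        return;                 /* nothing useful reported */

    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    MyProc->xmin = standbyXmin; /* visible to GetOldestXmin() from now on */
    LWLockRelease(ProcArrayLock);
}

The standby side is just the mirror image: its walreceiver includes the
oldest xmin among its running queries in each keepalive reply it sends
back.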

This is a much more elegant solution than any of the hacks available so
far. It would turn Hot Standby + Streaming Replication into a system
that steps around the worst of the technical limitations of HS alone.
The master would easily and automatically avoid cleaning up past where
the queries running on the standby still need visibility, essentially
the same way cleanup is held back during a long-running query on the
primary--except with the actual query work offloaded to the standby,
which was the idea all along.

I don't know how difficult the keepalive feature was expected to be,
and there are certainly plenty of potential landmines in this whole xid
export idea. One concern is how to handle a standby that goes away for
a while, such as during a network outage, so that it doesn't block the
master from ever cleaning up dead tuples. I wouldn't expect that to be
too serious a blocker, given that if the standby isn't talking it
probably isn't running queries you need to worry about canceling
either. I'm not sure where else this can fall down, and unfortunately
I don't know nearly enough about the SR code to help with implementing
this feature myself. (I think Simon is in a similar position--it's
just not what we've been staring at for the last few months.)

But I do know that the current Hot Standby implementation is going to
be frustrating for people to configure correctly. If most of that can
be made to go away with some final integration between it and Streaming
Replication--integration that simply wasn't practical until now--I
think it's worth considering how to make that happen before the final
9.0 release.

I really hope this discussion can stay focused on whether and how it's
possible to improve this area, with the goal of delivering a product
everyone can be proud of, with the full feature set that makes this
next release a killer one. The features that have already made it into
this release are fantastic, everyone who contributed should be proud of
that progress, and it's encouraging that the alpha4 date was hit. It
would be easy to descend into finger-pointing over why exactly this
particular problem is only getting wider visibility now, or into
schedule-oriented commentary suggesting it must be ignored because it's
too late to do anything about it. I hope everyone appreciates that
wandering that way will not help make PostgreSQL 9.0 a better release.
This issue is so easy to encounter, and looks so bad when it happens,
that I feel it could easily lead to an embarrassing situation for the
community if something isn't done about it before release.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us

Attachment Content-Type Size
hs-demo.tar.gz application/x-gzip 1.5 KB
