Re: Allow "snapshot too old" error, to prevent bloat

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Allow "snapshot too old" error, to prevent bloat
Date: 2015-02-19 23:39:45
Message-ID: 20150219233945.GI6717@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Kevin,

* Kevin Grittner (kgrittn(at)ymail(dot)com) wrote:
> Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > Kevin Grittner (kgrittn(at)ymail(dot)com) wrote:
> >> (1) They have a pool of connections each of which can have several
> >> long-running cursors, so the limit from that isn't just doubling
> >> the size of their database, it is limiting it to some two-or-three
> >> digit multiple of the necessary size.
> >
> > This strikes me as a bit off-the-cuff; was an analysis done which
> > deteremined that would be the result?
>
> If it sounds like off-the-cuff it is because I am summarizing the
> mountain of data which has been accumulated over weeks of
> discussions and many rounds of multi-day tests using their actual
> software driven by a software test driver that simulates users,
> with "think time" and all.

My apologies for, unintentionally, implying that you hadn't done
sufficient research or testing. That certainly was *not* my intent and
it's absolutely clear to me that you've spent quite a bit of time
analyzing this problem.

What I was trying to get at is that the approach which I outlined was,
perhaps, not closely analyzed. Now, analyzing a problem based on a
solution which doesn't actually exist isn't exactly trivial, and while
it might be possible to do in this case it'd probably be easier to
simply write the code and the review the results than perform an
independent analysis. I do feel that the solution might have been
dismissed on the basis of "doubling the database for each long-running
transaction is not acceptable" or even "having bloat because of a
long-running transaction is not acceptable," but really analyzing what
would actually happen is a much more difficult exercise.

> With the two patches I submitted, bloat stabilized at less than 5%
> except for some btree indexes which followed pattern of inserting
> at the end and deleting most (but not all) of the entries over
> time. I've been working on a separate patch for that, but it's not
> ready to present, so I probably won't post that until the first CF
> of the next release -- unless someone's interested enough to want
> to discuss it before then.

Understood. I'd encourage you to post your thoughts on it- I'm
certainly interested in it, even if it isn't something we'll be able to
really work on for 9.5.

> > That, really, is the crux here- they've already got support for dealing
> > with it the way Oracle did and they'd like PG to do that too.
> > Unfortunately, that, by itself, isn't a good reason for a particular
> > capability (we certainly aren't going to be trying to duplicate PL/SQL
> > in PG any time soon).
>
> I understand that, and if the community doesn't want this style of
> limitation on bloat we can make it part of PPAS (which *does*
> duplicate PL/SQL, among other things). We are offering it to the
> community first.

That is certainly appreciated and I'd encourage you to continue to do
so. I do feel that the community, in general, is interested in these
kinds of knobs- it's just a question of implemenation and making sure
that it's a carefully considered solution and not a knee-jerk reaction
to match what another RDBMS does.

> >>> I'm not against having a knob like this, which is defaulted to off,
> >>
> >> Thanks! I'm not sure that amounts to a +1, but at least it doesn't
> >> sound like a -1. :-)
> >
> > So, at the time I wrote that, I wasn't sure if it was a +1 or not
> > myself. I've been thinking about it since then, however, and I'm
> > leaning more towards having the capability than not, so perhaps that's a
> > +1, but it doesn't excuse the need to come up with an implementation
> > that everyone can be happy with and what you've come up with so far
> > doesn't have a lot of appeal, based on the feedback (I've only glanced
> > through it myself, but I agree with Andres and Tom that it's a larger
> > footprint than we'd want for this and the number of places having to be
> > touched is concerning as it could lead to future bugs).
>
> This conversation has gotten a little confusing because some of my
> posts seem to have been held up in the email systems for some
> reason, and are arriving out-of-order (and at least one I don't see
> yet). But I have been responding to these points. For one thing I
> stated four days ago that I now feel that the metric for
> determining that a snapshot is "old" should be *time*, not
> transactions IDs consumed, and offered to modify that patch in that
> direction if people agreed. I now see one person agreeing and
> several people arguing that I should do just that (as though they
> had not seem me propose it), so I'll try to get a new version out
> today like that. In any event, I am sure it is possible and feel
> that it would be better. (Having posted that now at least 4 times,
> hopefully people will pick up on it.)

I agree with others that having it be time-based would be better. I
did see that but it's really an independent consideration from the
footprint of the patch.

> For another thing, as mentioned earlier, this is the btree footprint:
>
> src/backend/access/nbtree/nbtinsert.c | 7 ++++---
> src/backend/access/nbtree/nbtpage.c | 2 +-
> src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
> src/include/access/nbtree.h | 7 ++++---
> 4 files changed, 43 insertions(+), 16 deletions(-)

Understood- but that's not really the issue here, it's...

> That consists mainly of two things. First, there are 13 places
> (all in nbtsearch.c) immediately following a call to
> BufferGetPage() that a line like this was added:
>
> TestForOldSnapshot(snapshot, rel, page);
>

This. That's quite a few different places which had to be updated and
while they are obvious now, it might not be obvious in the future which
BufferGetPage() calls need that follows TestForOldSnapshot() call.

> Second, some functions that were not previously passed a snapshot
> had that added to the function signature and the snapshot was
> passed from the caller where needed. C files other than nbtsearch
> were only modified to pass a NULL in to calls to the functions with
> modified signatures -- in a total of four places.

This part doesn't bother me nearly as much as having to add those tests
into a lot of different places in the code, and having to get the index
AMs involved.

> > A lot of that would go away if there was a way to avoid having to mess
> > with the index AMs, I'd think, but I wonder if we'd actually need more
> > done there- it's not immediately obvious to me how an index-only scan is
> > safe with this. Whenever an actual page is visited, we can check the
> > LSN, and the relation can't be truncated by vacuum since the transaction
> > will still have a lock on the table which prevents it, but does the
> > visibility-map update check make sure to never mark pages all-visible
> > when one of these old transactions is running around? On what basis?
>
> I will review that; it may indeed need some attention. The index
> entry is removed before the line pointer can be freed, but the
> question is whether there is a window where pruning or vacuum has
> removed the tuple and changed the page visibility map bit to say
> that everything on the page is visible to all before the index
> entry is removed. There may be a need to, as you say, suppress
> setting the all-visible flag in some particular cases.

Right, there may be a period of time where the flag is set to
all-visible and the index entry still exists.

Isn't there a concern about the index entry disappearing also though?
If there is a specific query against a particular value in the index and
the index no longer has that value then the query would return zero
results- but this could happen without visiting the heap, right?
Therefore, how do you know that the value was, or wasn't, visible at the
time that the original long-running snapshot was taken.

> >>> but I do think we'd be a lot better off with a system that could
> >>> realize when rows are not visible to any currently running transaction
> >>> and clean them up.
> >>
> >> +1
> >>
> >> But they are not mutually exclusive; I see them as complementary.
> >
> > I can see how they would be, provided we can be confident that we're
> > going to actually throw an error when the snapshot is out of date and
> > not end up returning incorrect results. We need to be darn sure of
> > that, both now and in a few years from now when many of us may have
> > forgotten about this knob.. ;)
>
> I get that this is not zero-cost to maintain, and that it might not
> be of interest to the community largely for that reason. If you
> have any ideas about how to improve that, I'm all ears. But do
> consider the actual scope of what was needed for the btree coverage
> (above). (Note that the index-only scan issue doesn't look like
> anything AM-specific; if something is needed for that it would be
> in the pruning/vacuum area.)

Maintenance is certainly one concern among many but I'm not sure that
we're quite ready to write this capability off due to the
maintainability of this particular patch- it needs to be considered
against the alternatives proposed and the risk of breaking this
particular capability in the future. I agree that the index issue is
not AM-specific but it's still a consideration to consider.

Thanks!

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2015-02-20 01:26:45 Idea: closing the loop for "pg_ctl reload"
Previous Message Adam Brightwell 2015-02-19 23:08:09 Re: CATUPDATE confusion?