Re: Allow "snapshot too old" error, to prevent bloat

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Allow "snapshot too old" error, to prevent bloat
Date: 2015-02-19 16:01:33
Message-ID: 68410939.1739882.1424361693462.JavaMail.yahoo@mail.yahoo.com
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> Kevin Grittner (kgrittn(at)ymail(dot)com) wrote:
>> Stephen Frost <sfrost(at)snowman(dot)net> wrote:

>>> In the end, with a single long-running transaction, the worst bloat you
>>> would have is double the size of the system at the time the long-running
>>> transaction started.
>>
>> I agree that limiting bloat to one dead tuple for every live one
>> for each old snapshot is a limit that has value, and it was unfair
>> of me to characterize that as not being a limit. Sorry for that.
>>
>> This possible solution was discussed with the user whose feedback
>> caused me to write this patch, but there were several reasons they
>> dismissed that as a viable total solution for them, two of which I
>> can share:
>>
>> (1) They have a pool of connections, each of which can have
>> several long-running cursors, so the limit from that isn't just
>> doubling the size of their database; it is limiting it to some
>> two-or-three-digit multiple of the necessary size.
>
> This strikes me as a bit off-the-cuff; was an analysis done which
> determined that would be the result?

If it sounds off-the-cuff, that is because I am summarizing a
mountain of data accumulated over weeks of discussions and many
rounds of multi-day tests using their actual software, driven by a
software test driver that simulates users, "think time" and all.

To be usable, their software needs to run on a particular set of
hardware with a simulated load of 3000 users. In a two-day test
without the patches I'm proposing, their actual, working production
software running against PostgreSQL bloated the database by 37%,
with the bloat still growing at a linear rate. CPU consumption also
grew at a linear rate because of the bloat, until on the second day
it saturated the CPUs and the driver was unable to get acceptable
user performance. Because of that we haven't moved on to test what
the bloat rate would be like with 3000 users, but I assume it would
be higher than with 300 users.

With the two patches I submitted, bloat stabilized at less than 5%,
except for some btree indexes which followed a pattern of inserting
at the end and deleting most (but not all) of the entries over
time. I've been working on a separate patch for that, but it's not
ready to present, so I probably won't post it until the first CF
of the next release -- unless someone's interested enough to want
to discuss it before then.

>> (2) They are already prepared to deal with "snapshot too old"
>> errors on queries that run for more than about 20 minutes and
>> which access tables that are modified. They would rather do that
>> than suffer the bloat beyond that point.
>
> That, really, is the crux here- they've already got support for dealing
> with it the way Oracle did and they'd like PG to do that too.
> Unfortunately, that, by itself, isn't a good reason for a particular
> capability (we certainly aren't going to be trying to duplicate PL/SQL
> in PG any time soon).

I understand that, and if the community doesn't want this style of
limitation on bloat we can make it part of PPAS (which *does*
duplicate PL/SQL, among other things). We are offering it to the
community first.

> That said, there are capabilities in other RDBMS's which are
> valuable and which we *do* want, so the fact that Oracle does this
> also isn't a reason to not include it.

:-)

>>> I'm not against having a knob like this, which is defaulted to off,
>>
>> Thanks! I'm not sure that amounts to a +1, but at least it doesn't
>> sound like a -1. :-)
>
> So, at the time I wrote that, I wasn't sure if it was a +1 or not
> myself. I've been thinking about it since then, however, and I'm
> leaning more towards having the capability than not, so perhaps that's a
> +1, but it doesn't excuse the need to come up with an implementation
> that everyone can be happy with and what you've come up with so far
> doesn't have a lot of appeal, based on the feedback (I've only glanced
> through it myself, but I agree with Andres and Tom that it's a larger
> footprint than we'd want for this and the number of places having to be
> touched is concerning as it could lead to future bugs).

This conversation has gotten a little confusing because some of my
posts seem to have been held up in the email systems for some
reason and are arriving out of order (and at least one I don't see
yet). But I have been responding to these points. For one thing, I
stated four days ago that I now feel the metric for determining
that a snapshot is "old" should be *time*, not transaction IDs
consumed, and offered to modify the patch in that direction if
people agreed. I now see one person agreeing and several people
arguing that I should do just that (as though they had not seen me
propose it), so I'll try to get a new version out today along those
lines. In any event, I am sure it is possible and feel that it
would be better. (Having posted that now at least 4 times,
hopefully people will pick up on it.)
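
To make that concrete, the kind of test I have in mind would look
roughly like the sketch below. Every name in it -- the GUC, the
whenTaken and lsn snapshot fields, and the function name -- is a
placeholder invented for illustration, not what any posted patch
actually uses:

#include "postgres.h"

#include "storage/bufpage.h"     /* Page, PageGetLSN() */
#include "utils/rel.h"           /* Relation */
#include "utils/snapshot.h"      /* Snapshot */
#include "utils/timestamp.h"     /* GetCurrentTimestamp(), USECS_PER_MINUTE */

/* Placeholder GUC: snapshot age limit in minutes; -1 disables the test. */
static int old_snapshot_threshold_min = 20;

/*
 * Sketch of a time-based "snapshot too old" test.  The whenTaken and
 * lsn snapshot fields are assumed additions for illustration only.
 */
static inline void
TestForOldSnapshotSketch(Snapshot snapshot, Relation rel, Page page)
{
    (void) rel;                 /* a real test might consult the relation */

    if (old_snapshot_threshold_min < 0 || snapshot == NULL)
        return;                 /* feature disabled, or no MVCC snapshot */

    /* Error only if the snapshot is older than the threshold ... */
    if (GetCurrentTimestamp() - snapshot->whenTaken >
        old_snapshot_threshold_min * USECS_PER_MINUTE &&
        /* ... and the page changed after the snapshot was taken. */
        PageGetLSN(page) > snapshot->lsn)
        elog(ERROR, "snapshot too old");
}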

For another thing, as mentioned earlier, this is the btree footprint:

src/backend/access/nbtree/nbtinsert.c | 7 ++++---
src/backend/access/nbtree/nbtpage.c | 2 +-
src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
src/include/access/nbtree.h | 7 ++++---
4 files changed, 43 insertions(+), 16 deletions(-)

That consists mainly of two things. First, there are 13 places
(all in nbtsearch.c), each immediately following a call to
BufferGetPage(), where a line like this was added:

TestForOldSnapshot(snapshot, rel, page);

Second, some functions that were not previously passed a snapshot
had a snapshot parameter added to their signatures, and the
snapshot was passed in from the caller where needed. C files other
than nbtsearch.c were modified only to pass NULL to calls to the
functions with modified signatures -- in a total of four places.

That is it.

To be sure, I initially had to spend some time figuring out which
BufferGetPage() calls needed to be followed by this test, but it's
not rocket science.
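
To illustrate the pattern, a call site as it would appear within
nbtsearch.c (which already includes the needed headers) looks
roughly like this; the enclosing function here is invented for
illustration, and only the BufferGetPage() call plus the added
TestForOldSnapshot() line reflect what the patch actually touches:

/*
 * Hypothetical example only; the function name and surrounding
 * logic are not from the patch.
 */
static void
_bt_example_readpage(Relation rel, Buffer buf, Snapshot snapshot)
{
    Page    page = BufferGetPage(buf);

    /* The one line added after each relevant BufferGetPage() call. */
    TestForOldSnapshot(snapshot, rel, page);

    /* ... existing btree page-processing code continues unchanged ... */
}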

> A lot of that would go away if there was a way to avoid having to mess
> with the index AMs, I'd think, but I wonder if we'd actually need more
> done there- it's not immediately obvious to me how an index-only scan is
> safe with this. Whenever an actual page is visited, we can check the
> LSN, and the relation can't be truncated by vacuum since the transaction
> will still have a lock on the table which prevents it, but does the
> visibility-map update check make sure to never mark pages all-visible
> when one of these old transactions is running around? On what basis?

I will review that; it may indeed need some attention. The index
entry is removed before the line pointer can be freed, but the
question is whether there is a window in which pruning or vacuum
has removed the tuple and set the page's visibility map bit
(saying that everything on the page is visible to all) before the
index entry is removed. There may be a need to, as you say,
suppress setting the all-visible flag in some particular cases.

>>> but I do think we'd be a lot better off with a system that could
>>> realize when rows are not visible to any currently running transaction
>>> and clean them up.
>>
>> +1
>>
>> But they are not mutually exclusive; I see them as complementary.
>
> I can see how they would be, provided we can be confident that we're
> going to actually throw an error when the snapshot is out of date and
> not end up returning incorrect results. We need to be darn sure of
> that, both now and in a few years from now when many of us may have
> forgotten about this knob.. ;)

I get that this is not zero-cost to maintain, and that it might not
be of interest to the community largely for that reason. If you
have any ideas about how to improve that, I'm all ears. But do
consider the actual scope of what was needed for the btree coverage
(above). (Note that the index-only scan issue doesn't look like
anything AM-specific; if something is needed for that it would be
in the pruning/vacuum area.)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
