Re: snapshot too old issues, first around wraparound and then more.

From: Kevin Grittner <kgrittn(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: snapshot too old issues, first around wraparound and then more.
Date: 2020-04-01 19:10:09
Message-ID: CACjxUsMe9kXpNN7UEzSHPRN0xAKXxfaLLihuVDqQXwOmDqOvdA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 1, 2020 at 10:09 AM Andres Freund <andres(at)anarazel(dot)de> wrote:

First off, many thanks to Andres for investigating this, and apologies for
the bugs. Also thanks to Michael for making sure I saw the thread. I must
also apologize that for not being able to track the community lists
consistently due to health issues that are exacerbated by stress, and the
fact that these lists often push past my current limits. I'll try to help
in this as best I can.

Do we actually have any evidence of this feature ever beeing used? I
> didn't find much evidence for that in the archives (except Thomas
> finding a problem).

This was added because a very large company trying to convert from Oracle
had a test that started to show some slowdown on PostgreSQL after 8 hours,
serious slowdown by 24 hours, and crashed hard before it could get to 48
hours -- due to lingering WITH HOLD cursors left by ODBC code. They had
millions of lines of code that would need to be rewritten without this
feature. With this feature (set to 20 minutes, if I recall correctly),
their unmodified code ran successfully for at least three months solid
without failure or corruption. Last I heard, they were converting a large
number of instances from Oracle to PostgreSQL, and those would all fail
hard within days of running with this feature removed or disabled.

Also, VMware is using PostgreSQL as an embedded part of many products, and
this feature was enabled to deal with similar failures due to ODBC cursors;
so the number of instances running 24/7 under high load which have shown a
clear benefit from enabling this feature has a lot of zeros.

Perhaps the lack of evidence for usage in the archives indicates a low
frequency of real-world failures due to the feature, rather than lack of
use? I'm not doubting that Andres found real issues that should be fixed,
but perhaps not very many people who are using the feature have more than
two billion transactions within the time threshold, and perhaps the other
problems are not as big as the problems solved by use of the feature -- at
least in some cases.

To save readers who have not yet done the math some effort, at the 20
minute threshold used by the initial user, they would need to have a
sustained rate of consumption of transaction IDs of over 66 million per
second to experience wraparound problems, and at the longest threshold I
have seen it would need to exceed an average of 461,893 TPS for three days
solid to hit wraparound. Those aren't impossible rates to hit, but in
practice it might not be a frequent occurrence yet on modern hardware with
some real-world applications. Hopefully we can find a way to fix this
before those rates become common.

I am reviewing the issue and patches now, and hope I can make some useful
contribution to the discussion.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-04-01 19:11:15 Re: error context for vacuum to include block number
Previous Message Robert Haas 2020-04-01 19:02:28 Re: snapshot too old issues, first around wraparound and then more.