Re: snapshot too old issues, first around wraparound and then more.

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Kevin Grittner <kgrittn(at)gmail(dot)com>
Subject: Re: snapshot too old issues, first around wraparound and then more.
Date: 2020-04-03 19:03:09
Message-ID: 20200403190309.xltgvtkcn5bug5dx@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-04-03 14:32:09 +0530, Amit Kapila wrote:
> On Fri, Apr 3, 2020 at 6:52 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> >
> > On Thu, Apr 2, 2020 at 5:17 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > Since this is a feature that can result in wrong query results (and
> > > quite possibly crashes / data corruption), I don't think we can just
> > > leave this unfixed. But given the amount of code / infrastructure
> > > changes required to get this into a working feature, I don't see how we
> > > can unleash those changes onto the stable branches.
> >
>
> As per my initial understanding, the changes required are (a) There
> seem to be multiple places where TestForOldSnapshot is missing, (b)
> TestForOldSnapshot itself need to be reviewed carefully to see if it
> has problems, (c) Some of the members of OldSnapshotControlData like
> head_timestamp and xid_by_minute are not maintained accurately, (d)
> handling of wraparound for xids in the in-memory data-structure for
> this feature is required, (e) test infrastructure is not good enough
> to catch bugs or improve this feature.

And a bunch more correctness issues. But basically, yes.

When you say "(c) Some of the members of OldSnapshotControlData like
head_timestamp and xid_by_minute are not maintained accurately)" - note
that that's the core state for the whole feature.

With regards to test: "not good enough" is somewhat of an
understatement. Not a *single* tuple is removed in the tests due to
old_snapshot_threshold - and removing tuples is the entire point.

> Now, this sounds like a quite of work but OTOH, if we see most of the
> critical changes required will be in only a few functions like
> TransactionIdLimitedForOldSnapshots(),
> MaintainOldSnapshotTimeMapping(), TestForOldSnapshot().

I don't think that's really the case. Every place reading a buffer needs
to be inspected, and new calls added. They aren't free, and I'm not sure
all of them have the relevant snapshot available. To fix the issue of
spurious errors, we'd likely need changes outside of those, and it'd
quite possibly have performance / bloat implications.

> I don't deny the possibility that we might need much more work or we
> need to come up with quite a different design to address all these
> problems but unless Kevin or someone else doesn't come up with a
> solution to address all of these problems, we can't be sure of that.
>
> > I don't think that the feature can be allowed to remain in anything
> > like its current form. The current design is fundamentally unsound.
> >
>
> Agreed, but OTOH, not giving time to Kevin or others who might be
> interested to support this work is also not fair. I think once
> somebody comes up with patches for problems we can decide whether this
> feature can be salvaged in back-branches or we need to disable it in a
> hard-way. Now, if Kevin himself is not interested in fixing or nobody
> shows up to help, then surely we can take the decision sooner but
> giving time for a couple of weeks (or even till we are near for PG13
> release) in this case doesn't seem like a bad idea.

It'd certainly be great if somebody came up with fixes, yes. Even if we
had to disable it in the back branches, that'd allow us to keep the
feature around, at least.

The likelihood of regressions even when the feature is not on does not
seem that low. But you're right, we'll be able to better judge it with
fixes to look at.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2020-04-03 19:07:08 pgsql: Generate backup manifests for base backups, and validate them.
Previous Message Anna Akenteva 2020-04-03 18:51:13 Re: [HACKERS] make async slave to wait for lsn to be replayed