Re: snapshot too old issues, first around wraparound and then more.

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Kevin Grittner <kgrittn(at)gmail(dot)com>
Subject: Re: snapshot too old issues, first around wraparound and then more.
Date: 2020-04-01 18:01:07
Message-ID: 20200401180107.ystycyhqz4rmfcgm@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-04-01 12:02:18 -0400, Robert Haas wrote:
> On Wed, Apr 1, 2020 at 11:09 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > There's really no reason at all to have bins of one minute. As it's a
> > PGC_POSTMASTER GUC, it should just have divided time into bins of
> > (old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
> > of a week there's no need to keep 10k bins, and the minimum threshold of
> > 1 minute obviously is problematic.
>
> I am very doubtful that this approach would have been adequate. It
> would mean that, with old_snapshot_threshold set to a week, the
> threshold for declaring a snapshot "old" would jump forward 16.8 hours
> at a time. It's hard for me to make a coherent argument right now as
> to exactly what problems that would create, but it's not very
> granular, and a lot of bloat-related things really benefit from more
> granularity. I also don't really see what the problem with keeping a
> bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
> = ~10k buckets, isn't it? That's not really insane for an in-memory
> data structure. I agree that the code that does that maintenance being
> buggy is a problem to whatever extent that is the case, but writing
> the code to have fewer buckets wouldn't necessarily have made it any
> less buggy.

My issue isn't really that it's too many buckets right now, but that it
doesn't scale down to smaller thresholds. I think to be able to develop
this reasonably, it'd need to be able to support thresholds in the
sub-second range. And I don't see how you can have the same binning for
such small thresholds, and for multi-day thresholds - we'd quickly go to
millions of buckets for longer thresholds.

I really think we'd need to support millisecond resolution to make this
properly testable.
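
To make that concrete, here's a minimal standalone sketch (made-up names,
not the actual snapshot code) of deriving the bucket width from the
threshold instead of hard-coding one minute - a 1 second threshold would
then get 10 ms buckets, and a longer threshold proportionally wider ones:

#include <stdint.h>

#define NUM_BUCKETS 100			/* fixed bucket count; the width scales instead */

typedef int64_t TimestampTz;	/* microseconds, as in PostgreSQL */

/* width of one bucket, derived from the configured threshold */
static inline int64_t
bucket_width_usecs(int64_t threshold_usecs)
{
	int64_t		width = threshold_usecs / NUM_BUCKETS;

	return width > 0 ? width : 1;	/* clamp for very small thresholds */
}

/* map a snapshot's whenTaken timestamp to its bucket */
static inline int64_t
bucket_index(TimestampTz when_taken, int64_t threshold_usecs)
{
	return when_taken / bucket_width_usecs(threshold_usecs);
}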

> > GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
> > basis to detect conflicts seems dangerous to me. Isn't that ignoring
> > inserts that are already in progress?

> How so?

Because it returns the end of the reserved WAL, not how far we've
actually inserted. I.e. there can be in-progress, but not finished,
modifications that will have an LSN < GetXLogInsertRecPtr(). But the
snapshot's whenTaken timestamp could be old enough that those in-progress
modifications should already trigger a "snapshot too old" error for it
(since the transaction limiting happens before the WAL logging).

I am not 100% sure, but I suspect that this could lead to errors not
being thrown that should be, because TestForOldSnapshot() will not see
these in-progress modifications as conflicting.

Hm, also, shouldn't
&& PageGetLSN(page) > (snapshot)->lsn)
in TestForOldSnapshot() be an >=?
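
(For context, a condensed paraphrase of that check - not the exact
source - looks roughly like the following; the question is whether a
page whose LSN is exactly equal to the snapshot's LSN should also count
as conflicting:)

	if (old_snapshot_threshold >= 0 &&
		snapshot != NULL &&
		!XLogRecPtrIsInvalid(snapshot->lsn) &&
		PageGetLSN(page) > snapshot->lsn)	/* ">=" per the question above? */
		TestForOldSnapshot_impl(snapshot, relation);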

> > It currently silently causes wrong query results. There's no
> > infrastructure to detect / protect against that (*).
>
> Sure, and what if you break more stuff ripping it out? Ripping this
> volume of code out in a supposedly-stable branch is totally insane
> almost no matter how broken the feature is.

For the back branches I was just thinking of forcing the GUC to be off
(either by disallowing it to be set to on, or just warning when it's set
to true, but not propagating the value).
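
For the first option, a check hook along these lines might be enough
(purely a hypothetical sketch against the GUC check-hook API, not a
proposed patch):

static bool
check_old_snapshot_threshold(int *newval, void **extra, GucSource source)
{
	if (*newval >= 0)
	{
		/* refuse to enable the feature in this branch */
		GUC_check_errmsg("\"old_snapshot_threshold\" cannot be enabled in this release");
		return false;
	}

	return true;				/* -1 (disabled) remains valid */
}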

> I have no objection to the idea that *if* the feature is hopelessly
> broken, it should be removed.

I would be a lot less inclined to go that way if old_snapshot_threshold

a) weren't explicitly about removing still-needed data - in contrast to
a lot of other features, where the effects of bugs is temporary, here
it can be much larger.
b) were a previously working feature, but as far as I can tell, it never really worked
c) had tests that verify that my fixes actually do the right thing. As
it stands, I'd not just have to fix the bugs, I'd also have to develop
a test framework that can test this

While I wish I had been more forceful, and reviewed more of the code to
point out more of the quality issues, I did argue hard against the
feature going in. On account of it being architecturally bad and
impactful. Which I think it has proven to be several times over by
now. And now I'm kind of on the hook to fix it, it seems?

> I also think, and we've had this disagreement before, that you're far
> too willing to say "well, that's wrong so we need to hit it with a
> nuke." I complained when you added those error checks to vacuum in
> back-branches, and since that release went out people are regularly
> tripping those checks and taking prolonged outages for a problem that
> wasn't making them unhappy before. I know that in theory those people
> are better off because their database was always corrupted and now
> they know. But for some of them, those prolonged outages are worse
> than the problem they had before.

I think this is somewhat revisionist. Sure, the errors were added
after like the 10th data corruption bug around freezing that we didn't
find for a long time, because of the lack of errors being thrown. But
the error checks weren't primarily added to find further bugs, but to
prevent data loss due to the fixed bug. Of which we had field reports.

I'd asked over *weeks* for reviews of the bug fixes. Not a single person
expressed concerns about throwing new errors at that time. First version
of the patches with the errors:
https://postgr.es/m/20171114030341.movhteyakqeqx5pm%40alap3.anarazel.de
I pushed them over a month later
https://postgr.es/m/20171215023059.oeyindn57oeis5um%40alap3.anarazel.de

There also wasn't (and isn't) a way to just report back that we can't
currently freeze the individual page, without doing major surgery. And
even if there were, what are we supposed to do other than throw an error?
We need to remove tuples below relfrozenxid, or we corrupt the table.

As I already asked back when you complained about those errors: What
was the alternative? Just have invisible tuples reappear? Delete them? I
don't think you've ever answered that.

You brought this up before as an example of me being over-eager with
error checks. But I don't see how that meshes with the history visible
in the thread referenced above.

The more general issue, about throwing errors, is not just about the
people that don't give a hoot about whether their data evolves on its
own (perhaps a good tradeoff for them). Not throwing errors affects
*everyone*. Some people do care about their data. Without errors we
never figure out that we screwed up. And long-term, even the people
that care much more about availability than about data loss benefit from the
whole system getting more robust.

We've since found numerous further data-corrupting bugs because of the
relfrozenxid checks. Some of very long-standing vintage. Some in newly
added code.

Yes, hypothetically, I'd argue for introducing the checks solely for the
sake of finding bugs, even if I were prescient enough to foresee the
number of issues caused (although, knowing that, I'd add block numbers
to the error message from the get-go). But I'd definitely not do so in
the back branches.

> I believe it was irresponsible to decide on behalf of our entire user
> base that they were better off with such a behavior change in a
> supposedly-stable branch, and I believe the same thing here.

As I explained above, I don't think that's fair with regard to the
relfrozenxid errors. Setting that aside:

In these discussions you always seem to only argue for the people that
don't care about their data. But, uh, a lot of people do - should we
just silently eat their data? And the long-term quality of the project
gets a lot better by throwing errors, because it actually allows us to
fix them.

As far as I can tell we couldn't even have added the checks to master,
back then, if we follow your logic: A lot of the reports about hitting
the errors were with 11+ (partially due to pg_upgrade, partially because
they detected other bugs).

The likelihood of hurting people by adding checks at a later point would
be a lot lower if we stopped adding code that silently ignores errors
and hopes for the best. But we keep adding such "lenient" code.

Btw, we just found another long-standing cause of data corruption that
should have been found earlier if we had thrown errors, or at least
warnings. The locking around vac_update_datfrozenxid() has been broken
for a long, long time, but the silent 'if (bogus) return' made it very
hard to find.
https://www.postgresql.org/message-id/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de
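
To spell the pattern out (a made-up fragment, not the actual
vac_update_datfrozenxid() code):

	/* the silent variant: nobody ever learns the value was bogus */
	if (TransactionIdPrecedes(lastSaneFrozenXid, relfrozenxid))
		return;

	/* versus at least leaving a trace that something is off */
	if (TransactionIdPrecedes(lastSaneFrozenXid, relfrozenxid))
		elog(WARNING, "relfrozenxid %u is ahead of the last known safe value %u",
			 relfrozenxid, lastSaneFrozenXid);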

Also, I've recently seen a number of databases being eaten because we
just ignore our own WAL logging rules to avoid throwing hard enough
errors (RelationTruncate() WAL logging the truncation outside of a
critical section - if you hit an error there, your primary and
replicas/backups diverge, among many other bad consequences).
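
The rule being skirted there, roughly (a simplified generic idiom, not
the actual RelationTruncate() code), is that a change and the WAL record
describing it belong inside the same critical section, so a failure in
between escalates to a PANIC instead of leaving primary and replicas
silently out of sync:

	START_CRIT_SECTION();

	/* apply the change to the buffer ... */
	MarkBufferDirty(buf);

	/* ... and WAL-log it within the same critical section */
	XLogBeginInsert();
	XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
	recptr = XLogInsert(RM_SOMETHING_ID, XLOG_SOMETHING);	/* placeholder rmgr/info */
	PageSetLSN(BufferGetPage(buf), recptr);

	END_CRIT_SECTION();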

Greetings,

Andres Freund
