Re: PROC_IN_ANALYZE stillborn 13 years ago

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: PROC_IN_ANALYZE stillborn 13 years ago
Date: 2020-08-06 20:22:23
Message-ID: CA+TgmoZ9hycF=fu0V+812fOvTsje5bCV6vMdDWphSANOiv-vqw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 6, 2020 at 3:11 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> (1) Without a snapshot it's hard to make any non-bogus decisions about
> which tuples are live and which are dead. Admittedly, with Simon's
> proposal the final totals would be spongy anyhow, but at least the
> individual decisions produce meaningful answers.

I don't think I believe this. It's impossible to make *consistent*
decisions, but it's not difficult to make *non-bogus* decisions.
HeapTupleSatisfiesVacuum() and HeapTupleSatisfiesUpdate() both make
such decisions, and neither takes a snapshot argument.

> (2) I'm pretty sure there are places in the system that assume that any
> reader of a table is using an MVCC snapshot. For instance, didn't you
> introduce some such assumptions along with or just after getting rid of
> SnapshotNow for catalog scans?

SnapshotSelf still exists and is still used, and IIRC, it has very
similar semantics to the old SnapshotNow, so I don't think that we
introduced any really general assumptions of this sort. I think the
important part of those changes was that all the code that had
previously used SnapshotNow to examine system catalog tuples for DDL
purposes and catcache lookups and so forth started using an MVCC scan,
which removed one (of many) impediments to concurrent DDL. I think the
fact that we removed SnapshotNow outright rather than just ceasing to
use it for that purpose was mostly so that nobody would accidentally
reintroduce code that used it for the sorts of purposes for which it
had been used previously, and secondarily for code cleanliness.
There's nothing wrong with it fundamentally AFAIK.

It's worth mentioning, I think, that the main problem with SnapshotNow
was that it provided no particular stability. If you did an index scan
under SnapshotNow you might find two copies or no copies of a row
being concurrently updated, rather than exactly one. And that in turn
could cause problems like failure to build a relcache entry. Now, how
important is stability to ANALYZE? If you *either* retake your MVCC
snapshots periodically as you re-scan the table *or* use a non-MVCC
snapshot for the scan, you can get those same kinds of artifacts: you
might see two copies of a just-updated row, or none. Maybe this would
actually *break* something - e.g. could there be code that would get
confused if we sample multiple rows for the same value in a column
that has a UNIQUE index? But I think mostly the consequences would be
that you might get somewhat different results from the statistics.
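
To make the "two copies or none" artifact concrete, here is a toy
sketch. It is not PostgreSQL code: TupleVersion, visible() and scan()
are hypothetical names, transaction commit is modeled as set
membership, and the scan reads one slot at a time. It only shows how
the timing of a concurrent UPDATE's commit, relative to the scan
position, changes what a scan without a stable snapshot sees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TupleVersion:
    key: str
    xmin: int                   # id of the inserting transaction
    xmax: Optional[int] = None  # id of the updating/deleting transaction, if any


def visible(tup, committed):
    """A version is visible if its inserter is in `committed` and its
    deleter (if any) is not."""
    if tup.xmin not in committed:
        return False
    return tup.xmax is None or tup.xmax not in committed


def scan(heap, committed, commit_at, xid, stable_snapshot):
    """Scan `heap` slot by slot. Transaction `xid` commits just before
    slot `commit_at` is read. With stable_snapshot=True, visibility is
    judged against the set of commits frozen at scan start (MVCC-like);
    otherwise against whatever has committed so far (SnapshotNow-like)."""
    frozen = frozenset(committed)
    current = set(committed)
    seen = []
    for slot, tup in enumerate(heap):
        if slot == commit_at:
            current.add(xid)
        view = frozen if stable_snapshot else current
        if visible(tup, view):
            seen.append(tup.key)
    return seen


# Transaction 2 updates a row originally inserted by transaction 1.
# Case A: the new version lands in a later slot than the old one.
new_after_old = [TupleVersion("row", xmin=1, xmax=2), TupleVersion("row", xmin=2)]
# Case B: the new version lands in an earlier slot than the old one.
new_before_old = [TupleVersion("row", xmin=2), TupleVersion("row", xmin=1, xmax=2)]

two_copies = scan(new_after_old, {1}, commit_at=1, xid=2, stable_snapshot=False)
no_copies = scan(new_before_old, {1}, commit_at=1, xid=2, stable_snapshot=False)
one_copy_a = scan(new_after_old, {1}, commit_at=1, xid=2, stable_snapshot=True)
one_copy_b = scan(new_before_old, {1}, commit_at=1, xid=2, stable_snapshot=True)

print(len(two_copies), len(no_copies), len(one_copy_a), len(one_copy_b))
```

In this model the unstable scan returns the row twice in case A and not
at all in case B, while the stable snapshot returns it exactly once in
both cases. Retaking an MVCC snapshot partway through the scan would
behave like the unstable case at the point where the snapshot changes.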

It's not clear to me that it would even be correct to categorize those
somewhat-different results as "less accurate." Tuples that are
invisible to a query often have performance consequences very similar
to visible tuples, in terms of the query run time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
