Re: PROC_IN_ANALYZE stillborn 13 years ago

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: PROC_IN_ANALYZE stillborn 13 years ago
Date: 2020-08-07 04:54:19
Message-ID: CANP8+j+BNs35nFx-vyEuc_6E3kHn7JFm=o1aqaUtbZi_JjuT2g@mail.gmail.com
Lists: pgsql-hackers

On Thu, 6 Aug 2020 at 22:35, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > ... how
> > important is stability to ANALYZE? If you *either* retake your MVCC
> > snapshots periodically as you re-scan the table *or* use a non-MVCC
> > snapshot for the scan, you can get those same kinds of artifacts: you
> > might see two copies of a just-updated row, or none. Maybe this would
> > actually *break* something - e.g. could there be code that would get
> > confused if we sample multiple rows for the same value in a column
> > that has a UNIQUE index? But I think mostly the consequences would be
> > that you might get somewhat different results from the statistics.
>
> Yeah, that's an excellent point. I can imagine somebody complaining
> "this query clearly matches a unique index, why is the planner estimating
> multiple rows out?". But most of the time it wouldn't matter much.
> (And I think you can get cases like that anyway today.)
>
> > It's not clear to me that it would even be correct to categorize those
> > somewhat-different results as "less accurate."
>
> Estimating two rows where the correct answer is one row is clearly
> "less accurate". But I suspect you'd have to be quite unlucky to
> get such a result in practice from Simon's proposal, as long as we
> weren't super-aggressive about changing ANALYZE's snapshot a lot.
>

Seems like we're agreed that we can use more than one snapshot; the only
question remaining is "how many?"

The more snapshots you take, the more weirdness you will see, so a
one-snapshot-per-row approach seems like the worst case for accuracy, even if
it does make ANALYZE faster.

(If we do want to speed up ANALYZE, we should use the SYSTEM block-sampling
approach, but the argument against it is reduced independence between sampled
rows; the toy simulation below illustrates the tradeoff.)
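
To make that tradeoff concrete, here is a toy standalone C program, not
PostgreSQL code: the table layout, constants, and uniform rows-per-block
assumption are all invented for illustration. It shows why block sampling is
faster but less independent: to gather the same number of sample rows,
row-level sampling touches nearly every block of the table, while block-level
sampling reads far fewer blocks at the cost of taking correlated groups of
rows from the same block.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define NBLOCKS        10000   /* blocks in the toy table */
#define ROWS_PER_BLOCK 100     /* assumed uniform fill */
#define SAMPLE_ROWS    30000   /* target sample size */

int
main(void)
{
    static bool touched[NBLOCKS];
    long        nrows = (long) NBLOCKS * ROWS_PER_BLOCK;
    double      p = (double) SAMPLE_ROWS / nrows;
    int         blocks_read = 0;
    long        row;

    srand(42);

    /*
     * Row sampling: each row is chosen independently with probability p.
     * Count how many distinct blocks we end up having to read.
     */
    for (row = 0; row < nrows; row++)
    {
        if ((double) rand() / RAND_MAX < p)
        {
            long    blk = row / ROWS_PER_BLOCK;

            if (!touched[blk])
            {
                touched[blk] = true;
                blocks_read++;
            }
        }
    }
    printf("row sampling:   ~%d of %d blocks read\n", blocks_read, NBLOCKS);

    /*
     * Block sampling: read whole blocks until we have enough rows.
     * Far fewer blocks, but the rows arrive in correlated groups.
     */
    printf("block sampling: %d of %d blocks read\n",
           SAMPLE_ROWS / ROWS_PER_BLOCK, NBLOCKS);
    return 0;
}

With these numbers, row sampling reads roughly 9,500 of the 10,000 blocks to
collect 30,000 rows, while block sampling reads only 300.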

Keeping the discussion on reducing the impact of Bernoulli-sampled ANALYZE, I
was imagining we would take one snapshot per chunk of rows sized by the
default statistics_target, so that the default behavior would be unaffected.
Larger settings would be chunked according to the default, so
statistics_target = 10000 (the maximum) would take 10000/100 = 100 snapshots.
That approach allows people to vary the snapshot count using an existing
parameter if needed.
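
A minimal sketch of that arithmetic, as a standalone program; the function and
constant names here are hypothetical, and the real change would of course live
in ANALYZE's sample acquisition rather than anywhere like this:

#include <stdio.h>

#define DEFAULT_STATS_TARGET 100    /* default_statistics_target */

/*
 * One snapshot per default-target-sized chunk of the sample, so the
 * default setting keeps today's single-snapshot behavior.
 */
static int
snapshots_for_target(int statistics_target)
{
    int     n = statistics_target / DEFAULT_STATS_TARGET;

    return (n < 1) ? 1 : n;
}

int
main(void)
{
    int     targets[] = {10, 100, 1000, 10000};
    int     i;

    for (i = 0; i < 4; i++)
        printf("statistics_target = %5d -> %3d snapshot(s)\n",
               targets[i], snapshots_for_target(targets[i]));
    return 0;
}

This prints 1 snapshot for targets of 10 and 100, 10 for 1000, and 100 for
the maximum of 10000, matching the chunking described above.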

--
Simon Riggs http://www.2ndQuadrant.com/
Mission Critical Databases
