Re: PROC_IN_ANALYZE stillborn 13 years ago

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: PROC_IN_ANALYZE stillborn 13 years ago
Date: 2020-08-07 21:37:27
Message-ID: 20200807213727.qjfvltalwt63xbib@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-08-06 18:02:26 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > In fact, conceptually using something like a new snapshot for each
> > sample tuple actually seems like it'd be somewhat of an improvement
> > over using a single snapshot.
>
> Dunno, that feels like a fairly bad idea to me. It seems like it would
> overemphasize the behavior of whatever queries happened to be running
> concurrently with the ANALYZE. I do follow the argument that using a
> single snapshot for the whole ANALYZE overemphasizes a single instant
> in time, but I don't think that leads to the conclusion that we shouldn't
> use a snapshot at all.

I didn't actually want to suggest that we should take a separate
snapshot for every sampled row - that'd be excessively costly. What I
wanted to say was that I don't see a clear accuracy benefit. E.g. not
seeing any of the values inserted more recently will under-emphasize
those in the histogram.
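To make that concrete, here's a toy sketch (plain Python, not
PostgreSQL code; all names and the data distribution are made up) of
how a sample taken under a single snapshot can't contain any value
committed after the snapshot, so recently inserted values are missing
from the resulting histogram:

```python
# Toy illustration: sampling under a single snapshot excludes rows
# committed after the snapshot was taken, skewing the sample.
import random

random.seed(0)

# Rows as (value, commit_time); the table keeps growing while ANALYZE runs.
rows = [(random.gauss(10, 2), t) for t in range(1000)]          # old rows
rows += [(random.gauss(50, 2), 1000 + t) for t in range(200)]   # recent inserts

snapshot_time = 1000  # ANALYZE's single snapshot predates the recent inserts

visible = [value for value, t in rows if t < snapshot_time]
sample = random.sample(visible, 100)

# None of the recently inserted values (clustered around 50) can appear
# in the sample, so a histogram built from it under-represents them.
print(max(sample))  # well below 50
```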

What precisely do you mean by "overemphasize" above? I mean those will
be the rows most likely to still be live after the ANALYZE is done, so
including them doesn't seem like a bad thing to me?

> Another angle that would be worth considering, aside from the issue
> of whether the sample used for pg_statistic becomes more or less
> representative, is what impact all this would have on the tuple count
> estimates that go to the stats collector and pg_class.reltuples.
> Right now, we don't have a great story at all on how the stats collector's
> count is affected by combining VACUUM/ANALYZE table-wide counts with
> the incremental deltas reported by transactions happening concurrently
> with VACUUM/ANALYZE. Would changing this behavior make that better,
> or worse, or about the same?

Hm. Vacuum already counts rows that are inserted concurrently with the
vacuum scan, if it encounters them. Analyze doesn't. Seems like we'd at
least be wrong in a more consistent manner than before...

IIUC both analyze and vacuum will overwrite concurrent changes to
n_live_tuples. So taking concurrently committed changes into account
seems like it'd be the right thing?
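A toy model of that overwrite behavior (plain Python, not PostgreSQL
code; the numbers are invented for illustration): concurrent
transactions report incremental deltas, but ANALYZE then replaces the
counter with its own absolute count, so any rows committed during the
scan that ANALYZE's snapshot couldn't see simply vanish from the total:

```python
# Toy model of the n_live_tuples accounting issue: ANALYZE reports an
# absolute row count that overwrites the stats collector's value,
# discarding deltas from transactions committed during the scan.

n_live_tuples = 1000      # collector's view before ANALYZE starts
analyze_count = 1000      # rows visible to ANALYZE's snapshot
concurrent_inserts = 50   # committed while ANALYZE was running

# Concurrent transactions report their deltas incrementally...
n_live_tuples += concurrent_inserts

# ...but ANALYZE then overwrites the counter with its snapshot-bound
# count, losing the concurrently committed rows.
n_live_tuples = analyze_count
print(n_live_tuples)  # 1000 - the 50 concurrent inserts are gone

# If ANALYZE counted concurrently committed rows it encounters (as
# VACUUM does), the overwrite would land closer to the truth:
n_live_tuples = analyze_count + concurrent_inserts
print(n_live_tuples)  # 1050
```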

We probably could make this more accurate by accounting separately for
"recently inserted and committed" rows, and taking the difference of
n_live_tuples before/after into account. But I'm a bit doubtful that
it's worth it?

Greetings,

Andres Freund
