Re: PROC_IN_ANALYZE stillborn 13 years ago

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: PROC_IN_ANALYZE stillborn 13 years ago
Date: 2020-08-07 21:37:27
Message-ID: 20200807213727.qjfvltalwt63xbib@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-08-06 18:02:26 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > In fact, conceptually using something like a new snapshot for each
> > sample tuple actually seems like it'd be somewhat of an improvement
> > over using a single snapshot.
>
> Dunno, that feels like a fairly bad idea to me. It seems like it would
> overemphasize the behavior of whatever queries happened to be running
> concurrently with the ANALYZE. I do follow the argument that using a
> single snapshot for the whole ANALYZE overemphasizes a single instant
> in time, but I don't think that leads to the conclusion that we shouldn't
> use a snapshot at all.

I didn't actually want to suggest that we should take a separate
snapshot for every sampled row - that'd be excessively costly. What I
wanted to say was that I don't see a clear accuracy benefit. E.g. not
seeing any of the values inserted more recently will under-emphasize
those in the histogram.
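To make that concrete, here's a toy sketch (plain Python, not
PostgreSQL code; all names and the data distribution are made up) of
how a sample taken under a single snapshot can't contain any value
committed after the snapshot, so recently inserted values are missing
from the resulting histogram:

```python
# Toy illustration: sampling under a single snapshot excludes rows
# committed after the snapshot was taken, skewing the sample.
import random

random.seed(0)

# Rows as (value, commit_time); the table keeps growing while ANALYZE runs.
rows = [(random.gauss(10, 2), t) for t in range(1000)]          # old rows
rows += [(random.gauss(50, 2), 1000 + t) for t in range(200)]   # recent inserts

snapshot_time = 1000  # ANALYZE's single snapshot predates the recent inserts

visible = [value for value, t in rows if t < snapshot_time]
sample = random.sample(visible, 100)

# None of the recently inserted values (clustered around 50) can appear
# in the sample, so a histogram built from it under-represents them.
print(max(sample))  # well below 50
```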

What precisely do you mean by "overemphasize" above? I mean those will
be the rows most likely to still be live after the ANALYZE is done, so
including them doesn't seem like a bad thing to me?

> Another angle that would be worth considering, aside from the issue
> of whether the sample used for pg_statistic becomes more or less
> representative, is what impact all this would have on the tuple count
> estimates that go to the stats collector and pg_class.reltuples.
> Right now, we don't have a great story at all on how the stats collector's
> count is affected by combining VACUUM/ANALYZE table-wide counts with
> the incremental deltas reported by transactions happening concurrently
> with VACUUM/ANALYZE. Would changing this behavior make that better,
> or worse, or about the same?

Hm. Vacuum already counts rows that are inserted concurrently with the
vacuum scan, if it encounters them. Analyze doesn't. Seems like we'd at
least be wrong in a more consistent manner than before...

IIUC both analyze and vacuum will overwrite concurrent changes to
n_live_tuples. So taking concurrently committed changes into account
seems like it'd be the right thing?
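A toy model of that overwrite behavior (plain Python, not PostgreSQL
code; the numbers are invented for illustration): concurrent
transactions report incremental deltas, but ANALYZE then replaces the
counter with its own absolute count, so any rows committed during the
scan that ANALYZE's snapshot couldn't see simply vanish from the total:

```python
# Toy model of the n_live_tuples accounting issue: ANALYZE reports an
# absolute row count that overwrites the stats collector's value,
# discarding deltas from transactions committed during the scan.

n_live_tuples = 1000      # collector's view before ANALYZE starts
analyze_count = 1000      # rows visible to ANALYZE's snapshot
concurrent_inserts = 50   # committed while ANALYZE was running

# Concurrent transactions report their deltas incrementally...
n_live_tuples += concurrent_inserts

# ...but ANALYZE then overwrites the counter with its snapshot-bound
# count, losing the concurrently committed rows.
n_live_tuples = analyze_count
print(n_live_tuples)  # 1000 - the 50 concurrent inserts are gone

# If ANALYZE counted concurrently committed rows it encounters (as
# VACUUM does), the overwrite would land closer to the truth:
n_live_tuples = analyze_count + concurrent_inserts
print(n_live_tuples)  # 1050
```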

We probably could make this more accurate by accounting separately for
"recently inserted and committed" rows, and taking the difference of
n_live_tuples before/after into account. But I'm a bit doubtful that
it's worth it?

Greetings,

Andres Freund
