Re: Merging statistics from children instead of re-sampling everything

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: d(dot)belyalov(at)postgrespro(dot)ru
Subject: Re: Merging statistics from children instead of re-sampling everything
Date: 2022-02-11 15:12:16
Message-ID: 901c13d7-b37f-3a2d-062a-3696d58d030c@enterprisedb.com
Lists: pgsql-hackers

On 2/11/22 05:29, Andrey V. Lepikhov wrote:
> On 2/11/22 03:37, Tomas Vondra wrote:
>> That being said, this thread was not really about foreign partitions,
>> but about re-analyzing inheritance trees in general. And sampling
>> foreign partitions doesn't really solve that - we'll still do the
>> sampling over and over.
> IMO, to solve the problem we should do two things:
> 1. Avoid repeated partition scans in the case of an inheritance tree.
> 2. Avoid re-analyzing everything when the active changes are confined to a
> small subset of partitions.
>
> For (1) I can imagine a solution like multiplexing: at the stage of
> deciding which relations to scan, group them and prepare the scan
> parameters so that multiple samples are collected in one pass.
> It looks like we need separate logic for the analysis of partitioned
> tables - we should build and cache samples on each partition before an
> analysis.
> This requires a prototype to understand the complexity of such a solution,
> and it can be done separately from (2).
>

I'm not sure I understand what you mean by multiplexing. The term
usually means "sending multiple signals at once", but I don't see how
that applies to this issue. Can you elaborate?

I assume you mean something like collecting a sample for a partition
once, and then keeping and reusing the sample for future ANALYZE runs,
until invalidated in some sense.

Yeah, I agree that'd be useful - and not just for partitions, actually.
I've been pondering something like that for regular tables, because the
sample might be used directly for estimating clause selectivities.
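
Just to sketch what I have in mind (entirely hypothetical - none of these
structs or functions exist in PostgreSQL), building a parent-level sample
from cached per-partition samples might look roughly like this, drawing
from each cached sample in proportion to the partition's row count:

    #include <stdlib.h>

    typedef struct PartitionSample
    {
        double      reltuples;      /* estimated rows in the partition */
        int         nrows;          /* rows in the cached sample */
        void      **rows;           /* the sampled rows (opaque here) */
    } PartitionSample;

    /*
     * Fill "out" (capacity targetrows) by drawing rows from each
     * partition's cached sample, proportionally to the partition's share
     * of the total row count.  Returns the number of rows copied.
     */
    static int
    merge_partition_samples(PartitionSample *parts, int nparts,
                            void **out, int targetrows)
    {
        double      total = 0.0;
        int         nout = 0;

        for (int i = 0; i < nparts; i++)
            total += parts[i].reltuples;

        if (total <= 0)
            return 0;

        for (int i = 0; i < nparts && nout < targetrows; i++)
        {
            /* partition's proportional share of the merged sample */
            int         want = (int) (targetrows * parts[i].reltuples / total);

            if (want > parts[i].nrows)
                want = parts[i].nrows;  /* cached sample may be smaller */

            for (int j = 0; j < want && nout < targetrows; j++)
            {
                /*
                 * Pick a random row from the cached sample (with
                 * replacement, for simplicity - a real implementation
                 * would sample without replacement).
                 */
                out[nout++] = parts[i].rows[rand() % parts[i].nrows];
            }
        }

        return nout;
    }

Obviously this ignores invalidation, per-column statistics targets and so
on - it's only meant to show that the merge step itself is cheap once the
per-partition samples exist.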

But it requires storing the sample somewhere, and I haven't found a good
and simple way to do that. We could serialize it into a bytea, or we
could create a new relation fork, but it's unclear how either would handle
oversized attributes (how would TOAST work for a fork?) and/or large
samples (which might not fit into a 1GB bytea).

> Task (2) is more difficult to solve. Here we could store samples from each
> partition in the values[] field of pg_statistic, or in a dedicated table
> which stores a 'most probable values' snapshot of each table.

I think storing samples in pg_statistic is problematic, because values[]
is subject to the 1GB limit etc. That's not an issue for the small MCV
list of a single attribute, but it certainly is for larger samples. And
even if the data fit, the size of pg_statistic would explode.

> The most difficult problem here, as you mentioned, is the ndistinct value.
> Is it possible to store not an exactly calculated value of ndistinct, but
> an 'expected value' based on analysis of the samples and histograms on the
> partitions? Such a value could also solve the problem of estimating the
> grouping of a SETOP result (joins of them, etc.), where we have statistics
> only on the sources of the union.
>

I think ndistinct is a problem only when merging final estimates. But if
we're able to calculate and store some intermediate state, we can easily
use HLL counters and merge those.
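
To illustrate (a standalone sketch, not anything from PostgreSQL's
lib/hyperloglog.c): merging two HLL counters with the same number of
registers is just a register-wise max, and the usual estimator can then
be applied to the merged registers:

    #include <math.h>
    #include <stdint.h>

    #define HLL_REGISTERS 1024      /* m = 2^10 registers per counter */

    /* Merge "child" into "parent": register-wise maximum. */
    static void
    hll_merge(uint8_t *parent, const uint8_t *child)
    {
        for (int i = 0; i < HLL_REGISTERS; i++)
            if (child[i] > parent[i])
                parent[i] = child[i];
    }

    /* Raw HyperLogLog estimate (no small/large-range corrections). */
    static double
    hll_estimate(const uint8_t *regs)
    {
        double      sum = 0.0;
        double      alpha = 0.7213 / (1.0 + 1.079 / HLL_REGISTERS);

        for (int i = 0; i < HLL_REGISTERS; i++)
            sum += pow(2.0, -(double) regs[i]);

        return alpha * HLL_REGISTERS * HLL_REGISTERS / sum;
    }

Storing one such register array per partition next to the sample would let
us re-derive the parent-level ndistinct after re-analyzing any subset of
partitions, without touching the others.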

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
