Re: Merging statistics from children instead of re-sampling everything

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Merging statistics from children instead of re-sampling everything
Date: 2022-02-10 22:37:26
Message-ID: 4e86ae74-4e2c-b40f-4405-035d2f818e5d@enterprisedb.com
Lists: pgsql-hackers

On 2/10/22 12:50, Andrey Lepikhov wrote:
> On 21/1/2022 01:25, Tomas Vondra wrote:
>> But I don't have a very good idea what to do about statistics that we
>> can't really merge. For some types of statistics it's rather tricky to
>> reasonably merge the results - ndistinct is a simple example, although
>> we could work around that by building and merging hyperloglog counters.
>
> I think, as a first step, we can reduce the number of tuples pulled.
> We don't really need to pull all tuples from a remote server. To
> construct a reservoir, we can pull only a sample of tuples. The
> reservoir method needs only a few arguments to return a sample as if
> the tuples were read locally. Also, to fetch these partial samples
> asynchronously, we can get the size of each partition in a preliminary
> step of the analysis. In my opinion, even this solution can drastically
> reduce the severity of the problem.
>

Oh, wow! I hadn't realized we're fetching all the rows from foreign
(postgres_fdw) partitions. For local partitions we already sample,
because that uses the usual acquire function, with a reservoir
proportional to partition size. I had assumed we use tablesample to
fetch just a small fraction of rows from FDW partitions, and I agree
doing that would be a pretty huge benefit.
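
To illustrate what "a reservoir proportional to partition size" means,
here is a minimal sketch. It is not the actual PostgreSQL code: the
function names are hypothetical, partition sizes are assumed to be
known in blocks, and the reservoir uses the simple Algorithm R rather
than whatever variant the real acquire function implements.

```python
import random


def allocate_sample_rows(partition_blocks, target_rows):
    """Split target_rows across partitions in proportion to their size.

    partition_blocks maps a (hypothetical) partition name to its size
    in blocks. Each share is capped so the total never exceeds
    target_rows.
    """
    total_blocks = sum(partition_blocks.values())
    if total_blocks == 0:
        return {name: 0 for name in partition_blocks}

    allocation = {}
    remaining = target_rows
    for name, blocks in partition_blocks.items():
        share = round(target_rows * blocks / total_blocks)
        share = min(share, remaining)
        allocation[name] = share
        remaining -= share
    return allocation


def reservoir_sample(rows, k, rng):
    """Algorithm R: uniform sample of up to k rows from a stream.

    Only one pass over rows is needed and the stream length does not
    have to be known in advance.
    """
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            # Pick a slot in [0, i]; the row survives with probability k/(i+1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    shares = allocate_sample_rows({"p1": 100, "p2": 300}, 30000)
    print(shares)
    print(len(reservoir_sample(range(100000), shares["p1"], random.Random(1))))
```

The point of the sketch is only that each child contributes a sample
sized by its share of the whole tree, so no child needs to ship more
rows than its share.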

I actually tried hacking that together - there are a couple of problems
with that (e.g. determining what fraction to sample using
bernoulli/system), but in principle it seems quite doable. Some minor
changes to the FDW API may be necessary, I'm not sure.
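
As a rough sketch of the "what fraction to sample" question, the
following derives a TABLESAMPLE percentage from a row-count estimate
and a target sample size. This is an assumption about how it could be
done, not what the hack or postgres_fdw does: the helper name is
hypothetical, and reltuples is assumed to come from the remote
server's estimate, which may be stale or missing.

```python
def tablesample_clause(reltuples, target_rows, method="BERNOULLI"):
    """Return a TABLESAMPLE clause aiming for roughly target_rows rows.

    Returns an empty string when sampling would not help: the row
    estimate is unknown (<= 0) or the table is no larger than the
    requested sample, so all rows should be read.
    """
    if method not in ("BERNOULLI", "SYSTEM"):
        raise ValueError(f"unsupported sampling method: {method}")
    if reltuples <= 0 or target_rows >= reltuples:
        return ""
    # TABLESAMPLE takes a percentage in the range 0..100.
    percent = 100.0 * target_rows / reltuples
    return f"TABLESAMPLE {method} ({percent:.4f})"


if __name__ == "__main__":
    print(tablesample_clause(1_000_000, 30_000))
    print(repr(tablesample_clause(0, 30_000)))
```

Note that BERNOULLI returns only approximately the requested number of
rows, and SYSTEM samples whole blocks, so the result may need to be
trimmed or topped up; that is part of why picking the fraction is not
trivial.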

I'm not sure about the async execution - that seems way more
complicated, and while sampling reduces the total cost, async just
parallelizes it.

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling
foreign partitions doesn't really solve that - we'll still do the
sampling over and over.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
