Re: Multidimensional Histograms

From: Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Alexander Cheshev <alex(dot)cheshev(at)gmail(dot)com>
Subject: Re: Multidimensional Histograms
Date: 2024-01-07 17:26:43
Message-ID: 8af87595-2e6f-43cd-9a20-cc5489c18e49@postgrespro.ru
Lists: pgsql-hackers

On 7/1/2024 17:51, Tomas Vondra wrote:
> On 1/7/24 11:22, Andrei Lepikhov wrote:
>> On 7/1/2024 06:54, Tomas Vondra wrote:
>>> It's an interesting area for experiments, no doubt about it. And if you
>>> choose to explore it, that's fine. But it's better to be aware it may
>>> not end with a commit.
>>> For the multi-dimensional case, I propose we first try to experiment
>>> with the various algorithms, and figure out what works etc. Maybe
>>> implementing them in python or something would be easier than C.
>>
>> Curiously, trying to utilize extended statistics for some problematic
>> cases, I am experimenting with auto-generating such statistics by
>> definition of indexes [1]. Doing that, I wanted to add some hand-made
>> statistics like a multidimensional histogram or just a histogram which
>> could help to perform estimation over a set of columns/expressions.
>> I realized that current hooks get_relation_stats_hook and
>> get_index_stats_hook are insufficient if I want to perform an estimation
>> over a set of ANDed quals on different columns.
>> In your opinion, is it possible to add a hook into the extended
>> statistics to allow for an extension to propose alternative estimation?
>>
>> [1] https://github.com/danolivo/pg_index_stats
>>
>
> No idea, I haven't thought about that very much. Presumably the existing
> hooks are insufficient because they're per-attnum? I guess it would make
> sense to have a hook for all the attnums of the relation, but I'm not
> sure it'd be enough to introduce a new extended statistics kind ...

I got stuck on the same problem Alexander mentioned: we usually have
large tables with many uniformly distributed values. In this case, MCV
lists don't help much.
Usually, I face problems when scanning a table with a filter containing
3-6 ANDed quals. Here, Postgres multiplies the selectivities and ends up
with an estimate of less than one tuple. But in reality, such scans
mostly have some physical meaning and return a bunch of tuples. It looks
like the set of columns represents a single value of a composite type.
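
To illustrate the underestimation described above, here is a minimal
Python sketch (not PostgreSQL code). The table, its size, and the helper
names independent_estimate and actual_count are hypothetical: three
columns are each uniformly distributed but fully determined by one
hidden composite key, so multiplying per-column selectivities gives far
less than one row while the scan really returns 100.

```python
# Hypothetical table: 100,000 rows, 1,000 distinct composite values.
# Each column is uniform over 0..999, but the columns are fully correlated.
rows = [(k, (k * 7) % 1000, (k * 13) % 1000)
        for k in (i % 1000 for i in range(100_000))]


def independent_estimate(rows, values):
    """Row estimate for x1=v1 AND x2=v2 AND ... assuming independent columns."""
    n = len(rows)
    selectivity = 1.0
    for col, value in enumerate(values):
        matches = sum(1 for r in rows if r[col] == value)
        selectivity *= matches / n  # per-column selectivity
    return selectivity * n


def actual_count(rows, values):
    """Number of rows that really satisfy all equality quals."""
    target = tuple(values)
    return sum(1 for r in rows if r == target)


quals = (42, 294, 546)  # one existing composite value
print("independent estimate:", independent_estimate(rows, quals))
print("actual rows:", actual_count(rows, quals))
```

Each column alone has a selectivity of 0.001, so the product over three
columns is about 1e-9 of the table, which is orders of magnitude below
the real row count.
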
Sometimes extended statistics on dependencies help a lot, but they are
expensive for multiple columns. And sometimes I see that even a trivial
histogram on ROW(x1,x2,...) could predict a much more adequate value (a
kind of conservative upper estimate) for a clause like "x1=N1 AND x2=N2
AND ..." if somewhere in an extension we transformed it into
ROW(x1,x2,...) = ROW(N1,N2,...).
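
A rough Python sketch of the ROW(x1,x2,...) histogram idea, under stated
assumptions: it is not PostgreSQL code, the functions
build_row_histogram and upper_estimate are hypothetical names, and the
histogram is a plain equi-depth histogram over whole tuples compared
lexicographically. An equality on the whole row can only fall into the
buckets whose bounds enclose it, so the number of such buckets times the
bucket depth serves as a conservative upper estimate.

```python
import bisect

# Hypothetical table: 100,000 rows, 1,000 distinct composite values.
rows = [(k, (k * 7) % 1000, (k * 13) % 1000)
        for k in (i % 1000 for i in range(100_000))]


def build_row_histogram(rows, n_buckets):
    """Equi-depth histogram over whole rows, ordered lexicographically."""
    ordered = sorted(rows)
    n = len(ordered)
    bounds = [ordered[min(n - 1, (i * n) // n_buckets)]
              for i in range(n_buckets + 1)]
    depth = -(-n // n_buckets)  # ceiling: largest possible bucket size
    return bounds, depth


def upper_estimate(bounds, depth, key):
    """Upper bound on rows matching ROW(x1, x2, ...) = key."""
    key = tuple(key)
    if key < bounds[0] or key > bounds[-1]:
        return 0
    n_buckets = len(bounds) - 1
    lo = bisect.bisect_left(bounds, key)   # first bound >= key
    hi = bisect.bisect_right(bounds, key)  # first bound > key
    first = max(0, lo - 1)
    last = min(n_buckets - 1, hi - 1)
    # Every bucket whose [lower, upper] bounds enclose the key may hold it.
    return (last - first + 1) * depth


bounds, depth = build_row_histogram(rows, n_buckets=100)
print("upper estimate:", upper_estimate(bounds, depth, (42, 294, 546)))
```

With 100 buckets over this table the bound is 1000 rows against 100
actual rows: an overestimate, but a far less damaging one than a
sub-one-row estimate. More buckets would tighten the bound.
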
For such cases we don't have an in-core solution, and introducing a hook
on clause list estimation (perhaps paired with a hook on statistics
generation) could help create an extension that deals with that
problem. It would also open the way for experiments with different types
of extended statistics ...

--
regards,
Andrei Lepikhov
Postgres Professional
