Re: estimating # of distinct values

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: estimating # of distinct values
Date: 2010-12-28 00:03:47
Message-ID: 4D192963.8080305@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dne 27.12.2010 22:46, Robert Haas napsal(a):
> 2010/12/27 Tomas Vondra <tv(at)fuzzy(dot)cz>:
>> But even though these disadvantages, there really is no other
>> way to enhance the estimates. I don't think this should be a
>> default behavior - just as in case of cross-column stats this should
>> be optional when the current estimator does not work well.
>
> This is going to be a lot of work to implement, so before you do it,
> we should try to reach a consensus that (a) it's part of an overall
> strategy that the community generally supports and (b) we have
> consensus on the design for this part.

Yes, it is going to be a lot of work. But I don't think we need a
consensus of the whole community prior to building something that works.
I plan to build a very simple prototype soon, so let's talk about this
later.

I've started these two threads mainly as a 'brainstorming' - to present
what I've learned from the various papers, propose a possible solution,
highlight issues I see, and get ideas on how to solve them.

> With respect to (a), I have to admit I've found the discussion on
> cross-column stats to be quite difficult to follow. I'd like to see a
> very simple description of exactly what information we're going to
> store, under what circumstances we'll store it, and how we'll use it
> to compute selectivity estimates.

Yes, I know it was difficult to follow that discussion. That's why I
created a 'summary page' on wiki

http://wiki.postgresql.org/wiki/Cross_Columns_Stats

Generally we need to gather two types of stats:

(a) multi-dimensional histogram - this can be done about the same way
we create single-dimensional histograms, that's not a big problem
(although more data might be necessary to get accurate histograms)

(b) # of distinct values - this is the tricky part, as described in
this problem

> With respect to (b), I think I'd need to see a much more detailed
> design for how you intend to make this work. Off the top of my head
> there seems to be some pretty serious feasibility problems.

Well, it's not going to be easy, that's for sure. My "big plan" is
something like this:

(a) Build a simple 'contrib-like' module that allows manual collection
of stats and computing of estimates (not integrated with the
optimizer in any way).

This should help us better understand what stats do we need etc.

(b) Minimal integration with the core - still manual collection fo
stats, the optimizer uses the stats if available.

(c) Enhancements - automatic updates of the stats, etc.

And all this should be implemented so that you don't have to pay unless
you want to use the new stats.

regards
Tomas

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2010-12-28 00:19:47 Re: "writable CTEs"
Previous Message Kevin Grittner 2010-12-27 23:04:20 Re: estimating # of distinct values