Quick Links

Re: estimating # of distinct values

From:	Tomas Vondra <tv(at)fuzzy(dot)cz>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: estimating # of distinct values
Date:	2010-12-28 00:03:47
Message-ID:	4D192963.8080305@fuzzy.cz
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Dne 27.12.2010 22:46, Robert Haas napsal(a):
> 2010/12/27 Tomas Vondra <tv(at)fuzzy(dot)cz>:
>> But even though these disadvantages, there really is no other
>> way to enhance the estimates. I don't think this should be a
>> default behavior - just as in case of cross-column stats this should
>> be optional when the current estimator does not work well.
>
> This is going to be a lot of work to implement, so before you do it,
> we should try to reach a consensus that (a) it's part of an overall
> strategy that the community generally supports and (b) we have
> consensus on the design for this part.

Yes, it is going to be a lot of work. But I don't think we need a
consensus of the whole community prior to building something that works.
I plan to build a very simple prototype soon, so let's talk about this
later.

I've started these two threads mainly as a 'brainstorming' - to present
what I've learned from the various papers, propose a possible solution,
highlight issues I see, and get ideas on how to solve them.

> With respect to (a), I have to admit I've found the discussion on
> cross-column stats to be quite difficult to follow. I'd like to see a
> very simple description of exactly what information we're going to
> store, under what circumstances we'll store it, and how we'll use it
> to compute selectivity estimates.

Yes, I know it was difficult to follow that discussion. That's why I
created a 'summary page' on wiki

http://wiki.postgresql.org/wiki/Cross_Columns_Stats

Generally we need to gather two types of stats:

(a) multi-dimensional histogram - this can be done about the same way
we create single-dimensional histograms, that's not a big problem
(although more data might be necessary to get accurate histograms)

(b) # of distinct values - this is the tricky part, as described in
this problem

> With respect to (b), I think I'd need to see a much more detailed
> design for how you intend to make this work. Off the top of my head
> there seems to be some pretty serious feasibility problems.

Well, it's not going to be easy, that's for sure. My "big plan" is
something like this:

(a) Build a simple 'contrib-like' module that allows manual collection
of stats and computing of estimates (not integrated with the
optimizer in any way).

This should help us better understand what stats do we need etc.

(b) Minimal integration with the core - still manual collection fo
stats, the optimizer uses the stats if available.

And all this should be implemented so that you don't have to pay unless
you want to use the new stats.

regards
Tomas

In response to

Re: estimating # of distinct values at 2010-12-27 21:46:32 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Geoghegan	2010-12-28 00:19:47	Re: "writable CTEs"
Previous Message	Kevin Grittner	2010-12-27 23:04:20	Re: estimating # of distinct values