
Re: Cross-column statistics revisited

From:	"Nathan Boley" <npboley(at)gmail(dot)com>
To:	"Joshua Tolley" <eggyknap(at)gmail(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject:	Re: Cross-column statistics revisited
Date:	2008-10-17 21:47:31
Message-ID:	6fa3b6e20810171447o43c5d28ar3bb98e2cf5b47e5a@mail.gmail.com
Lists:	pgsql-hackers

>>> Right now our
>>> "histogram" values are really quantiles; the statistics_target T for a
>>> column determines a number of quantiles we'll keep track of, and we
>>> grab values from into an ordered list L so that approximately 1/T of
>>> the entries in that column fall between values L[n] and L[n+1]. I'm
>>> thinking that multicolumn statistics would instead divide the range of
>>> each column up into T equally sized segments,
>>
>> Why would you not use the same histogram bin bounds derived for the
>> scalar stats (along each axis of the matrix, of course)? This seems to
>> me to be arbitrarily replacing something proven to work with something
>> not proven. Also, the above forces you to invent a concept of "equally
>> sized" ranges, which is going to be pretty bogus for a lot of datatypes.
>
> Because I'm trying to picture geometrically how this might work for
> the two-column case, and hoping to extend that to more dimensions, and
> am finding that picturing a quantile-based system like the one we have
> now in multiple dimensions is difficult. I believe those are the same
> difficulties Gregory Stark mentioned having in his first post in this
> thread. But of course that's an excellent point, that what we do now
> is proven. I'm not sure which problem will be harder to solve -- the
> weird geometry or the "equally sized ranges" for data types where that
> makes no sense.
>

Look at copulas. They are a completely general way of describing the
dependence between two marginal distributions, separately from the
marginals themselves. It seems silly to rewrite the stats table in
terms of joint distributions when we'll still need the marginals
anyway. Also, it might be easier to think of the dimension reduction
problem in that form.
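
To make the copula idea concrete: by Sklar's theorem a joint
distribution H(x, y) can be written as C(F(x), G(y)), where F and G are
the marginals and C is the copula. A rough sketch of a discretized
empirical copula follows, assuming T equal-count bins per column (in the
spirit of the quantile bounds already kept for the scalar stats). The
function names and the T x T grid layout are hypothetical and only for
illustration; nothing here is taken from the PostgreSQL source.

```python
import bisect
import random


def quantile_bounds(values, t):
    """Return t-1 interior bounds that split `values` into t bins
    holding roughly equal numbers of entries."""
    s = sorted(values)
    n = len(s)
    return [s[(k * n) // t] for k in range(1, t)]


def empirical_copula_grid(xs, ys, t):
    """Return a t x t matrix of joint frequencies.

    Cell [i][j] is the fraction of rows whose x falls in quantile bin i
    and whose y falls in quantile bin j. Each row and column sums to
    about 1/t, so the grid carries only the dependence, not the marginals.
    """
    bx = quantile_bounds(xs, t)
    by = quantile_bounds(ys, t)
    n = len(xs)
    grid = [[0.0] * t for _ in range(t)]
    for x, y in zip(xs, ys):
        i = bisect.bisect_right(bx, x)  # bin index along the x axis
        j = bisect.bisect_right(by, y)  # bin index along the y axis
        grid[i][j] += 1.0 / n
    return grid


# Synthetic data, purely illustrative: one independent pair of columns
# and one strongly correlated pair.
random.seed(0)
xs = [random.random() for _ in range(4000)]
ys_indep = [random.random() for _ in range(4000)]
ys_dep = [x + random.gauss(0, 0.02) for x in xs]

for name, ys in (("independent", ys_indep), ("dependent", ys_dep)):
    print(name)
    for row in empirical_copula_grid(xs, ys, 4):
        print("  " + " ".join(f"{c:.3f}" for c in row))
```

For independent columns every cell should sit near 1/T^2, which is what
multiplying the per-column selectivities already assumes. For dependent
columns the mass moves away from that, here onto the diagonal. Only the
deviation from 1/T^2 is new information, so the existing per-column
stats can stay as they are.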

In response to

Re: Cross-column statistics revisited at 2008-10-17 03:17:03 from Joshua Tolley

Responses

Re: Cross-column statistics revisited at 2008-10-18 00:47:38 from Joshua Tolley
