Re: Cross-column statistics revisited

From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 03:17:03
Message-ID: e7e0a2570810162017q5d778b87n600c21874162cdd4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Oct 16, 2008 at 8:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Joshua Tolley" <eggyknap(at)gmail(dot)com> writes:
>> For what it's worth, neither version of correlation was what I had in
>> mind. Statistical correlation between two variables is a single
>> number, is fairly easy to calculate, and probably wouldn't help query
>> plans much at all. I'm more interested in a more complex data
>> gathering. The data model I have in mind (which I note I have *not*
>> proven to actually help a large number of query plans -- that's
>> obviously an important part of what I'd need to do in all this)
>> involves instead a matrix of frequency counts.
>
> Oh, I see the distinction you're trying to draw. Agreed on both points:
> a simple correlation number is pretty useless to us, and we don't have
> hard evidence that a histogram-matrix will solve the problem. However,
> we do know that one-dimensional histograms of the sort currently
> collected by ANALYZE work pretty well (if they're detailed enough).
> It seems reasonable to me to extrapolate that same concept to two or
> more dimensions. The issue then becomes that a "detailed enough"
> matrix might be impracticably bulky, so you need some kind of lossy
> compression, and we don't have hard evidence about how well that will
> work. Nonetheless the road map seems clear enough to me.

Oh goodie -- I feel so validated :)

>
>> Right now our
>> "histogram" values are really quantiles; the statistics_target T for a
>> column determines a number of quantiles we'll keep track of, and we
>> grab values from into an ordered list L so that approximately 1/T of
>> the entries in that column fall between values L[n] and L[n+1]. I'm
>> thinking that multicolumn statistics would instead divide the range of
>> each column up into T equally sized segments,
>
> Why would you not use the same histogram bin bounds derived for the
> scalar stats (along each axis of the matrix, of course)? This seems to
> me to be arbitrarily replacing something proven to work with something
> not proven. Also, the above forces you to invent a concept of "equally
> sized" ranges, which is going to be pretty bogus for a lot of datatypes.

Because I'm trying to picture geometrically how this might work for
the two-column case, and hoping to extend that to more dimensions, and
am finding that picturing a quantile-based system like the one we have
now in multiple dimensions is difficult. I believe those are the same
difficulties Gregory Stark mentioned having in his first post in this
thread. But of course that's an excellent point, that what we do now
is proven. I'm not sure which problem will be harder to solve -- the
weird geometry or the "equally sized ranges" for data types where that
makes no sense.

One option that came to mind recently would be to have statistics for
a subset of the rows in a single column A, where that subset is
defined by conditions on some other column B with a defined
relationship to A (for instance, that they're in the same table).
There are all kinds of complexities possible with that idea, but one
bright spot is that the values in the catalog and the way the planner
makes use of them would stay essentially the same as they are now.
That one's interesting to think about, but probably too complex for
real use. Anyway, I'll keep pondering.

- Josh / eggyknap

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Martijn van Oosterhout 2008-10-17 06:24:21 Re: Cross-column statistics revisited
Previous Message Tom Lane 2008-10-17 02:38:44 Re: Cross-column statistics revisited