From: | "Joshua Tolley" <eggyknap(at)gmail(dot)com> |
---|---|

To: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |

Subject: | Cross-column statistics revisited |

Date: | 2008-10-15 10:53:10 |

Message-ID: | e7e0a2570810150353h49826e8bh9dde3676109a7574@mail.gmail.com |

Views: | Raw Message | Whole Thread | Download mbox | Resend email |

Thread: | |

Lists: | pgsql-hackers |

I've been interested in what it would take to start tracking

cross-column statistics. A review of the mailing lists as linked from

the TODO item on the subject [1] suggests the following concerns:

1) What information exactly would be tracked?

2) How would it be kept from exploding in size?

3) For which combinations of columns would statistics be kept?

The major concern in #1 seemed to be that the most suitable form for

keeping most common value lists, histograms, etc. is in an array, and

at the time of the posts I read, arrays of composite types weren't

possible. This seems much less of a concern now -- perhaps in greatest

part because a test I just did against a recent 8.4devel sure makes it

look like stats on composite type columns aren't even kept. The most

straightforward is that we'd keep a simple multi-dimensional

histogram, but that leads to a discussion of #2.

#2 is an interesting problem, and possibly the most difficult from a

theoretical perspective. Provided a fair expectation that the results

would be useful, I'd like to play with various forms of statistical

regressions to compress large cross-column histograms. But I'm not

sure I want to spend the time if there are other serious issues with

the whole idea of cross-column stats. Another possibility involves not

storing a histogram at all, but instead storing an array of quantiles

for each column. In other words, if S represents the statistics target

of a column, we'd have an array A of S+1 entries of the type in

question, where the beginning and end of the array are the minimum and

maximum values in the column, and the range between any two

consecutive array elements A[n] and A[n+1] contains 1/S of the total

entries in the column. For a uniformly distributed column, these array

elements would be evenly spaced across the total range of the column;

the difference between consecutive array elements would indicate

deviations from this distribution. Unfortunately this only works if

there's a valid distance metric for the data type in question.

In the archives, #3 was raised frequently as a concern -- obviously

tracking stats on all permutations of the columns in a database is a

non-starter. However there seemed to be little argument over sensible

defaults, namely that columns involved in a multi-column index would

have cross-column statistics kept automatically, and the user would be

allowed to instruct the server to keep stats on other arbitrary groups

of columns. There was some discussion about the fact that the user

would have to be smart about how (s)he used this feature, but there

are enough other potential foot guns in the system of similar

magnitude that worrying too much about that seems of little use.

I bring all this up because I'm interested in looking at how this

might be done, and perhaps doing it if 1) time permits, and 2) someone

else doesn't beat me to it. Comments, anyone? Am I missing something

obvious/serious/etc.? Thanks in advance.

- Josh / eggyknap

[1] "Allow accurate statistics to be collected on indexes with more

than one column or expression indexes, perhaps using per-index

statistics", http://wiki.postgresql.org/wiki/Todo

- Re: Cross-column statistics revisited at 2008-10-15 13:51:37 from Gregory Stark
- Re: Cross-column statistics revisited at 2008-10-16 17:11:26 from Martijn van Oosterhout

From | Date | Subject | |
---|---|---|---|

Next Message | Andrew Dunstan | 2008-10-15 13:16:38 | Re: 8.3 .4 + Vista + MingW + initdb = ACCESS_DENIED |

Previous Message | Dave Page | 2008-10-15 09:59:15 | Re: 8.3 .4 + Vista + MingW + initdb = ACCESS_DENIED |