From: | Tomas Vondra <tv(at)fuzzy(dot)cz> |
---|---|

To: | pgsql-hackers(at)postgresql(dot)org |

Subject: | Re: proposal : cross-column stats |

Date: | 2010-12-17 20:50:34 |

Message-ID: | 4D0BCD1A.4080001@fuzzy.cz |

Views: | Raw Message | Whole Thread | Download mbox | Resend email |

Thread: | |

Lists: | pgsql-hackers |

Hi,

I've read about 10 papers about estimation this week, and I have gained

some insight. I'm not going to describe each of the papers here, I plan

to set up a wiki page about cross column statistics

http://wiki.postgresql.org/wiki/Cross_Columns_Stats

and I'll put the list of papers and some basic info there. There was a

lot of discussion in this thread, and it's difficult to follow it, so

I'll put there some details about the proposal etc.

So I'll just briefly describe the interesting things I've learned from

those papers.

---------------------- instances of the problem ----------------------

Generally, there are four quite different cases of the problem.

First, the columns may be either real-valued or discrete. And by

'discrete' I mean the value is rather a label than a value. It does not

make sense to add or subtract them, etc. So for example city names or

zip codes are discrete values (although zip codes are numbers etc).

Second, the queries are consist of equality or inequality (range)

conditions.

So actually there are four instances of the problem:

| equality conditions | inequality conditions |

================================================================

discrete values | A | D |

numeric values | C | B |

----------------------------------------------------------------

I think those four problems should be treated separately, although some

of the techniques may be common.

A) discrete values and inequality conditions

One of the papers (A Bayesian Approach to Estimating The Selectivity

of Conjuctive Predicates) describes a quite interesting solution to

this problem - I've already posted a description on how to apply it

to the ZIP code / city name problem - see this

http://archives.postgresql.org/pgsql-hackers/2010-12/msg01576.php

B) numeric values and inequality conditions

Most of the papers dealing with this problem are based on

discretization and multi-dimensional histograms to approximate the

joint distribution. So I believe my initial proposal was not a

complete nonsense.

Once we have a working histogram-based solution, we can work on

precision and efficiency (how much space is needed to store it, how

long does it take to compute an estimate, etc.). There are two

ways to do that.

First, most of the papers offer an enhanced type of histogram

(although the histogram proposed in the paper is always evaluated as

the most efficient one, which is a bit suspicious). Anyway there are

advanced quite-promising ways to build multi-dimensional histograms.

Second, the paper "Selectivity estimation using probabilistic

models") describes a solution based on bayesian networks. That means

really really promising (actually it addresses join selectivity

estimation too).

So yeas, I'm quite confident we can solve this issue and improve the

efficiency - either by an advanced histogram or using bayesian

networks.

C) numeric values and equality conditions

OK, I'm not sure how to solve this case. But the queries against

numeric queries are range queries in most cases I guess, so maybe

that's not that big deal.

D) discrete values and inequality conditions

Basically, this can be handled just like numeric values after

discretization, i.e. using a histogram. But this is usually

E) a combination of discrete / numeric columns

I'm not sure how to deal with this. Obviously it's possible to

build multi-dimensional histogram, and estimate as many queries as

possible.

-----------------------------------------------------------------------

The papers describe some interesting alternative approaches, e.g. based

on SVD (singular value decomposition) or random sampling (or kernel

estimators, which an enhanced version of sampling).

But there are various issues associated with those solutions. SVD is

applicable to 2D only, so we'd be stuck with 2 columns, and random

sampling sounds a bit suspicious to me).

regards

Tomas

- Re: proposal : cross-column stats at 2010-12-13 19:22:38 from Tomas Vondra

From | Date | Subject | |
---|---|---|---|

Next Message | Robert Haas | 2010-12-17 20:50:52 | Re: unlogged tables vs. GIST |

Previous Message | Bill Moran | 2010-12-17 20:49:25 | Re: Why don't we accept exponential format for integers? |