Re: Simple join optimized badly?

From: Mark Lewis <mark(dot)lewis(at)mir3(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>, Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-11 15:07:40
Message-ID: 1160579260.8082.121.camel@archimedes
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


I'm interested in the problem of cross-column statistics from a
theoretical perspective. It would be interesting to sit down and try to
reason out a useful solution, or at very least to understand the problem
better so I can anticipate when it might come and eat me.

>From my understanding, the main problem is that if PG knows the
selectivity of n conditions C1,C2,...,Cn then it doesn't know whether
the combined selectivity will be C1*C2*...*Cn (conditions are
independent) or max(C1,C2,...,Cn) (conditions are strictly dependent),
or somewhere in the middle. Therefore, row estimates could be orders of
magnitude off.

I suppose a common example would be a table with a serial primary key
column and a timestamp value which is always inserted as
CURRENT_TIMESTAMP, so the two columns are strongly correlated. If the
planner guesses that 1% of the rows of the table will match pk>1000000,
and 1% of the rows of the table will match timestamp > X, then it would
be nice for it to know that if you specify both "pk>1000000 AND
timestamp>X" that the combined selectivity is still only 1% and not 1% *
1% = 0.01%.

As long as I'm sitting down and reasoning about the problem anyway, are
there any other types of cases you're aware of where some form of cross-
column statistics would be useful? In the unlikely event that I
actually come up with a brilliant and simple solution, I'd at least like
to make sure that I'm solving the right problem :)

Mark Lewis

On Tue, 2006-10-10 at 22:38 -0400, Tom Lane wrote:
> Brian Herlihy <btherl(at)yahoo(dot)com(dot)au> writes:
> > What would it take for hints to be added to postgres?
> A *whole lot* more thought and effort than has been expended on the
> subject to date.
> Personally I have no use for the idea of "force the planner to do
> exactly X given a query of exactly Y". You don't have exactly Y
> today, tomorrow, and the day after (if you do, you don't need a
> hint mechanism at all, you need a mysql-style query cache).
> IMHO most of the planner mistakes we see that could be fixed via
> hinting are really statistical estimation errors, and so the right
> level to be fixing them at is hints about how to estimate the number
> of rows produced for given conditions. Mind you that's still a plenty
> hard problem, but you could at least hope that a hint of that form
> would be useful for more than one query.
> regards, tom lane
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Brendan Curran 2006-10-11 16:53:41 Re: Scrub one large table against another
Previous Message Bruce Momjian 2006-10-11 14:53:30 Re: Simple join optimized badly?