Skip site navigation (1) Skip section navigation (2)

Re: More thoughts about planner's cost estimates

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: More thoughts about planner's cost estimates
Date: 2006-06-01 20:50:26
Message-ID: 200606011350.26364.josh@agliodbs.com (view raw or flat)
Thread:
Lists: pgsql-hackers
Greg, Tom,

> a) We already use block based sampling to reduce overhead. If you're
> talking about using the entire block and not just randomly sampled
> tuples from within those blocks then your sample will be biased.

There are actually some really good equations to work with this, estimating 
both row correlation and n-distinct based on sampling complete blocks.  
See prior discussions on this list on N-distinct.

> 1) You have n^2 possible two-column combinations. That's a lot of
> processing and storage.

Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

> 2) It isn't even clear what data you're exactly looking for. Certainly
>    "correlation" is just shorthand here and isn't what you're actually
> looking for. 

Actually, I'd think that a correlation number estimate (0 = complete 
uncorrelated, 1 = completely correlated) would be sufficient to improve 
row count estimation significantly, without incurring the vast overhead of 
histogramXhistorgram manipulation.

> Those are all valid points, but pretty much orthogonal to what I'm on
> about today.  The problems I'm thinking about are seen even when the
> planner's rowcount estimates are dead on (as indeed they mostly were
> in Philippe's example).

OK, I was afraid they were interdependant.  You would know better than me.

> Well, it'll still exist as a GUC for the same reasons the other ones are
> GUCs.  But I think the main reasons people have needed to twiddle it are
> 	(1) their database is completely RAM-resident (where RPC
> 	    *should* logically be 1.0), or
> 	(2) they're trying to compensate for the overestimation of
> 	    nestloop indexscans.
> The changes I'm proposing should help with (2).

Right.  What I'm saying is that (1) should be derived from 
estimated_cache_size and DBSIZE, not by setting an additional GUC.

> > 	(1) the frequency with which that index is accessed, modified by
> > 	(2) whether the index is a fraction of available ram, or larger than
> > ram e) the probability that the relevant table pages are cached in
> > memory, determined by the same two factors.
>
> These would all be nice things to know, but I'm afraid it's pie in the
> sky.  We have no reasonable way to get those numbers.  (And if we could
> get them, there would be another set of problems, namely plan
> instability: the planner's choices would become very difficult to
> reproduce.)

Hmmm ... I think those numbers are possible to infer, at least on a gross 
level.  Whether the cost of calculating them outweighs the benefit of 
having them is another question, resolvable only through some 
experimentation.

> a good place to start.  The immediate problem is to get an
> infrastructure in place that gives us some numbers to apply equations
> to.

No arguments, of course.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

In response to

Responses

pgsql-hackers by date

Next:From: Tom LaneDate: 2006-06-01 21:21:03
Subject: Re: Generalized concept of modules
Previous:From: Martijn van OosterhoutDate: 2006-06-01 20:45:39
Subject: Re: Generalized concept of modules

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group