> a) We already use block based sampling to reduce overhead. If you're
> talking about using the entire block and not just randomly sampled
> tuples from within those blocks then your sample will be biased.
There are actually some really good equations to work with this, estimating
both row correlation and n-distinct based on sampling complete blocks.
See prior discussions on this list on N-distinct.
> 1) You have n^2 possible two-column combinations. That's a lot of
> processing and storage.
Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2.
> 2) It isn't even clear what data you're exactly looking for. Certainly
> "correlation" is just shorthand here and isn't what you're actually
> looking for.
Actually, I'd think that a correlation number estimate (0 = complete
uncorrelated, 1 = completely correlated) would be sufficient to improve
row count estimation significantly, without incurring the vast overhead of
> Those are all valid points, but pretty much orthogonal to what I'm on
> about today. The problems I'm thinking about are seen even when the
> planner's rowcount estimates are dead on (as indeed they mostly were
> in Philippe's example).
OK, I was afraid they were interdependant. You would know better than me.
> Well, it'll still exist as a GUC for the same reasons the other ones are
> GUCs. But I think the main reasons people have needed to twiddle it are
> (1) their database is completely RAM-resident (where RPC
> *should* logically be 1.0), or
> (2) they're trying to compensate for the overestimation of
> nestloop indexscans.
> The changes I'm proposing should help with (2).
Right. What I'm saying is that (1) should be derived from
estimated_cache_size and DBSIZE, not by setting an additional GUC.
> > (1) the frequency with which that index is accessed, modified by
> > (2) whether the index is a fraction of available ram, or larger than
> > ram e) the probability that the relevant table pages are cached in
> > memory, determined by the same two factors.
> These would all be nice things to know, but I'm afraid it's pie in the
> sky. We have no reasonable way to get those numbers. (And if we could
> get them, there would be another set of problems, namely plan
> instability: the planner's choices would become very difficult to
Hmmm ... I think those numbers are possible to infer, at least on a gross
level. Whether the cost of calculating them outweighs the benefit of
having them is another question, resolvable only through some
> a good place to start. The immediate problem is to get an
> infrastructure in place that gives us some numbers to apply equations
No arguments, of course.
PostgreSQL @ Sun
In response to
pgsql-hackers by date
|Next:||From: Tom Lane||Date: 2006-06-01 21:21:03|
|Subject: Re: Generalized concept of modules |
|Previous:||From: Martijn van Oosterhout||Date: 2006-06-01 20:45:39|
|Subject: Re: Generalized concept of modules|