Quick Links

WIP: cross column correlation, 2nd shot

From:	Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To:	pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	ants(at)cybertec(dot)at, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	WIP: cross column correlation, 2nd shot
Date:	2012-03-13 13:56:10
Message-ID:	D9D4375E-6B24-4BB2-916A-50208EF15B90@cybertec.at
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

hello,

we (zoli, ants and myself) have put some more effort into our struggle to make cross column estimates better and more reasonable.
this is supposed to fix problems related to columns which are not "independent" (such as "time of the year" and temperature and so on).
this patch is far from ready but does some nice and useful stuff already:

Here's the cross-col patch against todays master branch. Also, a data
generator that creates a few data distributions for testing. Just run
it on an empty db, it will create 3 tables, expdist, normdist and
weigheddist. A good example would be:

EXPLAIN ANALYZE SELECT * FROM weigheddist WHERE a = 'a' AND b = 3; --
estimates 1 rows, actual 0
DROP CROSS COLUMN STATISTICS ON TABLE weigheddist (a, b);
EXPLAIN ANALYZE SELECT * FROM weigheddist WHERE a = 'a' AND b = 3; --
estimates 161k rows

Some notes:
* Currently only exact equality operators on one table are supported.
* The operators = ANY (ie blah IN (1,2,3)), != and IS NULL would be
quite simple to estimate with the current design.
* Range ops (var BETWEEN c1 AND c2) would require decent
multi-dimensional histogram construction. The previous MCV based
bucket selection mechanism created pretty useless histograms for cases
where there are more values than mcv buckets (should most cases where
range ops are useful). I found some papers with algorithms for good
multidimensional histogram construction.
* I'll look if there are any decent algorithms for "functional dependency"
stats (given tuples t1 and t2, calculate P(t1.b = t2.b|t1.a = t2.a)),
seems like a very useful stat to estimate joins.
* The stats catalog design seems top hacky, probably needs a good
rethink. Maybe a separate catalog for multi attribute stats. I'd like
to work out a decent roadmap where this feature is going before
redesigning it, otherwise multi table stats will require another
redesign.
* A decent set of real-worldish usecases would be really good to
double check if I'm on the right track.

one more thing:
ants has come up with the idea of making the stats collecting engine for more complicated multi-table-cross correlation plugable; we should ship a default collector but allow people to put in some other functionality if needed (could make sense for some not-so-easy set of data and some nasty corner cases). any thoughts on that one?

many thanks,

hans

Attachment	Content-Type	Size
cross-col-stats.v1.patch	application/octet-stream	169.4 KB
generate_data.py	text/x-python-script	2.8 KB

Responses

Re: WIP: cross column correlation, 2nd shot at 2012-03-14 17:32:46 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Kevin Grittner	2012-03-13 13:59:31	Re: patch for parallel pg_dump
Previous Message	Robert Haas	2012-03-13 13:53:13	Re: Measuring relation free space