From: | Hans-Jürgen Schönig <postgres(at)cybertec(dot)at> |
---|---|
To: | pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | ants(at)cybertec(dot)at, Boszormenyi Zoltan <zb(at)cybertec(dot)at> |
Subject: | WIP: cross column correlation, 2nd shot |
Date: | 2012-03-13 13:56:10 |
Message-ID: | D9D4375E-6B24-4BB2-916A-50208EF15B90@cybertec.at |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
hello,
we (zoli, ants and myself) have put some more effort into our struggle to make cross column estimates better and more reasonable.
this is supposed to fix problems related to columns which are not "independent" (such as "time of the year" and temperature and so on).
this patch is far from ready but does some nice and useful stuff already:
Here's the cross-col patch against todays master branch. Also, a data
generator that creates a few data distributions for testing. Just run
it on an empty db, it will create 3 tables, expdist, normdist and
weigheddist. A good example would be:
EXPLAIN ANALYZE SELECT * FROM weigheddist WHERE a = 'a' AND b = 3; --
estimates 1 rows, actual 0
DROP CROSS COLUMN STATISTICS ON TABLE weigheddist (a, b);
EXPLAIN ANALYZE SELECT * FROM weigheddist WHERE a = 'a' AND b = 3; --
estimates 161k rows
Some notes:
* Currently only exact equality operators on one table are supported.
* The operators = ANY (ie blah IN (1,2,3)), != and IS NULL would be
quite simple to estimate with the current design.
* Range ops (var BETWEEN c1 AND c2) would require decent
multi-dimensional histogram construction. The previous MCV based
bucket selection mechanism created pretty useless histograms for cases
where there are more values than mcv buckets (should most cases where
range ops are useful). I found some papers with algorithms for good
multidimensional histogram construction.
* I'll look if there are any decent algorithms for "functional dependency"
stats (given tuples t1 and t2, calculate P(t1.b = t2.b|t1.a = t2.a)),
seems like a very useful stat to estimate joins.
* The stats catalog design seems top hacky, probably needs a good
rethink. Maybe a separate catalog for multi attribute stats. I'd like
to work out a decent roadmap where this feature is going before
redesigning it, otherwise multi table stats will require another
redesign.
* A decent set of real-worldish usecases would be really good to
double check if I'm on the right track.
one more thing:
ants has come up with the idea of making the stats collecting engine for more complicated multi-table-cross correlation plugable; we should ship a default collector but allow people to put in some other functionality if needed (could make sense for some not-so-easy set of data and some nasty corner cases). any thoughts on that one?
many thanks,
hans
Attachment | Content-Type | Size |
---|---|---|
cross-col-stats.v1.patch | application/octet-stream | 169.4 KB |
generate_data.py | text/x-python-script | 2.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Kevin Grittner | 2012-03-13 13:59:31 | Re: patch for parallel pg_dump |
Previous Message | Robert Haas | 2012-03-13 13:53:13 | Re: Measuring relation free space |