WIP: cross column correlation, 2nd shot

From: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: ants(at)cybertec(dot)at, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject: WIP: cross column correlation, 2nd shot
Date: 2012-03-13 13:56:10
Message-ID: D9D4375E-6B24-4BB2-916A-50208EF15B90@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

hello,

we (zoli, ants and myself) have put some more effort into our struggle to make cross column estimates better and more reasonable.
this is supposed to fix problems related to columns which are not "independent" (such as "time of the year" and temperature and so on).
this patch is far from ready but does some nice and useful stuff already:

Here's the cross-col patch against todays master branch. Also, a data
generator that creates a few data distributions for testing. Just run
it on an empty db, it will create 3 tables, expdist, normdist and
weigheddist. A good example would be:

EXPLAIN ANALYZE SELECT * FROM weigheddist WHERE a = 'a' AND b = 3; --
estimates 1 rows, actual 0
DROP CROSS COLUMN STATISTICS ON TABLE weigheddist (a, b);
EXPLAIN ANALYZE SELECT * FROM weigheddist WHERE a = 'a' AND b = 3; --
estimates 161k rows

Some notes:
* Currently only exact equality operators on one table are supported.
* The operators = ANY (ie blah IN (1,2,3)), != and IS NULL would be
quite simple to estimate with the current design.
* Range ops (var BETWEEN c1 AND c2) would require decent
multi-dimensional histogram construction. The previous MCV based
bucket selection mechanism created pretty useless histograms for cases
where there are more values than mcv buckets (should most cases where
range ops are useful). I found some papers with algorithms for good
multidimensional histogram construction.
* I'll look if there are any decent algorithms for "functional dependency"
stats (given tuples t1 and t2, calculate P(t1.b = t2.b|t1.a = t2.a)),
seems like a very useful stat to estimate joins.
* The stats catalog design seems top hacky, probably needs a good
rethink. Maybe a separate catalog for multi attribute stats. I'd like
to work out a decent roadmap where this feature is going before
redesigning it, otherwise multi table stats will require another
redesign.
* A decent set of real-worldish usecases would be really good to
double check if I'm on the right track.

one more thing:
ants has come up with the idea of making the stats collecting engine for more complicated multi-table-cross correlation plugable; we should ship a default collector but allow people to put in some other functionality if needed (could make sense for some not-so-easy set of data and some nasty corner cases). any thoughts on that one?

many thanks,

hans

Attachment Content-Type Size
cross-col-stats.v1.patch application/octet-stream 169.4 KB
generate_data.py text/x-python-script 2.8 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2012-03-13 13:59:31 Re: patch for parallel pg_dump
Previous Message Robert Haas 2012-03-13 13:53:13 Re: Measuring relation free space