From: | Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz> |
---|---|
To: | Katharina Büchse <katharina(dot)buechse(at)uni-jena(dot)de>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: two dimensional statistics in Postgres |
Date: | 2014-11-06 10:50:57 |
Message-ID: | 545B5291.6080807@archidevsys.co.nz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 06/11/14 23:15, Katharina Büchse wrote:
> Hi,
>
> I'm a phd-student at the university of Jena, Thüringen, Germany, in
> the field of data bases, more accurate query optimization.
> I want to implement a system in PostgreSQL that detects column
> correlations and creates statistical data about correlated columns for
> the optimizer. Therefore I need to store two dimensional statistics
> (especially two dimensional histograms) in PostgreSQL.
> I had a look at the description of "WIP: multivariate statistics /
> proof of concept", which looks really promising, I guess these
> statistics are based on scans of the data? For my system I need both
> -- statistical data based on table scans (actually, samples are
> enough) and those based on query feedback. Query feedback (tuple
> counts and, speaking a little inaccurately, the where-part of the
> query itself) needs to be extracted and there needs to be a decision
> for the optimizer, when to take multivariate statistics and when to
> use the one dimensional ones. Oracle in this case just disables one
> dimensional histograms if there is already a multidimensional
> histogram, but this is not always useful, especially in the case of a
> feedback based histogram (which might not cover the whole data space).
> I want to use both kinds of histograms because correlations might
> occur only in parts of the data. In this case a histogram based on a
> sample of the whole table might not get the point and wouldn't help
> for the part of the data the user seems to be interested in.
> There are special data structures for storing multidimensional
> histograms based on feedback and I already tried to implement one of
> these in C. In the case of two dimensions they are of course not "for
> free" (one dimensional would be much cheaper), but based on the
> principle of maximum entropy they deliver really good results. I
> decided for only two dimensions because in this case we have the best
> proportion of cost and benefit when searching for correlation (here
> I'm relying on tests that were made in DB2 within a project called
> CORDS which detects correlations even between different tables).
>
> I'd be grateful for any advices and discussions.
> Regards,
>
> Katharina
>
>
Could you store a 2 dimensional histogram in a one dimensional array:
A[z] = value, where z = col * rowSize + row (zero starting index)?
Cheers,
Gavin
From | Date | Subject | |
---|---|---|---|
Next Message | Andreas Karlsson | 2014-11-06 10:51:45 | Re: B-Tree index builds, CLUSTER, and sortsupport |
Previous Message | Etsuro Fujita | 2014-11-06 10:44:45 | Typo in comment |