Quick Links

Re: two dimensional statistics in Postgres

From:	Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To:	Katharina Büchse <katharina(dot)buechse(at)uni-jena(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: two dimensional statistics in Postgres
Date:	2014-11-06 10:50:57
Message-ID:	545B5291.6080807@archidevsys.co.nz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 06/11/14 23:15, Katharina Büchse wrote:
> Hi,
>
> I'm a phd-student at the university of Jena, Thüringen, Germany, in
> the field of data bases, more accurate query optimization.
> I want to implement a system in PostgreSQL that detects column
> correlations and creates statistical data about correlated columns for
> the optimizer. Therefore I need to store two dimensional statistics
> (especially two dimensional histograms) in PostgreSQL.
> I had a look at the description of "WIP: multivariate statistics /
> proof of concept", which looks really promising, I guess these
> statistics are based on scans of the data? For my system I need both
> -- statistical data based on table scans (actually, samples are
> enough) and those based on query feedback. Query feedback (tuple
> counts and, speaking a little inaccurately, the where-part of the
> query itself) needs to be extracted and there needs to be a decision
> for the optimizer, when to take multivariate statistics and when to
> use the one dimensional ones. Oracle in this case just disables one
> dimensional histograms if there is already a multidimensional
> histogram, but this is not always useful, especially in the case of a
> feedback based histogram (which might not cover the whole data space).
> I want to use both kinds of histograms because correlations might
> occur only in parts of the data. In this case a histogram based on a
> sample of the whole table might not get the point and wouldn't help
> for the part of the data the user seems to be interested in.
> There are special data structures for storing multidimensional
> histograms based on feedback and I already tried to implement one of
> these in C. In the case of two dimensions they are of course not "for
> free" (one dimensional would be much cheaper), but based on the
> principle of maximum entropy they deliver really good results. I
> decided for only two dimensions because in this case we have the best
> proportion of cost and benefit when searching for correlation (here
> I'm relying on tests that were made in DB2 within a project called
> CORDS which detects correlations even between different tables).
>
> I'd be grateful for any advices and discussions.
> Regards,
>
> Katharina
>
>
Could you store a 2 dimensional histogram in a one dimensional array:
A[z] = value, where z = col * rowSize + row (zero starting index)?

Cheers,
Gavin

In response to

two dimensional statistics in Postgres at 2014-11-06 10:15:26 from Katharina Büchse

Responses

Re: two dimensional statistics in Postgres at 2014-11-06 10:57:30 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andreas Karlsson	2014-11-06 10:51:45	Re: B-Tree index builds, CLUSTER, and sortsupport
Previous Message	Etsuro Fujita	2014-11-06 10:44:45	Typo in comment