Re: two dimensional statistics in Postgres

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Katharina Büchse <katharina(dot)buechse(at)uni-jena(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: two dimensional statistics in Postgres
Date: 2014-11-06 10:50:57
Message-ID: 545B5291.6080807@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 06/11/14 23:15, Katharina Büchse wrote:
> Hi,
>
> I'm a phd-student at the university of Jena, Thüringen, Germany, in
> the field of data bases, more accurate query optimization.
> I want to implement a system in PostgreSQL that detects column
> correlations and creates statistical data about correlated columns for
> the optimizer. Therefore I need to store two dimensional statistics
> (especially two dimensional histograms) in PostgreSQL.
> I had a look at the description of "WIP: multivariate statistics /
> proof of concept", which looks really promising, I guess these
> statistics are based on scans of the data? For my system I need both
> -- statistical data based on table scans (actually, samples are
> enough) and those based on query feedback. Query feedback (tuple
> counts and, speaking a little inaccurately, the where-part of the
> query itself) needs to be extracted and there needs to be a decision
> for the optimizer, when to take multivariate statistics and when to
> use the one dimensional ones. Oracle in this case just disables one
> dimensional histograms if there is already a multidimensional
> histogram, but this is not always useful, especially in the case of a
> feedback based histogram (which might not cover the whole data space).
> I want to use both kinds of histograms because correlations might
> occur only in parts of the data. In this case a histogram based on a
> sample of the whole table might not get the point and wouldn't help
> for the part of the data the user seems to be interested in.
> There are special data structures for storing multidimensional
> histograms based on feedback and I already tried to implement one of
> these in C. In the case of two dimensions they are of course not "for
> free" (one dimensional would be much cheaper), but based on the
> principle of maximum entropy they deliver really good results. I
> decided for only two dimensions because in this case we have the best
> proportion of cost and benefit when searching for correlation (here
> I'm relying on tests that were made in DB2 within a project called
> CORDS which detects correlations even between different tables).
>
> I'd be grateful for any advices and discussions.
> Regards,
>
> Katharina
>
>
Could you store a 2 dimensional histogram in a one dimensional array:
A[z] = value, where z = col * rowSize + row (zero starting index)?

Cheers,
Gavin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andreas Karlsson 2014-11-06 10:51:45 Re: B-Tree index builds, CLUSTER, and sortsupport
Previous Message Etsuro Fujita 2014-11-06 10:44:45 Typo in comment