Re: Multidimensional Histograms

From: Alexander Cheshev <alex(dot)cheshev(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multidimensional Histograms
Date: 2024-01-06 09:08:23
Message-ID: CAN_hQmvq-1XWZuxAEA+MHy0XGvmNxTc7002yFjmH2=CarrcQFg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tomas,

I am sorry I didn't look into the code carefully. Indeed Postgres uses
Equi-Depth Histogram:

delta = (nvals - 1) / (num_hist - 1);

Regards,
Alexander Cheshev

On Sat, 6 Jan 2024 at 01:00, Alexander Cheshev <alex(dot)cheshev(at)gmail(dot)com> wrote:
>
> Hi Tomas,
>
> > Another reason was that the algorithm described in the two papers you
> > reference (1988 paper by DeWitt and the evaluation by Carlson and
> > Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> > processes the columns one by one - first build bucket on column "a",
> > then splits each bucket into buckets on "b". So it's not symmetrical,
> > and it results in lower accuracy compared to an "ideal" histogram with
> > the same number of buckets (especially for the dimensions split early).
>
> As stated in [1] Sum Square Error (SSE) is one of the most natural
> error metrics. Equi-Depth Histogram is not ideal because it doesn't
> minimize SSE. On the other hand V-Optimal Histogram indeed minimizes
> SSE and from this point of view can be considered as an ideal
> solution.
>
> > This does indeed produce an equi-depth histogram, which seems nice, but
> > the mesh is regular in such a way that any value of the first dimension
> > intersects all buckets on the second dimension. So for example with a
> > histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> > 100 buckets on "b", each representing 1/10000 of tuples. But we don't
> > know how the tuples are distributed in the tuple - so we end up using
> > 0.5 of the bucket as unbiased estimate, but that can easily add-up in
> > the wrong dimension.
>
> Suppose that we have a query box a=X and we know how data is
> distributed in buckets. Lets consider only the buckets which are
> intersected by the query box a=X. As we know how data is distributes
> in buckets we can exclude the buckets which have no tuples which
> intersect the query box a=X.
>
> As we store buckets with no information about data distribution we
> have to make reasonable assumptions. If we keep buckets relativly
> small then we can assume that buckets have uniform distribution.
>
> What I am trying to say is that the problem which you pointed out
> comes from the fact that we store buckets with no information about
> data distribution. Even in one dimensional case we have to assume how
> data is distributed in buckets. By the way Postgres assumes that data
> has uniform distribution in buckets.
>
> > However, this is not the only way to build an equi-depth histogram,
> > there are ways to make it more symmetrical. Also, it's not clear
> > equi-depth histograms are ideal with multiple columns. There are papers
> > proposing various other types of histograms (using different criteria to
> > build buckets optimizing a different metric). The most interesting one
> > seems to be V-Optimal histograms - see for example papers [1], [2], [3],
> > [4], [5] and [6]. I'm sure there are more. The drawback of course is
> > that it's more expensive to build such histograms.
>
> Tomas thank you for shearing with me your ideas regarding V-Optimal
> Histogram. I read through the papers which you gave me and came up
> with the following conclusion.
>
> The problem can be formulated like this. We have N tuples in
> M-dimensional space. We need to partition space into buckets
> iteratively until SSE is less than E or we reach the limit of buckets
> B.
>
> In the case of M-dimensional space it seems to me like an NP-hard
> problem. A greedy heuristic MHIST-2 is proposed in [2]. Preliminary we
> sort N tuples in ascending order. Then we iteratively select a bucket
> which leads to the largest SSE reduction and split it into two parts.
> We repeat the process until SSE is less than E or we reach the limit
> of buckets B.
>
> If we assume that B is significantly less than N then the time
> complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
> equals 3, B equals 1000 and N equals 300*B then it will take slightly
> over 0.9*10^9 iterations to build a V-Optimal Histogram.
>
> You can see that we have to keep B as low as possible in order to
> build V-Optimal Histogram in feasible time. And here is a paradox.
> From one side we use V-Optimal Histogram in order to minimize SSE but
> from the other side we have to keep B as low as possible which
> eventually leads to increase in SSE.
>
> On the other hand time complexity required to build an Equi-Depth
> Histogram doesn't depend on B and can be estimated as O(M*N*logN). SSE
> can be arbitrarily reduced by increasing B which in turn is only
> limited by the storage limit. Experimental results show low error
> metric [3].
>
> In Equi-Depth Histogram a bucket is represented by two vectors. The
> first vector points to the left bottom corner of the bucket and the
> other one point to the right top corner of the bucket. Thus space
> complexity of Equi-Depth Histogram can be estimated as
> 2*integer_size*M*B. Assume that M equals 3, B equals 1000 and
> integer_size equals 4 bytes then Equi-Depth Histogram will ocupy 24000
> bytes.
>
> If a bucket is partially intersected by a query box then we assume
> that data has uniform distribution inside of the bucket. It is a
> reasonable assumption if B is relativly large.
>
> In order to efficianly find buckets which intersect a query box we can
> store Equi-Depth Histogram in R-tree as proposed in [3]. On average it
> takes O(logB) iterations to find buckets which intersect a query box.
> As storage requirements are dominated by leaf nodes we can assume that
> it takes slightly more than 2*integer_size*M*B.
>
> > IIRC the patch tried to do something like V-optimal histograms, but
> > using a greedy algorithm and not the exhaustive stuff described in the
> > various papers.
>
> We should only consider computationally tackable solutions. In one
> dimensional case V-Optimal Histogram is probably a good solution but
> in multi-dimensional case I would only consider Equi-Width or
> Equi-Depth Histograms. As stated in multiple papers Equi-Depth
> Histogram proves to be more accurate than Equi-Width Histogram. By the
> way Postgres uses something like Equi-Width Histogram.
>
> > FWIW I did not intend to reject the idea of adding multi-dimensional
> > histograms, but rather to provide some insight into the history of the
> > past attempt, and also point some weaknesses of the algorithm described
> > in the 1988 paper. If you choose to work on this, I can do a review etc.
>
> Thank you very much Tomas. I am new in the community and I definitely
> didn't expect to have such a warm welcome.
>
> As I indicated above Equi-Depth Histogram proves to be more accurate
> than Equi-Width Histogram and both have the same time and space
> requirements. Postgres uses some sort of Equi-Width Histogram. Suppose
> that:
> * I will create a patch which will replace Equi-Width Histogram with
> Equi-Depth Histogram but only in 1-dimensional case.
> * I will show experimental results which will demonstrate improvement
> of selectivity estimation.
> Then will the path be accepted by the community?
>
> If the above path is accepted by the community then I will proceed
> further with M-dimensional Equi-Depth Histogram...
>
>
> [1] https://www.vldb.org/conf/1998/p275.pdf
> [2] https://www.vldb.org/conf/1997/P486.PDF
> [3] https://dl.acm.org/doi/pdf/10.1145/50202.50205
>
> Regards,
> Alexander Cheshev
>
> On Thu, 28 Dec 2023 at 15:25, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >
> > On 12/27/23 22:19, Tomas Vondra wrote:
> > > Hello Alexander,
> > >
> > > We actually had histograms in the early patches adding multivariate
> > > statistics [1]. We however ended up removing histograms and only kept
> > > the simpler types, for a couple reasons.
> > >
> > > It might be worth going through the discussion - I'm sure one of the
> > > reasons was we needed to limit the scope of the patch, and histograms
> > > were much more complex and possibly less useful compared to the other
> > > statistics types.
> > >
> > > ...
> >
> > FWIW I did not intend to reject the idea of adding multi-dimensional
> > histograms, but rather to provide some insight into the history of the
> > past attempt, and also point some weaknesses of the algorithm described
> > in the 1988 paper. If you choose to work on this, I can do a review etc.
> >
> > regards
> >
> > --
> > Tomas Vondra
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Will Mortensen 2024-01-06 10:57:04 Re: Exposing the lock manager's WaitForLockers() to SQL
Previous Message Bertrand Drouvot 2024-01-06 09:03:52 Re: verify predefined LWLocks have entries in wait_event_names.txt