Re: Multidimensional Histograms

From: Alexander Cheshev <alex(dot)cheshev(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multidimensional Histograms
Date: 2024-01-07 22:55:01
Message-ID: CAN_hQmsNMK-8R2ier9prLXvJpTQ3bdke5D0KhP+QSY07giobeA@mail.gmail.com
Lists: pgsql-hackers

Hi Tomas,

> The thing I was proposing is that it should be possible to build
> histograms with bins adapted to density in the given region. With
> smaller buckets in areas with high density. So that queries intersect
> with fewer buckets in low-density parts of the histogram.

This is how an Equi-Depth Histogram works. Postgres has smaller buckets
in areas with high density:

values[(i * (nvals - 1)) / (num_hist - 1)]
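
For illustration, here is a minimal sketch of the idea behind that
expression (the function name and signature are made up, this is not
the actual analyze.c code): given a sorted sample, pick num_hist
boundary values at equal-frequency intervals, so dense regions end up
with narrow buckets and sparse regions with wide ones.

static void
pick_equidepth_bounds(const double *values, int nvals,
                      double *bounds, int num_hist)
{
    int     i;

    /*
     * Boundary i is the sample value at rank i*(nvals-1)/(num_hist-1),
     * so adjacent boundaries enclose roughly nvals/(num_hist-1) tuples
     * each (assumes num_hist >= 2).  Where the data is dense, adjacent
     * ranks map to nearby values and the bucket is narrow; where it is
     * sparse, the bucket is wide.
     */
    for (i = 0; i < num_hist; i++)
        bounds[i] = values[(i * (nvals - 1)) / (num_hist - 1)];
}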

> I don't recall the details of the MHIST-2 scheme, but it's true
> calculating a "perfect" V-optimal histogram would be quadratic, O(N^2*B).

In M-dimensional space building a "perfect" V-Optimal Histogram is an
NP-hard problem. In other words, it is not possible to build it in
polynomial time (unless P = NP). How did you come up with the quadratic
estimate?

> But that's exactly why greedy/approximate algorithms exist. Yes, it's
> not going to produce the optimal V-optimal histogram, but so what?

A greedy/approximate algorithm has time complexity O(M*N*B), where M is
the number of dimensions. MHIST-2 is such a greedy/approximate
algorithm.
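
To make the O(M*N*B) shape visible, here is a rough sketch of the
greedy loop (the Split struct, find_best_split and apply_split are
invented names standing in for the real MHIST-2 machinery in [2]):

typedef struct Split
{
    int     bucket;             /* which bucket to split             */
    int     dim;                /* along which dimension             */
    double  point;              /* split point on that dimension     */
    double  sse_reduction;      /* SSE gain from making this split   */
} Split;

int     nbuckets = 1;           /* one bucket covering all N tuples  */
double  total_sse = initial_sse;

while (nbuckets < B && total_sse > E)
{
    /*
     * Scan every bucket along every dimension for the binary split
     * with the largest SSE reduction.  Evaluating the candidate
     * splits of one bucket is linear in its tuple count, so one
     * round costs O(M*N) in total.
     */
    Split   best = find_best_split(buckets, nbuckets);

    if (best.sse_reduction <= 0.0)
        break;                  /* no split helps any further        */

    apply_split(buckets, &nbuckets, &best);     /* one bucket -> two */
    total_sse -= best.sse_reduction;
}

/* Up to B - 1 rounds at O(M*N) each => O(M*N*B) overall. */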

> And how does this compare to the approximate/greedy algorithms, both in
> terms of construction time and accuracy?

The time complexity of building an Equi-Depth Histogram, O(M*N*logN),
has no dependence on B.
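
To make the difference concrete with the numbers from the quoted thread
below (M = 3, B = 1000, N = 300*B = 300000): the greedy build needs
roughly M*N*B = 3 * 300000 * 1000 = 0.9*10^9 operations, while the
Equi-Depth build needs roughly M*N*log2(N) = 3 * 300000 * 18.2 =
1.6*10^7, more than 50 times fewer, and that figure does not grow with
B (as long as B is much smaller than N).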

> But all of this can apply to histograms in general, no? It's not somehow
> special to equi-depth histograms, a v-optimal histogram could be stored
> as an r-tree etc.

I agree.

Regards,
Alexander Cheshev

On Sun, 7 Jan 2024 at 00:54, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>
>
> On 1/6/24 01:00, Alexander Cheshev wrote:
> > Hi Tomas,
> >
> >> Another reason was that the algorithm described in the two papers you
> >> reference (1988 paper by DeWitt and the evaluation by Carlson and
> >> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> >> processes the columns one by one - first builds buckets on column "a",
> >> then splits each bucket into buckets on "b". So it's not symmetrical,
> >> and it results in lower accuracy compared to an "ideal" histogram with
> >> the same number of buckets (especially for the dimensions split early).
> >
> > As stated in [1] the Sum of Squared Errors (SSE) is one of the most
> > natural error metrics. An Equi-Depth Histogram is not ideal because
> > it doesn't minimize the SSE. On the other hand a V-Optimal Histogram
> > does minimize the SSE and from this point of view can be considered
> > an ideal solution.
> >
> >> This does indeed produce an equi-depth histogram, which seems nice, but
> >> the mesh is regular in such a way that any value of the first dimension
> >> intersects all buckets on the second dimension. So for example with a
> >> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> >> 100 buckets on "b", each representing 1/10000 of tuples. But we don't
> >> know how the tuples are distributed within the bucket - so we end up
> >> using 0.5 of the bucket as an unbiased estimate, but that can easily
> >> add up in the wrong dimension.
> >
> > Suppose that we have a query box a=X and we know how the data is
> > distributed in the buckets. Let's consider only the buckets which are
> > intersected by the query box a=X. As we know how the data is
> > distributed in the buckets, we can exclude the buckets which contain
> > no tuples intersecting the query box a=X.
> >
> > As we store buckets with no information about the data distribution,
> > we have to make reasonable assumptions. If we keep the buckets
> > relatively small then we can assume that each bucket has a uniform
> > distribution.
> >
> > What I am trying to say is that the problem which you pointed out
> > comes from the fact that we store buckets with no information about
> > the data distribution. Even in the one-dimensional case we have to
> > assume how the data is distributed within buckets. By the way,
> > Postgres assumes that the data is uniformly distributed within
> > buckets.
> >
>
> It's not just what Postgres assumes, the assumption of bucket
> uniformity is somewhat inherent to the whole concept of a histogram.
> Yes, maybe we could keep some "distribution" info about each bucket,
> but then maybe we could simply build a histogram with more bins
> occupying the same space?
>
> The thing I was proposing is that it should be possible to build
> histograms with bins adapted to density in the given region. With
> smaller buckets in areas with high density. So that queries intersect
> with fewer buckets in low-density parts of the histogram.
>
> >> However, this is not the only way to build an equi-depth histogram,
> >> there are ways to make it more symmetrical. Also, it's not clear
> >> equi-depth histograms are ideal with multiple columns. There are papers
> >> proposing various other types of histograms (using different criteria to
> >> build buckets optimizing a different metric). The most interesting one
> >> seems to be V-Optimal histograms - see for example papers [1], [2], [3],
> >> [4], [5] and [6]. I'm sure there are more. The drawback of course is
> >> that it's more expensive to build such histograms.
> >
> > Tomas, thank you for sharing your ideas regarding the V-Optimal
> > Histogram with me. I read through the papers which you gave me and
> > came to the following conclusions.
> >
> > The problem can be formulated like this. We have N tuples in
> > M-dimensional space. We need to partition the space into buckets
> > iteratively until the SSE is less than E or we reach the limit of B
> > buckets.
> >
>
> Yes, although v-optimal histograms minimize the variance of
> frequencies. Not sure if that's what you mean by SSE.
>
> > In the case of M-dimensional space it seems to me like an NP-hard
> > problem. A greedy heuristic, MHIST-2, is proposed in [2]. As a
> > preliminary step we sort the N tuples in ascending order. Then we
> > iteratively select the bucket which leads to the largest SSE
> > reduction and split it into two parts. We repeat the process until
> > the SSE is less than E or we reach the limit of B buckets.
> >
>
> I don't recall all the details of the MHIST-2 algorithm, but this sounds
> about right. Yes, building the optimal histogram would be NP-hard, so
> we'd have to use some approximate / greedy algorithm.
>
> > If we assume that B is significantly less than N then the time
> > complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
> > equals 3, B equals 1000 and N equals 300*B; then it will take
> > slightly over 0.9*10^9 iterations to build a V-Optimal Histogram.
> >
> > You can see that we have to keep B as low as possible in order to
> > build a V-Optimal Histogram in feasible time. And here is a paradox.
> > On the one hand we use a V-Optimal Histogram in order to minimize the
> > SSE, but on the other hand we have to keep B as low as possible,
> > which eventually leads to an increase in the SSE.
> >
>
> I don't recall the details of the MHIST-2 scheme, but it's true
> calculating a "perfect" V-optimal histogram would be quadratic, O(N^2*B).
>
> But that's exactly why greedy/approximate algorithms exist. Yes, it's
> not going to produce the optimal V-optimal histogram, but so what?
>
> > On the other hand, the time complexity required to build an
> > Equi-Depth Histogram doesn't depend on B and can be estimated as
> > O(M*N*logN). The SSE can be arbitrarily reduced by increasing B,
> > which in turn is limited only by the available storage. Experimental
> > results show a low error metric [3].
> >
>
> And how does this compare to the approximate/greedy algorithms, both in
> terms of construction time and accuracy?
>
> The [3] paper does not compare those, ofc, and I'm somewhat skeptical
> about the results in that paper. Not that they'd be wrong, but AFAICS
> the assumptions are quite simplistic and well-suited for that particular
> type of histogram.
>
> It's far from clear how it would perform for less "smooth" data,
> non-numeric data, etc.
>
> > In an Equi-Depth Histogram a bucket is represented by two vectors.
> > The first vector points to the left bottom corner of the bucket and
> > the other one points to the right top corner of the bucket. Thus the
> > space complexity of an Equi-Depth Histogram can be estimated as
> > 2*integer_size*M*B. Assume that M equals 3, B equals 1000 and
> > integer_size equals 4 bytes; then the Equi-Depth Histogram will
> > occupy 24000 bytes.
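> >
> > For illustration, the per-bucket storage could look something like
> > this (a sketch with made-up names; int32 and a compile-time M are
> > assumptions for the example):
> >
> > #define M 3                     /* number of dimensions            */
> >
> > typedef struct EquiDepthBucket
> > {
> >     int32   lower[M];           /* left bottom corner              */
> >     int32   upper[M];           /* right top corner                */
> > } EquiDepthBucket;              /* 2 * 4 * M = 24 bytes per bucket,
> >                                  * 24 * B = 24000 bytes for B=1000 */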
> >
> > If a bucket is partially intersected by a query box then we assume
> > that the data has a uniform distribution inside the bucket. It is a
> > reasonable assumption if B is relatively large.
> >
> > In order to efficiently find the buckets which intersect a query box
> > we can store the Equi-Depth Histogram in an R-tree, as proposed in
> > [3]. On average it takes O(logB) iterations to find the buckets which
> > intersect a query box. As the storage requirements are dominated by
> > the leaf nodes, we can assume that the R-tree takes slightly more
> > than 2*integer_size*M*B bytes.
> >
>
> But all of this can apply to histograms in general, no? It's not somehow
> special to equi-depth histograms, a v-optimal histogram could be stored
> as an r-tree etc.
>
> >> IIRC the patch tried to do something like V-optimal histograms, but
> >> using a greedy algorithm and not the exhaustive stuff described in the
> >> various papers.
> >
> > We should only consider computationally tractable solutions. In the
> > one-dimensional case a V-Optimal Histogram is probably a good
> > solution, but in the multi-dimensional case I would only consider
> > Equi-Width or Equi-Depth Histograms. As stated in multiple papers, an
> > Equi-Depth Histogram proves to be more accurate than an Equi-Width
> > Histogram. By the way, Postgres uses something like an Equi-Width
> > Histogram.
> >
>
> Sure, but AFAIK the greedy/approximate algorithms are not intractable.
>
> And at some point it becomes a tradeoff between accuracy of estimates
> and resources to build the histogram. Papers from ~1988 are great, but
> maybe not sufficient to make such decisions now.
>
> >> FWIW I did not intend to reject the idea of adding multi-dimensional
> >> histograms, but rather to provide some insight into the history of the
> >> past attempt, and also point some weaknesses of the algorithm described
> >> in the 1988 paper. If you choose to work on this, I can do a review etc.
> >
> > Thank you very much, Tomas. I am new to the community and I
> > definitely didn't expect to have such a warm welcome.
> >
> > As I indicated above, an Equi-Depth Histogram proves to be more
> > accurate than an Equi-Width Histogram, and both have the same time
> > and space requirements. Postgres uses some sort of Equi-Width
> > Histogram. Suppose that:
> > * I create a patch which replaces the Equi-Width Histogram with an
> > Equi-Depth Histogram, but only in the 1-dimensional case.
> > * I show experimental results which demonstrate an improvement in
> > selectivity estimation.
> > Then will the patch be accepted by the community?
> >
> > If the above patch is accepted by the community then I will proceed
> > further with the M-dimensional Equi-Depth Histogram...
> >
>
> I find it very unlikely we want (or even need) to significantly rework
> the 1-D histograms we already have. AFAICS the current histograms work
> pretty well, and if better accuracy is needed, it's easy/efficient to
> just increase the statistics target. I'm sure there are various
> estimation issues, but I'm not aware of any that'd be solved simply by
> using a different 1-D histogram.
>
> I'm not going to reject that outright - but I think the bar you'd need
> to clear to justify such a change is damn high. Pretty much everyone is
> using the current histograms, which makes it a very sensitive area.
> You'll need to show that it helps in practice, without causing harm ...
>
> It's an interesting area for experiments, no doubt about it. And if
> you choose to explore it, that's fine. But it's better to be aware it
> may not end with a commit.
>
>
> For the multi-dimensional case, I propose we first try to experiment
> with the various algorithms, and figure out what works etc. Maybe
> implementing them in python or something would be easier than C.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
