Re: GiST penalty functions [PoC]

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: amborodin(at)acm(dot)org
Cc: Greg Stark <stark(at)mit(dot)edu>, Михаил Бахтерев <mob(at)k(dot)imm(dot)uran(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Михаил Бахтерев <mike(dot)bakhterev(at)gmail(dot)com>
Subject: Re: GiST penalty functions [PoC]
Date: 2016-09-21 07:27:14
Message-ID: 765e8a2e-1a7d-2648-15f9-021e6ff3a433@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 09/14/2016 07:57 PM, Andrew Borodin wrote:
> Here is v6 of cube penalty patch.
> There was a warning about unused edge function under systems without
> __STDC_IEC_559__ defined, this patch fixes it.

Thanks!

Reading through this thread again, I think we got a bit side-tracked
with this float bit-twiddling aspect, and we haven't really discussed
the algorithm itself. So:

On 08/29/2016 09:04 AM, Andrew Borodin wrote:
> Some time ago I put a patch to commitfest that optimizes slightly GiST
> inserts and updates. It's effect in general is quite modest (15% perf
> improved), but for sorted data inserts are 3 times quicker. This
> effect I attribute to mitigation of deficiencies of old split
> algorithm.
> Alexander Korotkov advised me to lookup some features of the RR*-tree.
>
> The RR*-tree differs from combination of Gist + cube extension in two
> algorithms:
> 1. Algorithm to choose subtree for insertion
> 2. Split algorithm

Got a link for a description of the RR*-tree? Would be good to reference
it in the patch comments, too.

> This is the message regarding implementation of first one in GiST. I
> think that both of this algorithms will improve querying performance.
>
> Long story short, there are two problems in choose subtree algorithm.
> 1. GiST chooses subtree via penalty calculation for each entry,
> while RR*-tree employs complicated conditional logic: when there are
> MBBs (Minimum Bounding Box) which do not require extensions, smallest
> of them is chosen; in some cases, entries are chosen not by volume,
> but by theirs margin.
> 2. RR*-tree algorithm jumps through entry comparation non-linearly,
> I think this means that GiST penalty computation function will have to
> access other entries on a page.
>
> In this message I address only first problem. I want to construct a
> penalty function that will choose:
> 1. Entry with a zero volume and smallest margin, that can
> accommodate insertion tuple without extension, if one exists;
> 2. Entry with smallest volume, that can accommodate insertion tuple
> without extension, if one exists;
> 3. Entry with zero volume increase after insert and smallest margin
> increase, if one exists;
> 4. Entry with minimal volume increase after insert.
>
> Current cube behavior inspects only 4th option by returning as a
> penalty (float) MBB’s volume increase. To implement all 4 options, I
> use a hack: order of positive floats corresponds exactly to order of
> integers with same binary representation (I’m not sure this stands for
> every single supported platform). So I use two bits of float as if it
> were integer, and all others are used as shifted bits of corresponding
> float: option 4 – volume increase, 3 - margin increase, 2 – MBB
> volume, 1 – MBB margin. You can check the reinterpretation cast
> function pack_float() in the patch.

If I understand correctly, cases #1 and #3 arise when one of the
dimensions is 0. For example, in a 3D space, if the existing entry is a
rectangle on a plane, with zero-length edge on one of the dimensions,
and the new entry is on the same plane. Case #1 arises if the new entry
falls within that rectangle, and case #3 if it's outside it. Currently,
we treat all such cases as 0-penalty, because the degenerate 0-dimension
causes all the calculated volumes to become zero. So clearly we can do
better, which is what this patch does.

At first blush, I'm surprised that you switch to using the sum of the
edges in those cases. I would expect to ignore the degenerate
0-dimension, and calculate the volume using the other dimensions. So in
a 3D space, you would calculate the increase of the area of the
rectangle (A*B), not the sum of the edges (A+B). And it probably would
be best to take into account how many of the dimensions are zero. So in
a 3D space, if there is an existing line segment that the new point
falls into, and also a rectangle that the new point falls into, you
should prefer the 1-dimensional line segment over the 2-dimensional
rectangle.

I don't know how big a difference that makes in practice. But it seems
odd that if you e.g. have a set of 3D points, but the Z dimension in all
of the points is 0, the algorithm behaves differently than if you had
the exact same points in a 2D space.

(If this is explained in the RR*-tree paper, feel free to just point me
to that and I'll ask again if I have further questions.)

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-09-21 07:32:15 Re: Speedup twophase transactions
Previous Message Vik Fearing 2016-09-21 07:18:09 Re: Quorum commit for multiple synchronous replication.