Re: SP-GiST for ranges based on 2d-mapping and quad-tree

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SP-GiST for ranges based on 2d-mapping and quad-tree
Date: 2012-07-28 21:33:16
Message-ID: 50145A9C.7080400@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 23.07.2012 10:37, Alexander Korotkov wrote:
> On Fri, Jul 20, 2012 at 3:48 PM, Heikki Linnakangas<
> heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> It would be nice to have an introduction, perhaps as a file comment at the
>> top of rangetypes_spgist.c, explaining how the quad tree works. I have a
>> general idea of what a quad tree is, but it's not immediately obvious how
>> it maps to SP-GiST. What is stored on a leaf node and an internal node?
>> What is the 'prefix' (seems to be the centroid)? How are ranges mapped to
>> 2D points? (the function comment of getQuadrant() is a good start for that
>> last one)
>
> I've added some comments at the top of rangetypes_spgist.c.

Thanks, much better.

I think the handling of empty ranges needs some further explanation. If
I understand the code correctly, the root node can contain a centroid
like usual, and empty ranges are placed in the magic 5th quadrant.
Alternatively, the root node has no centroid, and contains only two
subnodes: all empty ranges are placed under one subnode, and non-empty
ranges under the other.

It seems it would be simpler if we always stored empty nodes the latter
way, with a no-centroid root node, and nodes with a centroid would
always only have 4 subnodes. When the first empty range is added to an
index that already contains non-empty values, the choose-function could
return spgSplitTuple to create a new root node that divides the space
into empty and non-empty values. Then again, I don't think it would
actually simplify the code much. Handling the 5th quadrant doesn't
require much code in spg_range_quad_inner_consistent() as it is. So
maybe it's better to leave it the way it is.

Or perhaps we should stipulate that a root node with no centroid can
only contain empty ranges. When you add the first non-empty range, have
choose function return spgSplitTuple, and create a new root node with a
centroid, and the empty ranges in the 5th quadrant.

>> In spg_range_quad_inner_**consistent(), if in->hasPrefix == true, ISTM that
>> in all cases where 'empty' is true, 'which' is set to 0, meaning that there
>> can be no matches in any of the quadrants. In most of the case-branches,
>> you explicitly check for 'empty', but even in the ones where you don't, I
>> think you end up setting which=0 if empty==true. I'm not 100% sure about
>> the RANGESTRAT_ADJACENT case, though. Am I missing something?
>
> Ops., it was a bug: RANGESTRAT_ADJACENT shoud set which=0 if empty==true,
> while RANGESTRAT_CONTAINS and RANGESTRAT_CONTAINED_BY not. Corrected.

Ok. I did some copy-editing, rewording some comments, and fixing
whitespace. Patch attached, hope I didn't break anything.

I think the most difficult part of the patch is the
spg_range_quad_inner_consistent() function. It's hard to understand how
the various strategies are implemented. I started to expand the comments
in for each strategy, explaining how each strategy maps to a bounding
box in the 2D space, but I'm not sure how much that actually helps.
Perhaps it would help to also restructure the code so that you first
have a switch-case statement that maps the scan key to bounding box,
without worrying about the centroid yet, and then check which quadrants
you need to descend to to find points within the box. The
adjacent-strategy is more complicated than a simple bounding-box search,
though.

I'm having trouble understanding how exactly the RANGESTRAT_ADJACENT
works. The geometric interpretation of 'adjacent' is that the scan key
defines two lines, one horizontal and one vertical. Any points that lie
on either of the lines match the query. Looking at the implementation,
it's not clear how the code implements the search for along those two lines.

Also, I wonder if we really need to reconstruct the "previous" value in
a RANGESTRAT_ADJACENT search. ISTM we only need to remember which of the
two lines we are chasing. For example, if you descend to quadrant 2
because there might be a point there that lies on the horizontal line,
but we already know that there can't be any points there lie on the
vertical line, you only need to remember that, not the whole centroid
from the previous level. Does the SP-GiST API require the
"reconstructed" values stored by inner_consistent to be of the correct
datatype, or can it store any Datums in the array?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
range_spgist_quad-0.6-heikki.patch text/x-diff 48.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Janes 2012-07-28 21:46:56 Re: Covering Indexes
Previous Message Jeff Janes 2012-07-28 20:09:39 effective_io_concurrency