Re: GiST range-contained-by searches versus empty ranges

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: GiST range-contained-by searches versus empty ranges
Date: 2011-11-27 00:26:55
Message-ID: 8565.1322353615@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> I started to wonder why the test in range_gist_consistent_int() for
> RANGESTRAT_CONTAINED_BY was "return true" (ie, search the entire index)
> rather than range_overlaps, which is what is tested in the comparable
> case in rtree_internal_consistent(). The regression tests showed me
> how come: an empty-range index entry should be reported as being
> contained by anything, but range_overlaps won't necessarily find such
> entries. (The rtree case is all right because there is no equivalent of
> an empty range in boxes, circles, or polygons.)

> I am not satisfied with this state of affairs though: range_contained_by
> really ought to be efficiently indexable, but with the current coding
> an index search is nearly useless. Also, so far as I can see, the
> current penalty function allows empty-range entries to be scattered
> basically anywhere in the index, making a search for "= empty" pretty
> inefficient too.

> The first solution that comes to mind is to make the penalty and
> picksplit functions forcibly segregate empty ranges from others, that is
> a split will never put empty ranges together with non-empty ones. Then,
> we can assume that a non-empty internal node doesn't represent any empty
> leaf entries, and avoid descending to it when it doesn't overlap the
> target range. Then the equality-search case could be improved too.

After looking through the code a bit, it seems that this isn't possible
to solve with a localized fix after all. Yes, the opclass-specific
picksplit function can positively guarantee to separate empty ranges
from others when it splits a page ... but it doesn't have control over
what happens when an item is added to an index without a page split
occurring. Consider a scenario where we have a multi-page GiST index
containing no empty ranges, and an empty range item needs to be added.
As the code is currently written, the central GiST code will choose one
of the existing root-page items to add the new item to; there is no way
to persuade it that a new top-level item would be a better idea. What's
more, if we try to do so by having the penalty function return infinity
for adding an empty range to a nonempty item, we'll end up with empty
ranges being randomly added to any/all of the items, since we'll get the
exact same penalty for all the nonempty items. So, far from segregating
the empties, we'll just end up with them cluttered all over the index,
just like now.

This is not really a new problem. There is code in there that is trying
to segregate NULLs from non-NULL entries, and it is equally
broken/ineffective.

I'm inclined to propose that we should add some logic to say that
merging a new item into an existing one is forbidden if the penalty
function returns plus-infinity for the case. If all existing items on a
page return infinity, a new item must be added to the page (possibly
causing a page split) instead of inserting into any existing one.
(Of course, gistpenalty() should be fixed to return infinity, not just a
randomly chosen large value, for the null-and-not-null case.)

Comments?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2011-11-27 03:36:54 pgsql: Move pg_dump memory routines into pg_dumpmem.c/h and restore com
Previous Message Kevin Grittner 2011-11-26 23:58:44 Re: Avoiding repeated snapshot computation