Re: BUG #6399: knngist sometimes returns tuples in incorrect order

From: yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi)
To: heikki(dot)linnakangas(at)enterprisedb(dot)com
Cc: pgsql-bugs(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: BUG #6399: knngist sometimes returns tuples in incorrect order
Date: 2012-01-20 03:40:09
Message-ID: 20120120034009.EA32014A16D@mail.netbsd.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

hi,

> On 19.01.2012 06:35, YAMAMOTO Takashi wrote:
>>> On 18.01.2012 14:07, Heikki Linnakangas wrote:
>>>> For 9.2, I think we should change gist so
>>>> that the user-defined distance function can return any scalar data type,
>>>> not just float8 (as long as it has b-tree comparison operators).
>>>
>>> I took a shot at doing that. Patch attached. It needs some cleanup; I
>>> think we'll need to bump the version of the btree_gist extension, and I
>>> only changed the btree_int8 distance function to do that, even though it
>>> would now make a lot of sense to adjust the others, too. Also, I think
>>> it'll leak memory (or crash..) if the distance data type is pass-by-ref.
>>>
>>> If someone has a test suite at hand for performance testing this, that
>>> would be good. Now that I'm calling the b-tree comparison function for
>>> every comparison operation in the red-black tree, instead of a direct
>>> float8<= instruction, this could be quite a bit slower. That could be
>>> mitigated somewhat by using the sort-support stuff Tom committed recently.
>>>
>>> If we bend things a bit, this might be back-patchable to 9.1. We can't
>>> add a new am support function easily, that would require a catalog
>>> update to increment pg_am.amsupport entry, but we could hardwire the
>>> support for int8 distances, like I hardwired the knowledge of float8's
>>> comparsion function in the attached patch.
>>
>> thanks for taking a look quickly.
>>
>> i have some questions:
>>
>> doesn't gbt_int8_distance overflow?
>
> Yes, good point.
>
>> probably result += INT64_MIN so that it fits to int8 keeping the order?
>
> It can still overflow. The maximum distance between two int64s is (2^63
> - 1) - (-2^63) = 2^64 - 1. That's the distance between the smallest
> possible int64 (-2^63) and the greatest one (2^63 - 1)

max distance: 2^64-1
min distance: 0

if you offset them by INT64_MIN,

max: 2^64-1+INT64_MIN = INT64_MAX (well, for many of C implementations)
min: INT64_MIN

so it should fit to int8.

but anyway, i agree that being compatible with the <-> operator is
more important.

>
> . The int8_dist function, which implements the <-> operator, can also
> overflow, but it checks for it and throws an error:
>
> postgres=# SELECT 9223372036854775807 <-> -9223372036854775808;
> ERROR: bigint out of range
>
> We could add such an overflow check to gbt_int8_distance(), but I wonder
> if it would have some surprising effects. GiST might calculate distance
> to a value that's not in the final result set, so you'd get an error
> that you wouldn't get without the index. Then again, the alternative
> plan is to fetch the matching rows and sort, and the distance calculated
> for the sort will also overflow.
>
> Another solution for gbt_int8_distance() would be to treat the distance
> as an unsigned value, changing the comparison operation accordingly.

it sounds cleaner.

>
> I see that the distance functions for int4 and int2 can also overflow.
> The gist support function doesn't have the same problem with float8
> representation, but I think it would be good idea to change int4_dist to
> return a bigint, so that it couldn't overflow (and int2_dist to return
> an int4).

do you mean to changing the type of <-> operators so that they can't overflow?
i think it's good if compatibility is not a problem.
although it isn't possible for "biggest" types like numeric, it isn't a
regression as they can't be supported by the current knn gist anyway.

>
>> isn't strategy number necessary to find out the sorting semantics for
>> the operator's sortfamily? (your patch doesn't change it, but...)
>
> Sorry, I didn't understand that. Can you elaborate?

the following is my understanding, which can be entirely wrong as i'm just
learning the code.

- an index should yield an order compatible with the result of
amopopr sorted with amopsortfamily.
- the result of the distance support function is normally an equivalent of
the result of amopopr. in that case, its result should be sorted in
a semantically compatible way with amopsortfamily.
- an index can have more than one ordering ops. their amopsortfamily need
not to be the same. in that case, support functions need to check the
strategy number to distinguish them.

so i thought your distance_cmp support function might want to take a strategy
number as a distance support function does.

YAMAMOTO Takashi

>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message mark.phillips 2012-01-20 21:31:03 BUG #6404: postgres account not created during unattended install
Previous Message Peter Geoghegan 2012-01-20 03:07:19 Re: Documentation bug: reference to checkpoint activity in bgwriter