Re: KNN-GiST with recheck

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-12-16 13:37:00
Message-ID: 5490357C.8090800@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/06/2014 12:36 PM, Emre Hasegeli wrote:
>> Thanks. The main question now is design of this patch. Currently, it does
>> all the work inside access method. We already have some discussion of pro
>> and cons of this method. I would like to clarify alternatives now. I can
>> see following way:
>>
>> 1. Implement new executor node which performs sorting by priority queue.
>> Let's call it "Priority queue". I think it should be separate node from
>> "Sort" node. Despite "Priority queue" and "Sort" are essentially similar
>> from user view, they would be completely different in implementation.
>> 2. Implement some interface to transfer distance values from access
>> method to "Priority queue" node.
>
> If we assume that all of them need recheck, maybe it can be done
> without passing distance values.

No, the executor needs the lower-bound distance value, as calculated by
the indexam, so that it knows which tuples it can return from the queue
already. For example, imagine the following items coming from the index:

tuple # lower bound actual distance
1 1 1
2 2 10
3 30 30
4 40 40

After the executor has fetched tuple 2, and re-checked the distance, it
pushes the tuple to the queue. It then fetches tuple 3, with lower bound
30, and it can now immediately return tuple # 2 from the queue. Because
10 < 30, so there cannot be any more tuples coming from the index that
would need to go before tuple # 2.

The executor needs the lower bound as calculated by the index, as well
as the actual distance it calculates itself, to make those decisions.

>> 3. Somehow tell the planner that it could use "Priority queue" in
>> corresponding cases. I see two ways of doing this:
>> - Add flag to operator in opclass indicating that index can only
>> order by lower bound of "col op value", not by "col op value" itself.
>> - Define new relation between operators. Value of one operator could
>> be lower bound for value of another operator. So, planner can
>> put "Priority
>> queue" node when lower bound ordering is possible from index. Also "ALTER
>> OPERATOR" command would be reasonable, so extensions could upgrade.
>
> I think, it would be better to make it a property of the operator
> class. We can add a column to pg_amop or define another value for
> amoppurpose on pg_amop. Syntax can be something like this:
>
> CREATE OPERATOR CLASS circle_ops DEFAULT
> FOR TYPE circle USING gist AS
> OPERATOR 15 <->(circle, point) FOR ORDER BY pg_catalog.float_ops LOWER BOUND;
>
> While looking at it, I realize that current version of the patch does
> not use the sort operator family defined with the operator class. It
> assumes that the distance function will return values compatible with
> the operator. Operator class definition makes me think that there is
> not such an assumption.

Yeah. I also noticed that the type of the argument passed to the
consistent function varies, and doesn't necessarily match that declared
in pg_proc. Looking at gist_point_consistent, the argument type can be a
point, a polygon, or a circle, depending on the "strategy group". But
it's declared as a point in pg_proc.

>> Besides overhead, this way makes significant infrastructural changes. So,
>> it may be over-engineering. However, it's probably more clean and beautiful
>> solution.
>> I would like to get some feedback from people familiar with KNN-GiST like
>> Heikki or Tom. What do you think about this? Any other ideas?
>
> I would be happy to test and review the changes. I think it is nice
> to solve the problem in a generalized way improving the access method
> infrastructure. Definitely, we should have a consensus on the design
> before working on the infrastructure changes.

I took a stab on this. I added the reorder queue directly to the Index
Scan node, rather than adding a whole new node type for it. It seems
reasonable, as Index Scan is responsible for rechecking the quals, too,
even though re-ordering the tuples is more complicated than rechecking
quals.

To recap, the idea is that the index can define an ordering op, even if
it cannot return the tuples in exactly the right order. It is enough
that for each tuple, it returns a lower bound of the expression that is
used for sorting. For example, for "ORDER BY key <-> column", it is
enough that it returns a lower bound of "key <-> column" for each tuple.
The index must return the tuples ordered by the lower bounds. The
executor re-checks the expressions, and re-orders the tuples to the
correct order.

Patch attached. It should be applied on top of my pairing heap patch at
http://www.postgresql.org/message-id/548FFA2C.7060000@vmware.com. Some
caveats:

* The signature of the distance function is unchanged, it doesn't get a
recheck argument. It is just assumed that if the consistent function
sets the recheck flag, then the distance needs to be rechecked as well.
We might want to add the recheck argument, like you Alexander did in
your patch, but it's not important right now.

* I used the "distance" term in the executor, although the ORDER BY expr
machinery is more general than that. The value returned by the ORDER BY
expression doesn't have to be a distance, although that's the only thing
supported by GiST and the built-in opclasses.

* I short-circuited the planner to assume that the ORDER BY expression
always returns a float. That's true today for knn-GiST, but is obviously
a bogus assumption in general.

This needs some work to get into a committable state, but from a
modularity point of view, this is much better than having the indexam to
peek into the heap.

- Heikki

Attachment Content-Type Size
knn-gist-recheck-distance-in-executor-1.patch text/x-diff 34.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Claudio Freire 2014-12-16 13:37:49 Re: Commitfest problems
Previous Message David Fetter 2014-12-16 13:31:45 Re: NUMERIC private methods?