Re: KNN-GiST with recheck

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-08-04 20:43:41
Message-ID: CAPpHfdu3JmBrsgsrtaAQJMpXyMgrW+T2P7Gmf9oP8L1V7zLvOg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Aug 3, 2014 at 5:48 PM, Emre Hasegeli <emre(at)hasegeli(dot)com> wrote:

> > > 1. This patch introduces a new "polygon <-> point" operator. That seems
> > > useful on its own, with or without this patch.
> >
> > Yeah, but exact-knn cant come with no one implementation. But it would
> > better come in a separate patch.
>
> I tried to split them. Separated patches are attached. I changed
> the order of the arguments as point <-> polygon, because point was
> the first one on all the others. Its commutator was required for
> the index, so I added it on the second patch. I also added tests
> for the operator. I think it is ready for committer as a separate
> patch. We can add it to the open CommitFest.
>
> I have made some cosmetic changes on the patches. I hope they are
> useful.
>
> I added support to point <-> circle operator with the same GiST
> distance function you added for polygon. I can change it, if it is not
> the right way.

Great, thanks!

> > 2. I wonder how useful it really is to allow mixing exact and non-exact

> > > return values from the distance function. The distance function
> included in
> > > the patch always returns recheck=true. I have a feeling that all other
> > > distance function will also always return either true or false.
> >
> > For geometrical datatypes recheck variations in consistent methods are
> also
> > very rare (I can't remember any). But imagine opclass for arrays where
> keys
> > have different representation depending on array length. For such opclass
> > and knn on similarity recheck flag could be useful.
>
> I also wonder how useful it is. Your example is convincing, but maybe
> setting it index-wide will make the decisions on the framework easier.
> For example, how hard would it be to decide if further sorting is
> required or not on the planner?
>

I think that for most use cases just some operators require further sorting
and some of them not. But it could appear one day that some index gives
part of its knn answers exact and part of them inexact. Same happen to
recheck of regular operators. Initially recheck flag was defined in
opclass. But later recheck became runtime flag.

> > 4. (as you mentioned in the other thread: ) It's a modularity violation
> > > that you peek into the heap tuple from gist. I think the proper way to
> do
> > > this would be to extend the IndexScan executor node to perform the
> > > re-shuffling of tuples that come from the index in wrong order, or
> perhaps
> > > add a new node type for it.
> > >
> > > Of course that's exactly what your partial sort patch does :-). I
> haven't
> > > looked at that in detail, but I don't think the approach the partial
> sort
> > > patch takes will work here as is. In the KNN-GiST case, the index is
> > > returning tuples roughly in the right order, but a tuple that it
> returns
> > > might in reality belong somewhere later in the ordering. In the partial
> > > sort patch, the "input stream" of tuples is divided into
> non-overlapping
> > > groups, so that the tuples within the group are not sorted, but the
> groups
> > > are. I think the partial sort case is a special case of the KNN-GiST
> case,
> > > if you consider the lower bound of each tuple to be the leading keys
> that
> > > you don't need to sort.
> >
> > Yes. But, for instance btree accesses heap for unique checking. Is really
> > it so crimilal? :-)
> > This is not only question of a new node or extending existing node. We
> need
> > to teach planner/executor access method can return value of some
> expression
> > which is lower bound of another expression. AFICS now access method can
> > return only original indexed datums and TIDs. So, I afraid that enormous
> > infrastructure changes are required. And I can hardly imagine what they
> > should look like.
>
> Unfortunately, I am not experienced enough to judge your implementation.
> As far as I understand the problem is partially sorting rows on
> the index scan node. It can lead the planner to choose non-optimal
> plans, because of not taking into account the cost of sorting.
>

Cost estimation of GiST is a big problem anyway. It doesn't care (and
can't) about amount of recheck for regular operators. In this patch, same
would be for knn recheck. The problem is that touching heap from access
method breaks incapsulation. One idea about this is to do sorting in
another nodes. However, I wonder if it would be an overengineering and
overhead. In attached patch I propose a different approach: put code
touching heap into separate index_get_heap_values function. Also new
version of patch includes regression tests and some cleanup.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
knn-gist-recheck-3.patch application/octet-stream 41.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message testman1316 2014-08-04 20:48:32 PostrgeSQL vs oracle doing 1 million sqrts am I doing it wrong?
Previous Message Claudio Freire 2014-08-04 18:30:21 Re: Proposal: Incremental Backup