GSoC'14: KNN on SP-GiST

From: Vladislav Sterzhanov <gliderok(at)gmail(dot)com>
To: pgsql-students(at)postgresql(dot)org
Subject: GSoC'14: KNN on SP-GiST
Date: 2014-03-26 23:19:18
Message-ID: CAK2aJC1_HZ-YNA_BQmqezr0Tb0dQALcWNjCcqBju2hes3OxeKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-students

Greetings, Yusuph Adekoya!
My fault, forgot that it's only visible for mentors, quoting here (with the
feedback from Heikki Linnakangas):

My Project:
>
> Rather straightforward, the ultimate goal of the project is in empowering
> PostgreSQL's SP-GiST index type with a capability of performing
> nearest-neighbour search, which naturally divides in a sequence of goals:
> first of all, the general SP-GiST interface for KNN should be designed and
> introduced. Then comes adjusting of all the currently existing SP-GiST
> structures (k-d tree, radix trie over text and quad trees) in order to make
> all of them use appropriate nearest-neighbourhood picking logic and,
> finally, a thorough testing of each one.
>
> As I see it, pushing the general high-level interface down to
> implementations is nescessary in this situation (in contrast to regular
> PostgreSQL practices) since the neighbouring logic in different SP-GiSTs
> can vary: e.g. in a quad-tree, if we habe two leaves branching from a
> single down-level inner node and thus belonging to the closest (in their
> relation) quadrant we can't guarantee their "global" closeness (in contrast
> to e.g. text trie where it's true for two leaves having the most common
> prefix).
>
> Schedule / Timeline:
>
> - Milestone 1 - Pre-Community-bonding - get to know the related codebase
> of PostgreSQL in order to ensure that the following work will continue in
> accordance with best practices.
>
> - Milestone 2 - Community-bonding + 2 weeks: Get everything structured and
> discuss project with community (due to the late application). Set up coding
> and testing environment. By this time I need to clearly imagine and sketch
> the project.
>
> - Milestone 3 - till the early June - implement the general high-level
> interface.
>
> - Milestone 4 - till the midterm - implement and test the specification
> for text radix-trie. Ensure everything goes at right direction, prepare for
> midterm.
>
> - Milestone 5 - till end of July - implement the specification for k-d
> quad- trees.
>
> - Milestone 6 - till final eval - test everything up thoroughly, final
> discussion of results and new directions with the community.
>
> *Heikki Linnakangas* March 23, 2014, 2:04 p.m.<http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/quadrocube/5662278724616192?verified=True#c5727390428823552>
>
> Sounds good. There are some design details that need to be figured out: *
> What new support functions does an operator class need to implement, to
> support kNN-searches? (GiST needs a "distance" function, is that how it
> would work with SP-GiST too?) * You said you'd implement the support for
> text tries, for text datatype, k-d and quad trees, presumably for points.
> Sounds good, but I wonder what does the query that using text look like? We
> already have kNN-GiST support for points, so presumably the kNN support for
> SP-GiST would help with the same operators as the GiST ones, but what about
> text? Can you give an example of a query that could use the kNN support for
> text? * (related to previous point) I suspect it might be easier and more
> useful to implement the support for points than text, so you might want to
> do that first.
>
> *quadrocube* March 26, 2014, 11:12 p.m.<http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/quadrocube/5662278724616192?verified=True#c5698390809640960>
>
> Actually, GiST requires a distance relation only on queue entries to
> maintain a currently-nearest list of nodes to traverse the tree in the
> smallest-distance-first order. I'll need the similar queue for SP-GiST, but
> for a different purpose - simply keep track of the best-observed-points,
> without any order-modification logic. KNN search for k-d tree is
> essentially a recursive descend down the tree, populating the
> nearest-neighbour heap, followed by an ascension, which updates this heap
> by traversing the unvisited branches if needed (bounds-overlap-ball test on
> the domain of the corresponding inner nodes). KNN for quadtree is
> essentially the same, keeping in mind more intuitive space organization
> techniques. Regarding text tries - the generalization is rather obvious -
> "Find K most lexicographically closest strings from the subset to a given
> one", but the practical application of the queries like this is debatable -
> I personally thought of some kind of bioinformatics genome manipulation
> applications, but have almost no experience in the field. Moreover,
> recently I've stumbled upon a k-d tree rebuild optimization which ensures
> an average O(log n) complexity of the NN-search operation for every
> possible point distribution (which could otherwise return to O(N) for a
> specific node distribution patterns), but it needs a complete re-structure
> of the tree thus requiring the "build-once-use-often" way to work with it.
> (For instance, a link to a mentioned modification:
> http://dl.acm.org/citation.cfm?doid=355744.355745) So, it seems like the
> text implentation can really be put at the end of the to-do list, probably
> even being replaced by the closely related k-d search optimization.
>

Browse pgsql-students by date

  From Date Subject
Next Message Madhurima Das 2014-04-07 18:24:37 unable to insert column using postgresql and C
Previous Message Vladislav Sterzhanov 2014-03-22 20:34:00 GSoC'14: KNN on SP-GiST