Re: GSOC Student Project Idea

From: Michael Schuh <schuh(dot)mike(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSOC Student Project Idea
Date: 2013-04-24 19:10:58
Message-ID: CAA43Kd0y0Hq04jMrhrfe3Yu4C94aBOoRto6Patu3z2=xh5r_Vw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 24, 2013 at 5:31 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:

> On Apr 23, 2013, at 23:25, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
> > I've taken a brief look at the paper and the implementation. As far as
> I can see, iDistance implements a global building strategy. I mean, for
> example, it selects some point, calculates distances from the selected
> point to all points in the dataset, etc. So, it uses the whole dataset at
> the same time.
> >
> > However, you can try to implement global index building in GiST or
> SP-GiST. In this case I think you should carefully estimate your
> capabilities within a single GSoC project. You would need to extend the
> GiST or SP-GiST interface and write a completely new implementation of
> the tree-building algorithm. The question of exactly how to extend the
> GiST or SP-GiST interface for this case could turn out to be very hard,
> even theoretically.
>
> +1. That seemed to be a major roadblock to me too when I read the paper.
>
> You could work around that by making partition identification a separate
> step. You'd have a function
>
> idist_analyze(cfg name, table name, field name)
>
> which'd identify suitable partitions for the data distribution in
> table.field and store them somewhere. Such a set of pre-identified
> partitions would be akin to a tsearch configuration, i.e. all other parts
> of the iDistance machinery would use it to map points to index keys and
> queries to ranges of those keys. You'll want to look at how tsearch handles
> that, and check if the method can indeed be applied to iDistance.
>
> In a first cut, you'd probably only allow inserts into the index which
> don't change the maximum distances from the partition centers that
> idist_analyze() found. With that restriction in place, you might get away
> without GiST or SP-GiST, and simply use postgres' standard btree, if you
> find a way to map queries of the form "field <idistance-knn-operator>
> idistance_knn_function(point, distance)" to "where (idistance_keys(field)
> between P1_lowerbound AND P1_upperbound) OR (idistance_keys(field) between
> P2_lowerbound AND P2_upperbound) …" or something equivalent.

Thank you both for the very helpful feedback. Perhaps the scope of this
project (the application's "completeness criteria") is better framed as
feasibility prototyping of the global/distance-based index strategy, with
B+-tree and/or GiST extension possibilities.
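
To make sure I understand the idist_analyze() suggestion: I picture that
step populating a small metadata table of reference points and radii,
which everything else would consult much the way tsearch functions consult
their configuration. A rough sketch, with all names being placeholders I'm
making up rather than anything that exists:

    -- One row per partition found by the analysis step.
    CREATE TABLE idistance_partitions (
        cfg       text,      -- configuration name, tsearch-style
        partition integer,   -- partition number i
        refpoint  float8[],  -- coordinates of reference point O_i
        maxdist   float8,    -- max distance from O_i seen at analyze time
        PRIMARY KEY (cfg, partition)
    );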

Let me briefly elaborate on the current implementation details related to
your comments. The public C++ version uses a stand-alone STX B+-tree
implementation, and has only been used in the research context of one-time
data loading and indexing followed by KNN retrieval performance testing.
This requires building the partitions globally up front (a sequential pass
over all data points) and keeping a bit of overhead information about the
partitions (such as reference point locations and the maximum distance
within each). Of course, there are a lot of improvements, modifications,
variants, etc., that can be built from this basic setup... but that's
mostly my in-progress thesis work, not yet published or defended :)
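
For reference, the key mapping itself is the usual iDistance scheme: a
point p is assigned to its nearest reference point O_i and keyed as
i * C + dist(p, O_i), where C is a constant larger than any partition
radius. A very rough sketch of how that could look on top of the metadata
table above (hypothetical names again, with C hard-coded to 1000000.0
purely for illustration; it would really have to be derived from the data):

    -- Map a point to its one-dimensional iDistance key.
    CREATE FUNCTION idistance_key(p float8[], cfg_name text)
    RETURNS float8 AS $$
        SELECT pt.partition * 1000000.0 + pt.dist   -- i * C + dist(p, O_i)
          FROM (SELECT ip.partition,
                       sqrt(sum((p[d] - ip.refpoint[d])^2)) AS dist
                  FROM idistance_partitions ip,
                       generate_subscripts(p, 1) AS d
                 WHERE ip.cfg = cfg_name
                 GROUP BY ip.partition) AS pt
         ORDER BY pt.dist                           -- nearest reference point
         LIMIT 1;
    $$ LANGUAGE sql STABLE;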

A given query is broken down into range(s) of key values in the B+-tree,
based on the (negligible) overhead information kept from partitioning and
on where the partitions are located with respect to the query. Only then
does this small subset of pages need to be loaded from disk, and scanning
those ranges is why it is necessary to have left/right leaf pointers
within the B+-tree. While I think a DBMS-integrated implementation would
be ideal for general use by everyone, my naive assumption is that I could
probably implement the iDistance functionality in stored procedures and
metadata tables (at the cost of performance and usability).
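
Concretely, I imagine the stored-procedure route rewriting a radius-r
search around a query point q the way Florian describes: one BETWEEN range
per partition whose annulus can intersect the query ball, plus a recheck
of the true distance. A sketch, again with made-up names (points, coords,
idist_key, and dist() are placeholders, and :q / :r stand in for the query
point and radius):

    -- Candidate rows for "points within distance :r of :q", assuming a
    -- btree index on points(idist_key), with idist_key computed by
    -- idistance_key(coords, 'my_cfg') at insert time.
    SELECT t.*
      FROM points t
      JOIN idistance_partitions ip ON ip.cfg = 'my_cfg'
     WHERE t.idist_key BETWEEN ip.partition * 1000000.0
                               + greatest(0, dist(:q, ip.refpoint) - :r)
                           AND ip.partition * 1000000.0
                               + least(ip.maxdist, dist(:q, ip.refpoint) + :r)
       AND dist(:q, t.coords) <= :r;   -- recheck against the true distance

KNN proper would then be this range search repeated with a growing radius,
as in the paper, and whether the planner actually turns the join-plus-
BETWEEN into per-partition btree range scans is exactly the kind of
question the feasibility prototype should answer.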

Best regards,
Mike Schuh
