Re: GSoC 2014 proposal

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Иван Парфилов <iparfilov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSoC 2014 proposal
Date: 2014-04-03 13:15:17
Message-ID: CAPpHfdt6B-m7hz6LexAx9EriWKAWQmK=ZDHzEw3mxrL3nxUU4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas <
> hlinnakangas(at)vmware(dot)com> wrote:
>
>> The BIRCH algorithm as described in the paper describes building a tree
>> in memory. If I understood correctly, you're suggesting to use a pre-built
>> GiST index instead. Interesting idea!
>>
>> There are a couple of signifcant differences between the CF tree
>> described in the paper and GiST:
>>
>> 1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
>> a leaf item represents a cluster, which consists of one or more tuples. So
>> the CF tree doesn't store an entry for every input tuple, which makes it
>> possible to keep it in memory.
>>
>> 2. In the CF tree, "all entries in a leaf node must satisfy a threshold
>> requirement, with respect to a threshold value T: the diameter (or radius)
>> has to be less than T". GiST imposes no such restrictions. An item can
>> legally be placed anywhere in the tree; placing it badly will just lead to
>> degraded search performance, but it's still a legal GiST tree.
>>
>> 3. A GiST index, like any other index in PostgreSQL, holds entries also
>> for deleted tuples, until the index is vacuumed. So you cannot just use
>> information from a non-leaf node and use it in the result, as the
>> information summarized at a non-leaf level includes noise from the dead
>> tuples.
>>
>> Can you elaborate how you are planning to use a GiST index to implement
>> BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
>> strict in where in the tree an item can be stored, and lets the operator
>> class to specify exactly when a node is split etc.
>>
>
> Hmmm, it's likely I've imagined something quite outside of this paper, and
> even already suggested it to Ivan... :)
> I need a little time to rethink it.
>

Using GiST we can implement BIRCH-like clustering like so:
1) Build a CF tree as GiST index without restriction of T threshold value.
2) Scan CF tree with threshold T with some auxiliary operator. If
consistent method see CF entry which diameter is greater than T then it
returns true. Otherwise it returns false and put this CF entry into output
area (could be either in-memory or temporary table).
3) Process other steps of algorithm as usual.

This modification would have following advantages:
1) User can build GiST index once and then try clustering with different
parameters. Initial GiST index build would be slowest operation while other
steps is expected to be fast.
2) Use GiST infrastructure and automatically get buffering build.

The drawback is that building GiST index is more expensive than building
in-memory CF tree with given threshold T (assuming T is well chosen).

Does it make any sense?

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-04-03 13:43:20 Re: quiet inline configure check misses a step for clang
Previous Message Andres Freund 2014-04-03 12:21:54 Re: GSoC proposal - "make an unlogged table logged"