Re: GSoC 2014 proposal

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Иван Парфилов <iparfilov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSoC 2014 proposal
Date: 2014-04-03 19:44:03
Message-ID: CAPpHfdteAfUo5YG1Q46Hq3nm9LDkFJcE3feOKp=rJwqPUrbvUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 3, 2014 at 11:21 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 04/03/2014 04:15 PM, Alexander Korotkov wrote:
>
>> On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com
>> >wrote:
>>
>> On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas <
>>> hlinnakangas(at)vmware(dot)com> wrote:
>>>
>>> The BIRCH algorithm as described in the paper describes building a tree
>>>> in memory. If I understood correctly, you're suggesting to use a
>>>> pre-built
>>>> GiST index instead. Interesting idea!
>>>>
>>>> There are a couple of signifcant differences between the CF tree
>>>> described in the paper and GiST:
>>>>
>>>> 1. In GiST, a leaf item always represents one heap tuple. In the CF
>>>> tree,
>>>> a leaf item represents a cluster, which consists of one or more tuples.
>>>> So
>>>> the CF tree doesn't store an entry for every input tuple, which makes it
>>>> possible to keep it in memory.
>>>>
>>>> 2. In the CF tree, "all entries in a leaf node must satisfy a threshold
>>>> requirement, with respect to a threshold value T: the diameter (or
>>>> radius)
>>>> has to be less than T". GiST imposes no such restrictions. An item can
>>>> legally be placed anywhere in the tree; placing it badly will just lead
>>>> to
>>>> degraded search performance, but it's still a legal GiST tree.
>>>>
>>>> 3. A GiST index, like any other index in PostgreSQL, holds entries also
>>>> for deleted tuples, until the index is vacuumed. So you cannot just use
>>>> information from a non-leaf node and use it in the result, as the
>>>> information summarized at a non-leaf level includes noise from the dead
>>>> tuples.
>>>>
>>>> Can you elaborate how you are planning to use a GiST index to implement
>>>> BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
>>>> strict in where in the tree an item can be stored, and lets the operator
>>>> class to specify exactly when a node is split etc.
>>>>
>>>>
>>> Hmmm, it's likely I've imagined something quite outside of this paper,
>>> and
>>> even already suggested it to Ivan... :)
>>> I need a little time to rethink it.
>>>
>>>
>> Using GiST we can implement BIRCH-like clustering like so:
>> 1) Build a CF tree as GiST index without restriction of T threshold value.
>> 2) Scan CF tree with threshold T with some auxiliary operator. If
>> consistent method see CF entry which diameter is greater than T then it
>> returns true. Otherwise it returns false and put this CF entry into output
>> area (could be either in-memory or temporary table).
>> 3) Process other steps of algorithm as usual.
>>
>
> I still don't understand how that would work. You can't trust the non-leaf
> entries, because their CF value can contain deleted entries. So you have to
> scan every tuple anyway. Once you do that, what's the point of the index?

Yeah, it is limitation of this idea. It's not going to be auto-updatable
CF. User can build index on freshly vacuumed table and play with clustering
some time. Updates on table during that time would be either allowed
clustering error or prohibited. Another potential solution is to let this
index to hold some snapshot of data. But it seems not possible to do in
extension now.

------
With best regards,
Alexander Korotkov.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-04-03 21:10:46 Re: [COMMITTERS] pgsql: Add ALTER TABLESPACE ... MOVE command
Previous Message Heikki Linnakangas 2014-04-03 19:21:57 Re: GSoC 2014 proposal