Re: Fractal tree indexing

From: Greg Stark <stark(at)mit(dot)edu>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fractal tree indexing
Date: 2013-02-14 01:48:16
Message-ID: CAM-w4HOvnPuOiyWky3dtuYhXW_K+Un4VsLhWPE0=vdXOJ9+tbg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 13, 2013 at 1:31 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> On Wed, Feb 13, 2013 at 4:51 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
>>
>> Heikki was talking about a generic WAL record type that would just
>> store a binary delta between the version of the block when it was
>> locked and when it was unlocked. That would handle any extension
>> cleanly as far as data modification goes as long as the extension was
>> working through our buffer manager. It seems like an attractive idea
>> to me.
>
>
> It will, for sure, works well when atomic page changes are enough for us.
> However, some operations, for example, page splits, contain changes in
> multiple pages. Replaying changes in only some of pages is not fair. Now,
> it's hard for me to imagine how to generalize it into generic WAL record
> type.

I think multiple-page operations where you have all the pages locked
at the same time, like tuple updates for example, are fairly simple.
The existing WAL format can handle at most three such buffers in a
single record so we can just have a fixed size buffer large enough to
hold three buffers and perform the diff on the three when the
extension says it has completed an atomic update. The simplest API
which I suspect would suffice for virtually all users would be to tie
this to buffer locking and unlocking.

The types of records which this would not suffice for would be

a) things that need extra book-keeping during recovery in case some
later record was not recorded. We have only a few such records -- page
splits as you say -- and hopefully extensions wouldn't need them
because frankly they're pretty scary.

b) records that are needed not just to maintain consistency of the
data in the database but to provide some other behaviour -- for
instance the WAL records for locks that 2PC needs or the snapshot
records that hot standby needs. But then these types of records might
be more amenable to an extensible WAL format. Since the loss of them
wouldn't leave the database corrupted, just prevent that feature from
operating correctly.

--
greg

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2013-02-14 02:40:20 Re: BUG #7493: Postmaster messages unreadable in a Windows console
Previous Message Fabrízio de Royes Mello 2013-02-14 00:16:21 Re: proposal or just idea for psql - show first N rows from relation backslash statement