Re: WIP: Fast GiST index build

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Fast GiST index build
Date: 2011-07-22 09:38:35
Message-ID: CAPpHfdvoOh78ycX3eRePTS29635pHCSTfLdFDzHZhxTKsggCuQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

Patch with my try to detect ordered datasets is attached. The implemented
idea is desribed below.
Index tuples are divided by chunks of 128. On each chunk we measure how much
leaf pages where index tuples was inserted don't match those of previous
chunk. Based on statistics of several chunks we estimate distribution of
accesses between lead pages (exponential distribution law is accumed and
it's seems to be an error). After that we can estimate portion of index
tuples which can be processed without actual IO. If this estimate exceeds
threshold then we should switch to buffering build.
Now my implementation successfully detects randomly mixed datasets and well
ordered datasets, but it's seems to be too optimistic about intermediate
cases. I believe it's due to wrong assumption about distribution law.
Do you think this approach is acceptable? Probably there are some researches
about distribution law for such cases (while I didn't find anything relevant
in google scholar)?
As an alternative I can propose take into account actual average IO
operations per tuple rather then an estimate.

------
With best regards,
Alexander Korotkov.

On Mon, Jul 18, 2011 at 10:00 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com>wrote:

> Hi!
>
> New version of patch is attached. There are following changes.
> 1) Since proposed tchnique is not always a "fast" build, it was renamed
> everywhere in the patch to "buffering" build.
> 2) Parameter "buffering" now has 3 possible values "yes", "no" and "auto".
> "auto" means automatic switching from regular index build to buffering one.
> Currently it just switch when index size exceeds maintenance_work_mem.
> 3) Holding of many buffers pinned is avoided.
> 4) Rebased with head.
>
> TODO:
> 1) Take care about ordered datasets in automatic switching.
> 2) Take care about concurrent backends in automatic switching.
>
> ------
> With best regards,
> Alexander Korotkov.
>

Attachment Content-Type Size
gist_fast_build-0.8.0.patch.gz application/x-gzip 28.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kohei Kaigai 2011-07-22 09:55:37 Re: [v9.1] sepgsql - userspace access vector cache
Previous Message Heikki Linnakangas 2011-07-22 09:29:43 Re: Questions and experiences writing a Foreign Data Wrapper