Re: WIP: Fast GiST index build

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Fast GiST index build
Date: 2011-06-27 18:32:47
Message-ID: BANLkTi=1TL+pH2sxiR7nf3oMFdWAHHgSyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 27, 2011 at 6:34 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> The penalty function is called whenever a tuple is routed to the next level
> down, and the final tree has the same depth with and without the patch, so I
> would expect the number of penalty calls to be roughly the same. But clearly
> there's something wrong with that logic; can you explain in layman's terms
> why the patch adds so many gist penalty calls? And how many calls does it
> actually add, can you gather some numbers on that? Any ides on how to
> mitigate that, or do we just have to live with it? Or maybe use some
> heuristic to use the existing insertion method when the patch is not
> expected to be helpful?
>
In short due to parralel routing of many index tuples routing can alter. In
fast build algorithm index tuples are accumulating into node buffers. When
corresponding node splits we have to repocate index tuples from it. In
original algorithm we are relocating node buffers into buffers of new nodes
produced by split. Even this requires additional penalty calls.
But for improvement of index quality I modified algorithm. With my
modification index tuple of splitted node buffer can be relocated also into
other node buffers of same parent. It produces more penalty calls.
I didn't have an estimate yet, but I'm working on it. Unfortunatelly, I
haven't any idea about mitigating it except turning off my modification.
Heuristic is possible, but I feel following problems. At first, we need to
somehow estimate length of varlena keys. I avoid this estimate in fast
algorithm itself just assumed worst case, but I believe we need some more
precise for good heuristic. At second, the right decision is strongly depend
on concurrent load. When there are no concurrent load (as in my experiments)
fraction of tree which fits to effective cache is reasonable for estimating
benefit of IO economy. But with high concurrent load part of cache occupied
by tree should be considerable smaller than whole effective cache.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2011-06-27 18:34:16 Re: pg_upgrade defaulting to port 25432
Previous Message Bruce Momjian 2011-06-27 18:27:54 Re: pg_upgrade defaulting to port 25432