Re: WIP: Fast GiST index build

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Fast GiST index build
Date: 2011-08-12 08:23:55
Message-ID: 4E44E31B.8080202@enterprisedb.com
Lists: pgsql-hackers

On 11.08.2011 23:30, Alexander Korotkov wrote:
> On Thu, Aug 11, 2011 at 2:28 PM, Heikki Linnakangas<
> heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> On 10.08.2011 22:44, Alexander Korotkov wrote:
>>
>>> Manual and readme updates.
>>>
>>
>> Thanks, I'm reviewing these now.
>>
>> Do we want to expose the level-step and buffersize parameters to users?
>> They've been useful during testing, but I'm thinking we should be able to
>> guess good enough values for them automatically, and just remove the
>> options. It's pretty much impossible for a user to tune them correctly, it
>> would require deep knowledge of the buffering algorithm.
>>
>> I'm thinking that even when you explicitly turn buffering on, we should
>> still process the first 10000 or so tuples with simple inserts. That way we
>> always have a sample of tuples to calculate the average tuple size from.
>> It's plausible that if the input data is ordered, looking at the first N
>> tuples will give skewed sample, but I don't think there's much danger of
>> that in practice. Even if the data is ordered, the length of GiST tuples
>> shouldn't vary much.
>>
>> What happens if we get the levelstep and pagesPerBuffer estimates wrong?
>> How sensitive is the algorithm to that? Or will we run out of memory? Would
>> it be feasible to adjust those in the middle of the index build, if we e.g
>> exceed the estimated memory usage greatly?
>
>
> I see the following risks.
>
> For levelstep:
> Too small: less IO benefit than we could get.
> Too large:
> 1) If the subtree doesn't fit in effective_cache, much more IO than
> there should be (because of cache misses during buffer emptying).
> 2) If the last pages of the buffers don't fit in maintenance_work_mem,
> possible OOM.

Hmm, we could avoid running out of memory if we used an LRU cache
replacement policy on the buffer pages, instead of explicitly unloading
the buffers. 1) would still apply, though.

> For buffersize:
> Too small: less IO benefit, because the buffer size is relatively small
> in comparison with the sub-tree size.
> Too large: greater CPU overhead (because of more penalty calls) than
> necessary for the same IO benefit.
>
> Therefore I propose the following.
> 1) A too-large levelstep is the greatest risk. Let's use a pessimistic
> estimate for it. The pessimistic estimate has the following logic:
> largest sub-tree => maximal tuples per page => minimal tuple size
> Thus, by always using the minimal tuple size in the levelstep
> calculation, we exclude the greatest risks.
> 2) The risks for buffersize are comparable and not too critical. That's
> why I propose to use the size of the first 10000 tuples for the estimate.

Yep, sounds reasonable.

I think it would also be fairly simple to decrease levelstep and/or
adjust buffersize on-the-fly. The trick would be in figuring out the
heuristics on when to do that.

Another thing occurred to me while looking at the buffer emptying
process: at the moment, we stop emptying after we've flushed half a
buffer's worth of tuples. The point of that is to avoid overfilling a
lower-level buffer, in case the tuples we emptied all landed on the
same lower-level buffer. Wouldn't it be fairly simple to detect that
case explicitly, and stop the emptying process only if one of the
lower-level buffers really fills up? That should be more efficient, as
you would have to "swap" between different subtrees less often.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
