Re: Bug in new buffering GiST build code

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in new buffering GiST build code
Date: 2012-05-27 21:46:53
Message-ID: CAPpHfdvwEatx1+8ksdQey7HO=7eNxZ_TaQMJj4xaUB21p4jVWA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> I think we should rewrite the way we track the parents completely. Rather
> than keep a path stack attached to every node buffer, let's just maintain a
> second hash table that contains the parent of every internal node. Whenever
> a downlink is moved to another page, update the hash table with the new
> information. That way we always have up-to-date information about the
> parent of every internal node.
>
> That's much easier to understand than the path stack structures we have
> now. I think the overall memory consumption will be about the same too.
> Although we need the extra hash table with one entry for every internal
> node, we get rid of the path stack structs, which are also one per every
> internal node at the moment.
>
> I believe it is faster too. I added some instrumentation to the existing
> gist code (with the additional getNodeBuffer() call added to fix this bug),
> to measure the time spent moving right, when refinding the parent of a
> page. I added gettimeofday() calls before and after moving right, and
> summed the total. In my test case, the final index size was about 19GB, and
> the index build took 3545 seconds (59 minutes). Of that time, 580 seconds
> (~ 10 minutes) was spent moving right to refind parents. That's a lot. I
> also printed a line whenever a refind operation had to move right 20 pages
> or more. That happened 2482 times during the build, in the worst case we
> moved right over 40000 pages.
>
> Attached is a patch to replace the path stacks with a hash table. With
> this patch, the index build time in my test case dropped from 59 minutes to
> about 55 minutes. I don'ẗ know how representative or repeatable this test
> case is, but this definitely seems very worthwhile, not only because it
> fixes the bug and makes the code simpler, but also on performance grounds.
>

Cool, seems that we've both simplier and faster implementation of finding
parent. Thanks!

> Alexander, do you still have the test environments and data lying around
> that you used for GiST buffering testing last summer? Could you rerun some
> of those tests with this patch?

I think I can restore test environment and data. Will rerun tests soon.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2012-05-27 22:14:01 Re: Synchronized scans versus relcache reinitialization
Previous Message Euler Taveira 2012-05-27 19:54:24 Re: No, pg_size_pretty(numeric) was not such a hot idea