
Re: WIP: Fast GiST index build

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Fast GiST index build
Date: 2011-09-05 16:30:09
Message-ID: 4E64F911.8010007@enterprisedb.com
Lists: pgsql-hackers
On 05.09.2011 14:10, Heikki Linnakangas wrote:
> On 01.09.2011 12:23, Alexander Korotkov wrote:
>> On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas<
>> heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>
>>> So I changed the test script to generate the table as:
>>>
>>> CREATE TABLE points AS SELECT random() as x, random() as y FROM
>>> generate_series(1, $NROWS);
>>>
>>> The unordered results are in:
>>>
>>>           testname           |   nrows   |    duration     | accesses
>>> -----------------------------+-----------+-----------------+----------
>>>  points unordered buffered   | 250000000 | 05:56:58.575789 |  2241050
>>>  points unordered auto       | 250000000 | 05:34:12.187479 |  2246420
>>>  points unordered unbuffered | 250000000 | 04:38:48.663952 |  2244228
>>>
>>> Although the buffered build doesn't lose as badly as it did with more
>>> overlap, it still doesn't look good :-(. Any ideas?
>>
>> But it's still a lot of overlap. It's about 220 accesses per small-area
>> request, which is about 10-20 times more than there should be without
>> overlaps. If we roughly assume that 10 times more overlap means only
>> 1/10 of the tree is used for actual inserts, then that part of the
>> tree can easily fit in the cache.
>> You can try my splitting algorithm on your test setup (in this case I
>> advise starting from a smaller number of rows, 100M for example).
>> I'm requesting from Oleg some real-life datasets that cause trouble in
>> practice. Probably those datasets are even larger, or the new linear
>> split produces less overlap on them.
>
> I made a small tweak to the patch, and got much better results (this is
> with my original method of generating the data):
>
>            testname           |   nrows   |    duration     | accesses
> -----------------------------+-----------+-----------------+----------
>  points unordered buffered   | 250000000 | 03:34:23.488275 |  3945486
>  points unordered auto       | 250000000 | 02:55:10.248722 |  3767548
>  points unordered unbuffered | 250000000 | 04:02:04.168138 |  4564986

The full results of this test are in:

           testname           |   nrows   |    duration     | accesses
-----------------------------+-----------+-----------------+----------
  points unordered buffered   | 250000000 | 03:34:23.488275 |  3945486
  points unordered auto       | 250000000 | 02:55:10.248722 |  3767548
  points unordered unbuffered | 250000000 | 04:02:04.168138 |  4564986
  points ordered buffered     | 250000000 | 02:00:10.467914 |  5572906
  points ordered auto         | 250000000 | 02:16:01.859039 |  5435673
  points ordered unbuffered   | 250000000 | 03:23:18.061742 |  1875826
(6 rows)

Interestingly, in this test case the buffered build was significantly 
faster even with ordered input - but the quality of the produced index 
was much worse. I suspect this is because of the last-in-first-out 
nature of the buffering: tuples that are pushed into a buffer first are 
flushed to the lower levels last. Tweaking the data structures to make 
the buffer flushing a FIFO process might help with that, but I'm afraid 
it might make our cache hit ratio worse when reading pages from the 
temporary file.
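As a toy illustration of the difference (this is not the patch's actual 
buffer data structure, just a sketch of the two disciplines), a stack 
reverses the order in which buffered tuples reach the lower levels of 
the tree, while a queue preserves insertion order:

```python
from collections import deque

def flush_order(tuples, fifo):
    """Simulate the order in which buffered tuples are flushed to the
    lower tree levels, depending on the buffer discipline."""
    if fifo:
        buf = deque(tuples)   # FIFO: oldest tuple leaves first
        return [buf.popleft() for _ in range(len(buf))]
    else:
        buf = list(tuples)    # LIFO: newest tuple leaves first
        return [buf.pop() for _ in range(len(buf))]

tuples = ["t1", "t2", "t3", "t4"]
print(flush_order(tuples, fifo=False))  # LIFO: ['t4', 't3', 't2', 't1']
print(flush_order(tuples, fifo=True))   # FIFO: ['t1', 't2', 't3', 't4']
```

With sorted input and LIFO flushing, the tuples arriving at the lower 
levels are effectively in reverse-sorted bursts, which could plausibly 
hurt the clustering of the resulting index pages.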

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
