Re: WIP: Fast GiST index build

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Fast GiST index build
Date: 2011-08-26 14:31:23
Message-ID: CAPpHfdtbZifApY+_W0OrxypUcJ6U1QPvhPe2HYsAwy30wNFNNA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 25, 2011 at 10:53 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

>
>> In the tests on the first version of patch I found index quality of
>> regular
>> build much better than it of buffering build (without neighborrelocation).
>> Now it's similar, though it's because index quality of regular index build
>> become worse. There by in current tests regular index build is faster than
>> in previous. I see following possible causes of it:
>> 1) I didn't save source random data. So, now it's a new random data.
>> 2) Some environment parameters of my test setup may alters, though I
>> doubt.
>> Despite these possible explanation it seems quite strange for me.
>>
>
> That's pretty surprising. Assuming the data is truly random, I wouldn't
> expect a big difference in the index quality of one random data set over
> another. If the index quality depends so much on, say, the distribution of
> the few first tuples that are inserted to it, that's a quite interesting
> find on its own, and merits some further research.

Yeah, it's pretty strange. Using same random datasets in different tests can
help to exclude onepossible cause of difference.

> In order to compare index build methods on more qualitative indexes, I've
>> tried to build indexes with my double sorting split method (see:
>> http://syrcose.ispras.ru/2011/**files/SYRCoSE2011_Proceedings.**
>> pdf#page=36<http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36>).
>> So
>> on uniform dataset search is faster in about 10 times! And, as it was
>> expected, regular index build becomes much slower. It runs more than 60
>> hours and while only 50% of index is complete (estimated by file sizes).
>>
>> Also, automatic switching to buffering build shows better index quality
>> results in all the tests. While it's hard for me to explain that.
>>
>
> Hmm, makes me a bit uneasy that we're testing with a modified page
> splitting algorithm. But if the new algorithm is that good, could you post
> that as a separate patch, please?
>
I've post it in another message and I will try to get it into more
appropriate form. Let me clarify this a little. I don't think my split
algorithm is 10 times better than state of the art algorithms. I think that
currently used new linear split shows unreasonably bad results in may cases.
For example, uniformly distributed data is pretty easy case. And with almost
any splitting algorithm we can get index with almost zero overlaps. But new
linear split produces huge overlaps in this case. That's why I decided to
make some experiments with another split algorithm.

------
With best regards,
Alexander Korotkov.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dave Page 2011-08-26 14:36:27 Re: Questions and experiences writing a Foreign Data Wrapper
Previous Message Alexander Korotkov 2011-08-26 14:18:41 Re: WIP: Fast GiST index build