Re: Yet another fast GiST build

From: Oleg Bartunov <obartunov(at)postgrespro(dot)ru>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Darafei Komяpa Praliaskouski <me(at)komzpa(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Yet another fast GiST build
Date: 2020-09-10 10:21:09
Message-ID: CAF4Au4wCj6W2JZUQD3BU+Qca5e853Om6QTgmm+TsiBUzePM-bg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 7, 2020 at 7:50 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 07/09/2020 13:59, Pavel Borisov wrote:
> >>>> I suppose there is a big jump in integer value (whether signed or
> >>>> unsigned) as you cross from positive to negative floats, and then the
> >>>> sort order is reversed. I have no idea if either of those things is a
> >>>> problem worth fixing. That made me wonder if there might also be an
> >>
> >> I took a stab at fixing this, see attached patch (applies on top of your
> >> patch v14).
> >>
> >> To evaluate this, I used the other attached patch to expose the zorder
> >> function to SQL, and plotted points around zero with gnuplot. See the
> >> attached two images, one with patch v14, and the other one with this patch.
> >
> > I'd made testing of sorted SpGist build in cases of points distributed only
> > in 2d quadrant and points in all 4 quadrants and it appears that this
> > abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
> > is really nice way to avoid what can be avoided and I'd like it is included
> > together with Andrey's patch.
>
> Thanks! Did you measure the quality of the built index somehow? The
> ordering shouldn't make any difference to the build speed, but it
> affects the shape of the resulting index and the speed of queries
> against it.
>
> I played with some simple queries like this:
>
> explain (analyze, buffers) select count(*) from points_good where p <@
> box(point(50, 50), point(75, 75));
>
> and looking at the "Buffers" line for how many pages were accessed.
> There doesn't seem to be any consistent difference between v14 and my
> fix. So I concur it doesn't seem to matter much.
>
>
> I played some more with plotting the curve. I wrote a little python
> program to make an animation of it, and also simulated how the points
> would be divided into pages, assuming that each GiST page can hold 200
> tuples (I think the real number is around 150 with default page size).
> In the animation, the leaf pages appear as rectangles as it walks
> through the Z-order curve. This is just a simulation by splitting all
> the points into batches of 200 and drawing a bounding box around each
> batch. I haven't checked the actual pages as the GiST creates, but I
> think this gives a good idea of how it works.

Heikki, you may use our gevel extension to visualize index tree
http://www.sai.msu.su/~megera/wiki/Gevel
I used it to investigate rtree index
http://www.sai.msu.su/~megera/wiki/Rtree_Index

>
> The animation shows that there's quite a lot of overlap between the
> pages. It's not necessarily this patch's business to try to improve
> that, and the non-sorting index build isn't perfect either. But it
> occurs to me that there's maybe one pretty simple trick we could do:
> instead of blindly filling the leaf pages in Z-order, collect tuples
> into a larger buffer, in Z-order. I'm thinking 32 pages worth of tuples,
> or something in that ballpark, or maybe go all the way up to work_mem.
> When the buffer fills up, call the picksplit code to divide the buffer
> into the actual pages, and flush them to disk. If you look at the
> animation and imagine that you would take a handful of pages in the
> order they're created, and re-divide the points with the split
> algorithm, there would be much less overlap.

Interesting to see also the size of index, it should be several times less.

>
> - Heikki

--
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2020-09-10 10:31:07 Re: Bug in logical decoding of in-progress transactions
Previous Message vignesh C 2020-09-10 10:16:58 Re: Improvements in Copy From