Re: Yet another fast GiST build

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Darafei Komяpa Praliaskouski <me(at)komzpa(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Yet another fast GiST build
Date: 2020-09-07 14:10:08
Message-ID: 08173bd0-488d-da76-a904-912c35da446b@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 07/09/2020 13:59, Pavel Borisov wrote:
>>>> I suppose there is a big jump in integer value (whether signed or
>>>> unsigned) as you cross from positive to negative floats, and then the
>>>> sort order is reversed. I have no idea if either of those things is a
>>>> problem worth fixing. That made me wonder if there might also be an
>>
>> I took a stab at fixing this, see attached patch (applies on top of your
>> patch v14).
>>
>> To evaluate this, I used the other attached patch to expose the zorder
>> function to SQL, and plotted points around zero with gnuplot. See the
>> attached two images, one with patch v14, and the other one with this patch.
>
> I'd made testing of sorted SpGist build in cases of points distributed only
> in 2d quadrant and points in all 4 quadrants and it appears that this
> abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
> is really nice way to avoid what can be avoided and I'd like it is included
> together with Andrey's patch.

Thanks! Did you measure the quality of the built index somehow? The
ordering shouldn't make any difference to the build speed, but it
affects the shape of the resulting index and the speed of queries
against it.

I played with some simple queries like this:

explain (analyze, buffers) select count(*) from points_good where p <@
box(point(50, 50), point(75, 75));

and looking at the "Buffers" line for how many pages were accessed.
There doesn't seem to be any consistent difference between v14 and my
fix. So I concur it doesn't seem to matter much.

I played some more with plotting the curve. I wrote a little python
program to make an animation of it, and also simulated how the points
would be divided into pages, assuming that each GiST page can hold 200
tuples (I think the real number is around 150 with default page size).
In the animation, the leaf pages appear as rectangles as it walks
through the Z-order curve. This is just a simulation by splitting all
the points into batches of 200 and drawing a bounding box around each
batch. I haven't checked the actual pages as the GiST creates, but I
think this gives a good idea of how it works.

The animation shows that there's quite a lot of overlap between the
pages. It's not necessarily this patch's business to try to improve
that, and the non-sorting index build isn't perfect either. But it
occurs to me that there's maybe one pretty simple trick we could do:
instead of blindly filling the leaf pages in Z-order, collect tuples
into a larger buffer, in Z-order. I'm thinking 32 pages worth of tuples,
or something in that ballpark, or maybe go all the way up to work_mem.
When the buffer fills up, call the picksplit code to divide the buffer
into the actual pages, and flush them to disk. If you look at the
animation and imagine that you would take a handful of pages in the
order they're created, and re-divide the points with the split
algorithm, there would be much less overlap.

- Heikki

Attachment Content-Type Size
gist-point-test.py text/x-python 1.8 KB
zorder.mp4 video/mp4 3.4 MB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Konstantin Knizhnik 2020-09-07 14:20:53 Re: Improving connection scalability: GetSnapshotData()
Previous Message Juan José Santamaría Flecha 2020-09-07 14:03:38 Re: A micro-optimisation for walkdir()