Re: [PATCH] reduce page overlap of GiST indexes built using sorted method

From: Björn Harrtell <bjorn(dot)harrtell(at)gmail(dot)com>
To: Aliaksandr Kalenik <akalenik(at)kontur(dot)io>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] reduce page overlap of GiST indexes built using sorted method
Date: 2022-01-17 22:54:08
Message-ID: CANhDX=bhvpzD968f7Y7MGgCDt=K0f-MRUj-YJYNBYuB9=SVCQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Aliaksandr,

Nice work on this. I've been following it a bit since the regression when
it was noted and it sparked renewed interest in R-tree structure and
optimization for me.

As for ideas. I'm not deep into details of postgresql and gist, but I've
learned that the node size for gist indexes are equal to the page size, and
as such they are quite large in this context. This is what caused the
overly flat tree when increasing the fill factor, if I'm not mistaken, and
why it helps to decrease the fill factor.

I think there is a large potential if tree node size could be tweaked to
fit the use case and combine higher fill factor with optimal tree depth.
It's data dependent so it would even make sense to make it an index
parameter, if possible.

There might be some deep reason in the architecture that I'm unaware of
that could make it difficult to affect the node size but regardless, I
believe there could be a substantial win if node size could be controlled.

Regards,

Björn

Den mån 17 jan. 2022 kl 23:46 skrev Aliaksandr Kalenik <akalenik(at)kontur(dot)io>:

> Hey!
>
> Postgres 14 introduces an option to create a GiST index using a sort
> method. It allows to create indexes much faster but as it had been
> mentioned in sort support patch discussion the faster build performance
> comes at cost of higher degree of overlap between pages than for indexes
> built with regular method.
>
>
> Sort support was implemented for GiST opclass in PostGIS but eventually
> got removed as default behaviour in latest 3.2 release because as it had
> been discovered by Paul Ramsey
> https://lists.osgeo.org/pipermail/postgis-devel/2021-November/029225.html
> performance of queries might degrade by 50%.
>
> Together with Darafei Praliaskouski, Andrey Borodin and me we tried
> several approaches to solve query performance degrade:
>
> - The first attempt was try to decide whether to make a split
> depending on direction of curve (Z-curve for Postgres geometry type,
> Hilbert curve for PostGIS). It was implemented by filling page until
> fillfactor / 2 and then checking penalty for every next item and keep
> inserting in current page if penalty is 0 or start new page if penalty is
> not 0. It turned out that with this approach index becomes significantly
> larger whereas pages overlap still remains high.
> - Andrey Borodin implemented LRU + split: a fixed amount of pages are
> kept in memory and the best candidate page to insert the next item in is
> selected by minimum penalty among these pages. If the best page for
> insertion is full, it gets splited into multiple pages, and if the amount
> of candidate pages after split exceeds the limit, the pages insertion to
> which has not happened recently are flushed.
> https://github.com/x4m/postgres_g/commit/0f2ed5f32a00f6c3019048e0c145b7ebda629e73.
> We made some tests and while query performance speed using index built with
> this approach is fine a size of index is extremely large.
>
> Eventually we made implementation of an idea outlined in sort support
> patch discussion here
> https://www.postgresql.org/message-id/flat/08173bd0-488d-da76-a904-912c35da446b(at)iki(dot)fi#09ac9751a4cde897c99b99b2170faf3a
> that several pages can be collected and then divided into actual index
> pages by calling picksplit. My benchmarks with data provided in
> postgis-devel show that query performance using index built with patched
> sort support is comparable with performance of query using index built with
> regular method. The size of index is also matches size of index built with
> non-sorting method.
>
>
> It should be noted that with the current implementation of the sorting
> build method, pages are always filled up to fillfactor. This patch changes
> this behavior to what it would be if using a non-sorting method, and pages
> are not always filled to fillfactor for the sake of query performance. I'm
> interested in improving it and I wonder if there are any ideas on this.
>
>
> Benchmark summary:
>
>
> create index roads_rdr_idx on roads_rdr using gist (geom);
>
>
> with sort support before patch / CREATE INDEX 76.709 ms
>
> with sort support after patch / CREATE INDEX 225.238 ms
>
> without sort support / CREATE INDEX 446.071 ms
>
>
> select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
>
>
> with sort support before patch / SELECT 5766.526 ms
>
> with sort support after patch / SELECT 2646.554 ms
>
> without sort support / SELECT 2721.718 ms
>
>
> index size
>
>
> with sort support before patch / IDXSIZE 2940928 bytes
>
> with sort support after patch / IDXSIZE 4956160 bytes
>
> without sort support / IDXSIZE 5447680 bytes
>
> More detailed:
>
> Before patch using sorted method:
>
>
> postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using
> gist(geom);
>
> CREATE INDEX
>
> Time: 76.709 ms
>
>
> postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> b.geom;
>
> count
>
> --------
>
> 505806
>
> (1 row)
>
>
> Time: 5766.526 ms (00:05.767)
>
> postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> b.geom;
>
> count
>
> --------
>
> 505806
>
> (1 row)
>
>
> Time: 5880.142 ms (00:05.880)
>
> postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> b.geom;
>
> count
>
> --------
>
> 505806
>
> (1 row)
>
>
> Time: 5778.437 ms (00:05.778)
>
>
> postgres=# select gist_stat('roads_rdr_geom_idx_sortsupport');
>
> gist_stat
>
> ------------------------------------------
>
> Number of levels: 3 +
>
> Number of pages: 359 +
>
> Number of leaf pages: 356 +
>
> Number of tuples: 93034 +
>
> Number of invalid tuples: 0 +
>
> Number of leaf tuples: 92676 +
>
> Total size of tuples: 2609260 bytes+
>
> Total size of leaf tuples: 2599200 bytes+
>
> Total size of index: 2940928 bytes+
>
>
>
> (1 row)
>
> After patch using sorted method:
>
> postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using
> gist(geom);
>
> CREATE INDEX
>
> Time: 225.238 ms
>
>
> postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> b.geom;
>
> count
>
> --------
>
> 505806
>
> (1 row)
>
>
> Time: 2646.554 ms (00:02.647)
>
> postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> b.geom;
>
> count
>
> --------
>
> 505806
>
> (1 row)
>
>
> Time: 2499.107 ms (00:02.499)
>
> postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> b.geom;
>
> count
>
> --------
>
> 505806
>
> (1 row)
>
>
> Time: 2519.815 ms (00:02.520)
>
>
> postgres=# select gist_stat('roads_rdr_geom_idx_sortsupport');
>
> gist_stat
>
> ------------------------------------------
>
> Number of levels: 3 +
>
> Number of pages: 605 +
>
> Number of leaf pages: 600 +
>
> Number of tuples: 93280 +
>
> Number of invalid tuples: 0 +
>
> Number of leaf tuples: 92676 +
>
> Total size of tuples: 2619100 bytes+
>
> Total size of leaf tuples: 2602128 bytes+
>
> Total size of index: 4956160 bytes+
>
>
>
> (1 row)
>
> With index built using default method:
>
> postgres=# create index roads_rdr_geom_idx_no_sortsupport on roads_rdr
> using gist(geom);
>
> CREATE INDEX
>
> Time: 446.071 ms
>
> postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> b.geom;
>
> count
>
> --------
>
> 505806
>
> (1 row)
>
>
> Time: 2721.718 ms (00:02.722)
>
> postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> b.geom;
>
> count
>
> --------
>
> 505806
>
> (1 row)
>
>
> Time: 3549.549 ms (00:03.550)
>
> postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
> b.geom;
>
> count
>
> --------
>
> 505806
>
> (1 row)
>
>
> postgres=# select gist_stat('roads_rdr_geom_idx_no_sortsupport');
>
> gist_stat
>
> ------------------------------------------
>
> Number of levels: 3 +
>
> Number of pages: 665 +
>
> Number of leaf pages: 660 +
>
> Number of tuples: 93340 +
>
> Number of invalid tuples: 0 +
>
> Number of leaf tuples: 92676 +
>
> Total size of tuples: 2621500 bytes+
>
> Total size of leaf tuples: 2602848 bytes+
>
> Total size of index: 5447680 bytes+
>
>
> (1 row)
>
>
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-01-17 23:08:02 Re: Push down time-related SQLValue functions to foreign server
Previous Message Peter Geoghegan 2022-01-17 22:41:10 Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations