Re: [PATCH] Add sortsupport for range types and btree_gist

From: Bernd Helmle <mailings(at)oopsware(dot)de>
To: jian he <jian(dot)universality(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] Add sortsupport for range types and btree_gist
Date: 2024-02-08 18:14:05
Message-ID: 2f2e88bf1d44d06c1baf27bbf3db880f42a5cb87.camel@oopsware.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Am Mittwoch, dem 10.01.2024 um 22:18 +0800 schrieb jian he:
>
> I split the original author's patch into 2.
> 1. Add GiST sortsupport function for all the btree-gist module data
> types except anyrange data type (which actually does not in this
> module)
> 2. Add GiST sortsupport function for anyrange data type.
>

Please find attached a new version of this patch set with the following
changes/adjustments:

- Rebased to current master
- Heavily reworked *_cmp() functions to properly
decode GPT_VARKEY and GBT_KEY input.

For some datatypes the btree comparison functions were reused and the
input arguments not properly handled. This patch adds dedicated
btree_gist sortsupport comparison methods for all datatypes.

There was another patch from Andrey Borodin (thanks again for the hint)
and a deeper review done by Heikki in [1]. I've incorporated Heikkis
findings in this patch, too.

[...]

I've also updated the btree_gist documentation to reflect the default
sorted built strategy this patch introduces now.

Additionally i did some benchmarks again on this new version on the
patch. Still, index build speed improvement is quite impressive on the
dataset originally provided by Christoph Heiss (since its not available
anymore i've uploaded it here [2] again):

HEAD
(Index was built with default buffering setting)
---------------------
REINDEX (s) 4809
CREATE INDEX (s) 4920

btree_gist sortsupport
----------------------
REINDEX (s) 573
CREATE INDEX (s) 578

I created another pgbench based custom script to measure the single
core speed of the lookup query of the bench-gist.py script. This looks
like this:

init.sql
--------
BEGIN;

DROP TABLE IF EXISTS test_dataset;
CREATE TABLE test_dataset(keyid integer not null, id text not null,
block_range int4range);
CREATE TEMP SEQUENCE testset_seq;
INSERT INTO test_dataset SELECT nextval('testset_seq'), id, block_range
FROM test ORDER BY random() LIMIT 10000;
CREATE UNIQUE INDEX ON test_dataset(keyid);

COMMIT;

bench.pgbench
-------------

\set keyid random(1, 10000)
SELECT id, block_range FROM test_dataset WHERE keyid = :keyid \gset
SELECT id, block_range FROM test WHERE id = ':id' AND block_range &&
':block_range';

Run by

for in in `seq 1 3`; do psql -qXf init.pgbench && pgbench -n -r -c 1 -T
60 -f bench.pgbench; done

With this i get the following (on prewarmed index and table):

HEAD 
-------------------------------------
pgbench single core tps=248,67

btree_gist sortsupport
----------------------------
pgbench single core tps=1830,33

This is an average over 3 runs each (complete results attached). So
this looks really impressive and i hope i didn't do something entirely
wrong (still learning about this GiST stuff).

> what I am confused:
> In fmgr.h
>
> /*
>  * Support for cleaning up detoasted copies of inputs.  This must
> only
>  * be used for pass-by-ref datatypes, and normally would only be used
>  * for toastable types.  If the given pointer is different from the
>  * original argument, assume it's a palloc'd detoasted copy, and
> pfree it.
>  * NOTE: most functions on toastable types do not have to worry about
> this,
>  * but we currently require that support functions for indexes not
> leak
>  * memory.
>  */
> #define PG_FREE_IF_COPY(ptr,n) \
> do { \
> if ((Pointer) (ptr) != PG_GETARG_POINTER(n)) \
> pfree(ptr); \
> } while (0)
>
> but the doc
> (https://www.postgresql.org/docs/current/gist-extensibility.html)
>  says:
> All the GiST support methods are normally called in short-lived
> memory
> contexts; that is, CurrentMemoryContext will get reset after each
> tuple is processed. It is therefore not very important to worry about
> pfree'ing everything you palloc.
> ENDOF_QUOTE
>
> so I am not sure in range_gist_cmp, we need the following:
> `
> if ((Datum) range_a != a)
> pfree(range_a);
> if ((Datum) range_b != b)
> pfree(range_b);
> `

Turns out this is not true for sortsupport: the comparison function is
called for each tuple during sorting, which will leak the detoasted
(and probably copied datum) in the sort memory context. See the same
for e.g. numeric and text, which i needed to change to pass the key
values correctly to the text_cmp() or numeric_cmp() function in these
cases.

I've adapted the PG_FREE_IF_COPY() macro for these functions and
introduced GBT_FREE_IF_COPY() in btree_utils_var.h, since the former
relies on fcinfo.

I'll add the patch again to the upcoming CF for another review round.

[1]
https://www.postgresql.org/message-id/c0846e34-8b3a-e1bf-c88e-021eb241a481%40iki.fi

[2] https://drive.google.com/file/d/1CPNFGR53-FUto1zjXPMM2Yrn0GaGfGFz/view?usp=drive_link

Attachment Content-Type Size
bench.pgbench.results.txt text/plain 4.1 KB
v6-Add-GIST-sortsupport-btree-gist.patch text/x-patch 38.3 KB
v6-Add-GIST-sortsupport-rangetypes.patch text/x-patch 4.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Morris 2024-02-08 18:30:01 Re: Where can I find the doxyfile?
Previous Message Tom Lane 2024-02-08 18:08:21 Re: Avoiding concurrent calls to bindtextdomain()