GiST kNN search queue (Re: KNN-GiST with recheck)

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-10 15:50:16
Message-ID: 54886BB8.9040000@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 01/28/2014 04:12 PM, Alexander Korotkov wrote:
>> >3. A binary heap would be a better data structure to buffer the rechecked
>> >values. A Red-Black tree allows random insertions and deletions, but in
>> >this case you need to insert arbitrary values but only remove the minimum
>> >item. That's exactly what a binary heap excels at. We have a nice binary
>> >heap implementation in the backend that you can use, see
>> >src/backend/lib/binaryheap.c.
>> >
> Hmm. For me binary heap would be a better data structure for KNN-GiST at
> all :-)

I decided to give this a shot, replacing the red-black tree in GiST with
the binary heap we have in lib/binaryheap.c. It made the GiST code
somewhat simpler, as the binaryheap interface is simpler than the
red-black tree one. Unfortunately, performance was somewhat worse. That
was quite surprising, as insertions and deletions are both O(log N) in
both data structures, but the red-black tree implementation is more
complicated.

I implemented another data structure called a Pairing Heap. It's also a
fairly simple data structure, but insertions are O(1) instead of O(log
N). It also performs fairly well in practice.

With that, I got a small but measurable improvement. To test, I created
a table like this:

create table gisttest (id integer, p point);
insert into gisttest select id, point(random(), random()) from
generate_series(1, 1000000) id;
create index i_gisttest on gisttest using gist (p);

And I ran this query with pgbench:

select id from gisttest order by p <-> '(0,0)' limit 1000;

With unpatched master, I got about 650 TPS, and with the patch 720 TPS.
That's a nice little improvement, but perhaps more importantly, the
pairing heap implementation consumes less memory. To measure that, I put
a MemoryContextStats(so->queueCtx) call into gistendscan. With the above
query, but without the "limit" clause, on master I got:

GiST scan context: 2109752 total in 10 blocks; 2088456 free (24998
chunks); 21296 used

And with the patch:

GiST scan context: 1061160 total in 9 blocks; 1040088 free (12502
chunks); 21072 used

That's 2MB vs 1MB. While that's not much in absolute terms, it'd be nice
to reduce that memory consumption, as there is no hard upper bound on
how much might be needed. If the GiST tree is really disorganized for
some reason, a query might need a lot more.

So all in all, I quite like this patch, even though it doesn't do
anything too phenomenal. It adds a some code, in the form of the new
pairing heap implementation, but it makes the GiST code a little bit
simpler. And it gives a small performance gain, and reduces memory usage
a bit.

- Heikki

Attachment Content-Type Size
knn-gist-pairingheap-1.patch text/x-diff 17.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-12-10 15:57:57 Re: logical column ordering
Previous Message Petr Jelinek 2014-12-10 15:03:18 Re: [COMMITTERS] pgsql: Keep track of transaction commit timestamps