Re: How to implement a SP-GiST index as a extension module?

From: Connor Wolf <connorw(at)imaginaryindustries(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: How to implement a SP-GiST index as a extension module?
Date: 2017-10-31 02:04:02
Message-ID: CAAVqP=quF+ddn-G134QNDAezoWqf92iCLp7PXgEfHjh7aHm2Ag@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I was mostly unclear on how I'd go about attaching the extension functions
to the relevant indexing mechanism. From the looks of the vptree.tar.gz
file (which is really, *really* helpful, incidentally!), a it's done via a
custom operator class, which then gets passed to the actual index creation
mechanism when you're declaring the index.

I think I had looked at that at one point, but it's been a while. In my
case, I'm using discrete-cosine-transform based perceptual hashes for
searching. They are nice and compact (64-bits per hash), while still
producing good search results. I have a dataset of ~36 million images, and
it does searches in < 50 milliseconds with a hamming distance of 4, while
touching ~0.25% of the tree (And occupying ~18 GB of ram).

My BK tree is up on github here
<https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp>,
if anyone needs something like that (BSD licensed, pretty well tested).
There's also a python wrapper for it.

I'll probably not have time to poke about until this weekend, but thanks!
Connor

On Mon, Oct 30, 2017 at 4:50 AM, Alexander Korotkov <
a(dot)korotkov(at)postgrespro(dot)ru> wrote:

> Hi!
>
> On Sun, Oct 29, 2017 at 12:07 PM, Connor Wolf <
> wolf(at)imaginaryindustries(dot)com> wrote:
>
>> I'm looking at implementing a custom indexing scheme, and I've been
>> having trouble understanding the proper approach.
>>
>> Basically, I need a BK tree, which is a tree-structure useful for
>> indexing arbitrary discrete metric-spaces (in my case, I'm interested in
>> indexing across the hamming edit-distance of perceptual hashes, for fuzzy
>> image searching). I'm *pretty* sure a SP-GiST index is the correct index
>> type, as my tree is intrinsically unbalanced.
>>
>
> Yes, SP-GiST is appropriate index type for BK tree. I'm pretty sure BK
> tree could be implemented as SP-GiST opclass.
> The only thing worrying me is selection pivot values for nodes. SP-GiST
> builds by insertion of index tuples on by one. First pivot value for root
> node in SP-GIST would be created once first leaf page overflows. Thus, you
> would have to select this pivot value basing on very small fraction in the
> beginning of dataset.
> As I know, BK tree is most efficient when root pivot value is selected
> after looking in whole dataset and then hierarchically to subtrees.
>
> BTW, did you try my extension for searching similar images. It's quite
> primitive, but works for some cases.
> https://github.com/postgrespro/imgsmlr
> https://wiki.postgresql.org/images/4/43/Pgcon_2013_similar_images.pdf
>
> I have a functional stand-alone implementation of a BK-Tree, and it works
>> very well, but the complexity of managing what is basically a external
>> index for my database has reached the point where it's significantly
>> problematic, and it seems to be it should be moved into the database.
>>
>
> Sure, moving this index to the database is right decision.
>
> Anyways, looking at the contents of postgres/src/backend/access/spgist,
>> it looks pretty straightforward in terms of the actual C implementation,
>> but I'm stuck understanding how to "install" a custom SP-GiST
>> implementation. There are several GiST indexing implementations in the
>> contrib directory, but no examples for how I'd go about implementing a
>> loadable SP-GiST index.
>>
>> Basically, my questions are:
>>
>> - Is it possible to implement a SP-GiST indexing scheme as a loadable
>> module?
>>
>> Yes, it's possible to define SP-GiST.
>
>>
>> - If so, how?
>>
>> The pretty same way as GiST opclass extension. You have to define
> supporting functions and operators and then define operator class over them.
>
>>
>> - And is there an example I can base my implementation off of?
>>
>>
>>
>> I'm relatively comfortable with C (much moreso with C++), but I haven't
>> spent a lot of time looking at the postgresql codebase. I don't think I
>> could start from a empty folder and make a properly-implemented module in
>> any reasonable period of time, so if I have a working example for some sort
>> of index that uses the same interfaces that would really help a lot
>>
> I don't think there is an example in PostgreSQL source code tree or on
> github. But I've attached by early experiment with VP-tree (seems to be
> pretty same as BK tree) using SP-GiST (see vptree.tar.gz). Basing on this
> experiment I realized that it's important to select root pivot value basing
> on the whole dataset. However, for your metric/dataset/queries it might
> appear to be different.
>
> It also would be nice to someday improve SP-GiST to support some global
> strategies on index creation. In particular, it would allow to resolve
> selection of pivot values problem that I mention above. Right now my
> colleagues and me don't have time for that. But I can assist you with
> advises if you will decide to implement that.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2017-10-31 02:16:36 Re: WIP: long transactions on hot standby feedback replica / proof of concept
Previous Message Badrul Chowdhury 2017-10-31 01:19:48 Re: Re: protocol version negotiation (Re: Libpq PGRES_COPY_BOTH - version compatibility)