Re: How to implement a SP-GiST index as a extension module?

From: Connor Wolf <connorw(at)imaginaryindustries(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: How to implement a SP-GiST index as a extension module?
Date: 2017-11-02 04:23:17
Message-ID: CAAVqP=r2ADz49eKpc5uPwfDftAwVKKhra=ty1Vtr0xjung-TgQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Ok, more questions.

I've been studying the implementation Alexander Korotkov sent, and I'm not
seeing how to map
some of the components onto the changes in the SP-GiST system that occured
between Postgresql 9.2 and 9.3.

The changes at that point seem to have been to change xxx_inner_consistent
and xxx_leaf_consistent from taking
an input containing a Datum pointing to the query conditional to a list of
ScanKeys which each encapsulate one filter condition.

The issue here is that for VP trees (or the BK tree I want to eventually
implement), the filter condition requires two parameters:
- The maximum allowed distance from the target value
- The actual target value.

The ScanKeys struct appears to only be able to contain a conditional type,
and a single parameter, such as "less then *x*", "above *y*",
and so forth. Looking at the existing spgkdtreeproc.c and
spgquadtreeproc.c, their mechanism for passing more complex conditions
through to the xxx_consistent functions appears to be to encapsulate the
entire condition in a custom type. For example,
their mechanism for querying if something is within a certain envelope is
done by having the envelope be described
by the "BOX" type.

As such:
Will compound queries as I describe above basically require a custom type
to make it possible? My (admittedly naive) expectation
is that the eventual query for this index will look something like "SELECT
* FROM example_table WHERE indexed_column <=> target_value < 4;",
with "<=>" being the operator for the relevant distance calculation
(hamming, for the BK tree, numeric for the VP-tree).

The existing VP-tree code appears to not support multiple operators
whatsoever, probably because it was very preliminary.

Thanks!
Connor

On Mon, Oct 30, 2017 at 7:04 PM, Connor Wolf <
connorw(at)imaginaryindustries(dot)com> wrote:

> I was mostly unclear on how I'd go about attaching the extension functions
> to the relevant indexing mechanism. From the looks of the vptree.tar.gz
> file (which is really, *really* helpful, incidentally!), a it's done via a
> custom operator class, which then gets passed to the actual index creation
> mechanism when you're declaring the index.
>
> I think I had looked at that at one point, but it's been a while. In my
> case, I'm using discrete-cosine-transform based perceptual hashes for
> searching. They are nice and compact (64-bits per hash), while still
> producing good search results. I have a dataset of ~36 million images, and
> it does searches in < 50 milliseconds with a hamming distance of 4, while
> touching ~0.25% of the tree (And occupying ~18 GB of ram).
>
> My BK tree is up on github here
> <https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp>,
> if anyone needs something like that (BSD licensed, pretty well tested).
> There's also a python wrapper for it.
>
> I'll probably not have time to poke about until this weekend, but thanks!
> Connor
>
>
>
> On Mon, Oct 30, 2017 at 4:50 AM, Alexander Korotkov <
> a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>
>> Hi!
>>
>> On Sun, Oct 29, 2017 at 12:07 PM, Connor Wolf <
>> wolf(at)imaginaryindustries(dot)com> wrote:
>>
>>> I'm looking at implementing a custom indexing scheme, and I've been
>>> having trouble understanding the proper approach.
>>>
>>> Basically, I need a BK tree, which is a tree-structure useful for
>>> indexing arbitrary discrete metric-spaces (in my case, I'm interested in
>>> indexing across the hamming edit-distance of perceptual hashes, for fuzzy
>>> image searching). I'm *pretty* sure a SP-GiST index is the correct
>>> index type, as my tree is intrinsically unbalanced.
>>>
>>
>> Yes, SP-GiST is appropriate index type for BK tree. I'm pretty sure BK
>> tree could be implemented as SP-GiST opclass.
>> The only thing worrying me is selection pivot values for nodes. SP-GiST
>> builds by insertion of index tuples on by one. First pivot value for root
>> node in SP-GIST would be created once first leaf page overflows. Thus, you
>> would have to select this pivot value basing on very small fraction in the
>> beginning of dataset.
>> As I know, BK tree is most efficient when root pivot value is selected
>> after looking in whole dataset and then hierarchically to subtrees.
>>
>> BTW, did you try my extension for searching similar images. It's quite
>> primitive, but works for some cases.
>> https://github.com/postgrespro/imgsmlr
>> https://wiki.postgresql.org/images/4/43/Pgcon_2013_similar_images.pdf
>>
>> I have a functional stand-alone implementation of a BK-Tree, and it works
>>> very well, but the complexity of managing what is basically a external
>>> index for my database has reached the point where it's significantly
>>> problematic, and it seems to be it should be moved into the database.
>>>
>>
>> Sure, moving this index to the database is right decision.
>>
>> Anyways, looking at the contents of postgres/src/backend/access/spgist,
>>> it looks pretty straightforward in terms of the actual C implementation,
>>> but I'm stuck understanding how to "install" a custom SP-GiST
>>> implementation. There are several GiST indexing implementations in the
>>> contrib directory, but no examples for how I'd go about implementing a
>>> loadable SP-GiST index.
>>>
>>> Basically, my questions are:
>>>
>>> - Is it possible to implement a SP-GiST indexing scheme as a
>>> loadable module?
>>>
>>> Yes, it's possible to define SP-GiST.
>>
>>>
>>> - If so, how?
>>>
>>> The pretty same way as GiST opclass extension. You have to define
>> supporting functions and operators and then define operator class over them.
>>
>>>
>>> - And is there an example I can base my implementation off of?
>>>
>>>
>>>
>>> I'm relatively comfortable with C (much moreso with C++), but I haven't
>>> spent a lot of time looking at the postgresql codebase. I don't think I
>>> could start from a empty folder and make a properly-implemented module in
>>> any reasonable period of time, so if I have a working example for some sort
>>> of index that uses the same interfaces that would really help a lot
>>>
>> I don't think there is an example in PostgreSQL source code tree or on
>> github. But I've attached by early experiment with VP-tree (seems to be
>> pretty same as BK tree) using SP-GiST (see vptree.tar.gz). Basing on this
>> experiment I realized that it's important to select root pivot value basing
>> on the whole dataset. However, for your metric/dataset/queries it might
>> appear to be different.
>>
>> It also would be nice to someday improve SP-GiST to support some global
>> strategies on index creation. In particular, it would allow to resolve
>> selection of pivot values problem that I mention above. Right now my
>> colleagues and me don't have time for that. But I can assist you with
>> advises if you will decide to implement that.
>>
>> ------
>> Alexander Korotkov
>> Postgres Professional: http://www.postgrespro.com
>> The Russian Postgres Company
>>
>>
>> --
>> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers
>>
>>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2017-11-02 04:37:13 Re: Account for cost and selectivity of HAVING quals
Previous Message Masahiko Sawada 2017-11-02 04:04:35 Re: Explicit relation name in VACUUM VERBOSE log