Re: Designing an extension for feature-space similarity search

From: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-16 18:18:24
Message-ID: 4F3D4870.3020704@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:
> Jay Levitt<jay(dot)levitt(at)gmail(dot)com> writes:
>> - I'm not sure how to represent arbitrary column-like features without
>> reinventing the wheel and putting a database in the database.
>
> ISTM you could define a composite type and then create operators and an
> operator class over that type. If you were trying to make a btree
> opclass there might be a conflict with the built-in record_ops opclass,
> but since you're only interested in GIST I don't see any real
> roadblocks.

Perfect. Composite types are exactly what I need here; the application can
declare its composite type and provide distance functions for each member,
and the extension can use those to calculate similarity. How do I introspect
the composite type's pg_class to see what it contains? I assume there's a
better way than SPI on system catalogs :) Should I be using systable_*
functions from genam, or is there an in-memory tree? I feel like funcapi
gets me partway there but there's magic in the middle.

Can you think of any code that would serve as a sample, maybe whatever
creates the output for psql's \d?

> The main potential disadvantage of this is that you'd have
> the standard tuple header as overhead in index entries --- but maybe the
> entries are large enough that that doesn't matter, and in any case you
> could probably make use of the GIST "compress" method to get rid of most
> of the header. Maybe convert to MinimalTuple, for instance, if you want
> to still be able to leverage existing support code for field extraction.

Probably not worth it to save the 8 bytes; we're starting out at about 20
floats per row. But good to know for later optimization...

>
>> - Can domains have operators, or are operators defined on types?
>
> I think the current state of play is that you can have such things but
> the system will only consider them for exact type matches, so you might
> need more explicit casts than you ordinarily would. However, we only
> support domains over base types not composites, so this isn't really
> going to be a profitable direction for you anyway.

Actually, as mentioned to Alexander, I'm thinking of domains per feature,
not for the overall tuple, so birthdate<->birthdate differs from
now()<->posting_date. Sounds like that might work - I'll play.
>
>> - Does KNN-GiST run into problems when<-> returns values that don't "make
>> sense" in the physical world?
>
> Wouldn't surprise me. In general, non-strict index operators are a bad
> idea. However, if the indexed entities are records, it would be
> entirely your own business how you handled individual fields being NULL.

Yeah, that example conflated NULLs in the feature fields (we don't know your
birthdate) with <-> on the whole tuple. Oops.

I guess I can just test this by verifying that KNN-GiST ordered by distance
returns the same results as without the index.

Thanks for your help here.

Jay

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-02-16 18:29:10 Re: patch for parallel pg_dump
Previous Message Dimitri Fontaine 2012-02-16 17:42:26 Re: Command Triggers