Designing an extension for feature-space similarity search

From: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Designing an extension for feature-space similarity search
Date: 2012-02-15 20:34:49
Message-ID: 4F3C16E9.90808@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

[Preamble: I've been told that the hackers list is appropriate for
extension-related topics like this, even if it's not about contributing to
core. If I'm misappropriating, please let me know.]

Goal: Personalized, context-relevant query results

We are building a deeply personalized site; think "OKCupid for product
recommendations" or "Pinterest for people with your tastes". We use psych
research to measure and predict your personality and traits along a number
of scales (dimensions), and then we connect you with people, products and
content we think you'll like.

I won't go into the design history, but you can read a little here:

http://parapoetica.wordpress.com/2012/02/15/feature-space-similarity-search-in-postgresql/

Suffice to say, this ends up needing something like KNN-GiST cubes, only:

- The overall concept is more like N-dimensional vectors than cubes
- But a dimension might be in any domain, not just floats
- All vectors have the same number of dimensions with the same meanings
- The distance along each dimension is a domain-specific function
- NULLs are allowed (the distance function will handle the semantics)
- The distance between two vectors is a function that aggregates the
distances of each dimension, along with arbitrary other arguments - for
instances, it might take the weighted average of the dimensions

That aggregation (which may not literally be an aggregate; I'm not sure yet)
needs to happen in a SELECT list, which means it needs to be fast, which
means all this (or at least much of it) has to be C.

The "simplest thing that works" is probably to hack up the cube extension,
declare that everything (except inner pages) must be a zero-volume cube
(cube_is_point()), map our non-float features onto floats somehow, and
hard-code all the distance functions and the aggregation function.

But I think this sort of similarity-search engine has general utility, and I
also want to make it easy for us to add and subtract dimensions without too
much pain; that should be DDL, not code. So thinking about how this might
evolve...

- I'm not sure how to represent arbitrary column-like features without
reinventing the wheel and putting a database in the database. hstore only
stores text, probably for this reason; I took a look at the earlier json
patch and saw that it handled only a few core data types. Have there been
any other PoCs that involved collections of hetereogenous data? I almost
want an actual instance of an "anyarray".

- Alternatively, is there a way to index an entire, arbitrary row, rather
than on a column on that row? I'm fine with this extension requiring its own
table, so I leave the data where it is in the row, and only worry about
indexing it. I can't just use functional indexes, because I'll need to
provide operators and support functions to GiST. Maybe I have a fake
sentinel column, where all the operators use SPI to introspect the row,
treat each column as a feature dimension, call the underlying operators on
each column's data type, etc.

- Can domains have operators, or are operators defined on types?

- Does KNN-GiST run into problems when <-> returns values that don't "make
sense" in the physical world? For instance, let's say NULL <-> NULL returns
a distance of 1.0. That means that NULL1 <-> NULL2 = 1.0, and NULL2 <->
NULL3 = 1.0, but NULL1 <-> NULL3 = 1.0 as well. I think that's fine - that
could even describe a triangle - but my spidey sense is tingling on this.

- Are there previous discussions, patches, abandoned projects, etc. that
this reminds you of and that I should go research?

Thanks for any thoughts, and I'd love collaborators or even mentors - we
plan to open source whatever we produce here, and I don't have quite the
theoretical background it takes to do this properly.

Jay Levitt

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-02-15 20:54:14 Re: [COMMITTERS] pgsql: Speed up in-memory tuplesorting.
Previous Message Dimitri Fontaine 2012-02-15 20:25:08 Re: Command Triggers