From: Gordon Mohr <gojomo-pgsql(at)xavvy(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: high-dimensional knn-GIST tests (was Re: Cube extension kNN support)
Date: 2013-10-24 01:50:34
Message-ID: 52687CEA.9060903@xavvy.com

On 9/22/13 4:38 PM, Stas Kelvich wrote:
> Hello, hackers.
>
> Here is the patch that introduces kNN search for cubes with
> euclidean, taxicab and chebyshev distances.

Thanks for this! I decided to give the patch a try at the bleeding edge
with some high-dimensional vectors, specifically the 1.4 million
1000-dimensional Freebase entity vectors from the Google 'word2vec' project:

https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming

Unfortunately, here's what I found:

* with 1000-dimensional vectors, the index size on disk grows to many
times (1000x or more) the size of the indexed data, making other tests
of the index's helpfulness impractical. (Perhaps other parameter tuning
can help?)

* with 500-dimensional or 100-dimensional vectors, the index size is
more manageable -- 2x to 6x the data size -- but using the index
significantly hurts performance on top-10 nearest-neighbor searches,
making it much slower than a full table scan. (The planner still tries
to use the index, even though it's hurting.)

Much more detail for the curious is below...

Regarding the dataset & hoped-for results:

The word2vec project's example scripts do their synonym/analogy
demonstrations by loading the entire 5GB+ dataset into main memory
(~3min), then doing a full scan of all vectors (~40sec) to find those
nearest a target vector.

My motivating question was: could the data be loaded into PostgreSQL
using the cube type and indexed with kNN-GIST via this patch, to enable
speedy index-assisted ranked-nearest-neighbor queries from the database?
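
Concretely, the hoped-for query shape was something like the following,
where the ORDER BY ... <-> ... LIMIT form lets a kNN-GIST index return
neighbors incrementally instead of scanning and sorting everything:

    SELECT word
    FROM word2vec
    ORDER BY vec <-> (SELECT vec FROM word2vec WHERE word = 'geoffrey hinton')
    LIMIT 10;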

(While the patch's distance_euclid is not the same cosine-distance the
word2vec examples use, the freebase entity vectors are all unit vectors,
and even additions of vectors can be scaled to unit length. My intuition
is that euclidean distances on the unit sphere will be in the same rank
order as cosine distances, so cube's distance_euclid/<-> should enable
the same sort of synonym/analogy demos.)
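
For the record, the algebra behind that intuition: for unit vectors a
and b,

    ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b) = 2 - 2*cos(a,b)

so euclidean distance grows monotonically as the cosine similarity
cos(a,b) falls, and nearest-by-euclidean is exactly nearest-by-cosine.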

Regarding the 1000-dimensional case:

It was necessary to change three compiled-in implementation limits
(BLCKSZ twice); here are the four steps at which the need for each
change became evident:

(1) in contrib/cube/cubedata.h, increase CUBE_MAX_DIM (to accept vectors
with more than 100 dimensions)

(2) in src/include/pg_config.h, increase BLCKSZ to 16384 (otherwise 1000
64-bit floats in a single column gave an <ERROR: row is too big>,
mentioning the 8160 limit - large cube values not TOASTable?)

(3) in src/include/access/itup.h, increase INDEX_SIZE_MASK to 0x3fff
(otherwise encountering an <ERROR: index row requires 16016 bytes,
maximum size is 8191> when attempting to create the index)

(4) in src/include/pg_config.h, again increase BLCKSZ, now to 32768
(otherwise encountering an <ERROR: index row size 16416 exceeds maximum
5440 for index "pg_class_relname_nsp_index"> when attempting to create
the index)
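
Concretely, the final state of the three edits looks like this (I show
1024 for CUBE_MAX_DIM as an illustration; any value >= 1000 would do):

    /* contrib/cube/cubedata.h -- default is 100 */
    #define CUBE_MAX_DIM (1024)

    /* src/include/pg_config.h -- default is 8192; 16384 proved too small */
    #define BLCKSZ 32768

    /* src/include/access/itup.h -- default is 0x1FFF, i.e. the 8191 above */
    #define INDEX_SIZE_MASK 0x3FFF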

With the cube-kNN patch applied and these other changes, I was able to
import the 1.4M freebase vectors and do a full-scan nearest-neighbors
query. (My starting postgresql codebase was the github mirror of 9.4dev
as of about a week ago.)

The psql transcript:

word2vec=# \timing
Timing is on.
word2vec=# CREATE EXTENSION cube;
CREATE EXTENSION
Time: 42.141 ms
word2vec=# CREATE TABLE word2vec ( word varchar(200), vec cube );
CREATE TABLE
Time: 8.532 ms
word2vec=# COPY word2vec FROM PROGRAM 'zcat
/tmp/pgvectors/freebase-vectors-skipgram1000.pgtxt.gz';
COPY 1422903
Time: 12399065.498 ms
word2vec=# SELECT word, dist FROM (SELECT word,
distance_euclid(vec,(SELECT vec FROM word2vec WHERE word='geoffrey
hinton')) AS dist FROM word2vec) AS subquery ORDER BY dist LIMIT 11;
          word            |       dist
--------------------------+------------------
 geoffrey hinton          |                0
 marvin minsky            | 1.03892498287268
 paul corkum              | 1.05221701690288
 william richard peltier  | 1.06244397334495
 brenda milner            | 1.06387762685894
 john charles polanyi     | 1.07444446452295
 leslie valiant           | 1.07735786596934
 hava siegelmann          | 1.08148623006629
 hans moravec             |  1.0862034591185
 david rumelhart          | 1.08748431130477
 godel prize              | 1.08774264379264
(11 rows)

Time: 310753.976 ms

That's 3.5 hours to do the import and 5 minutes to do the query; this is
on a 2009 MacBook Pro with 8GB RAM and SSD.

Confirming the intuition above, these 10-nearest are the same entities
in the same order as on the word2vec project page's example output,
though the euclidean distances are of course different than the cosine
distances.

The DATA directory is 23GB after the import of 1.4 million rows. In the
word2vec uncompressed binary format, this dataset is about 5.4GB, so
this word2vec-table cube-column representation involves about 4X expansion.
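
That 4X is consistent with the representations involved, if I have the
cube layout right: word2vec's binary format stores 4-byte floats
(1000 x 4 = ~4KB per vector), while a cube value here holds 8-byte
float8 coordinates for both corners even when they coincide
(2 x 1000 x 8 = 16KB per value, which also matches the 16016-byte index
row in the error above), and 16KB / 4KB = 4X.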

So, to the main question: can that query be sped up by building a
kNN-GIST index? Here the problems start.

word2vec=# CREATE INDEX word2vec_index ON word2vec USING gist (vec);

This attempt ran for hours, consuming another 80GB+ before failing due
to disk full.

In fact, the largest table with 1000-dimensional vectors for which I was
able to build a gist index was a mere 100 rows. That index-build took
about 14 minutes and grew the DATA directory like so:

73MB pgdata # empty word2vec table
75MB pgdata # after COPY FROM of 100 vectors
8.4GB pgdata # after CREATE INDEX
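
Some arithmetic on that: the COPY added ~2MB of heap for 100 vectors
(~16KB of cube data per row plus overhead, as expected), while the
CREATE INDEX added ~8.3GB -- on the order of 5000x the indexed data, for
an index whose leaf level should only need about a hundred 16KB tuples.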

Of course with just 100 rows the index isn't practically needed or
helpful for query speed-up.

Even trying just 500 rows, the CREATE INDEX command ran for hours before
failing by consuming all available disk space (about 90GB more). It
seems the knn-GIST index overhead for 1000-dimensional cubes grows
faster than linearly in the number of rows.
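
One possibly relevant observation: with ~16KB index tuples and BLCKSZ at
32768, a GiST page can hold at most two entries, so the tree has
essentially no fanout. I'd speculate that page splits under those
conditions go pathological, but I haven't traced where the space
actually goes.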

I was able to complete index builds with fewer dimensions.

100 dimensions, most-frequent 850K entities:

The dataset was trimmed to the first (most-frequent) 850K entities, and
each vector was truncated to its first 100 dimensions. (Even though this
would fit within the usual CUBE_MAX_DIM/BLCKSZ/INDEX_SIZE_MASK limits, I
still used the customized build with the limits expanded for the 1000d
case.)

Observed data sizes and operation times were:

73MB pgdata # empty word2vec table
2.6GB pgdata # after COPY FROM of 850K vectors, taking 216s

before indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~2s

5.0GB pgdata # after CREATE INDEX, taking 1344s

after indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~57s # "Index Scan using word2vec_index…"

So the availability of the index causes a significant slowdown... and
the planner does not choose the faster full sequential scan.
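
(For anyone reproducing this, the comparison is easy to force with the
standard planner GUCs, e.g.

    SET enable_indexscan = off;
    EXPLAIN ANALYZE
    SELECT word FROM word2vec
    ORDER BY vec <-> (SELECT vec FROM word2vec WHERE word = 'geoffrey hinton')
    LIMIT 11;

which should fall back to a top-N sort over a sequential scan, matching
the ~2s full-scan timings above.)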

500 dimensions, most-frequent 100K entities:

The dataset was trimmed to the first (most-frequent) 100K entities, and
each vector was truncated to its first 500 dimensions. (Again, still
using the custom build with the upped
CUBE_MAX_DIM/BLCKSZ/INDEX_SIZE_MASK.)

Observed data sizes and operation times were:

73MB pgdata # empty word2vec table
1.6GB pgdata # after COPY FROM of 100K vectors, taking 266s

before indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~2s

4.8GB pgdata # after CREATE INDEX, taking 977s

after indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~46s # "Index Scan using word2vec_index…"

Dropping the index makes the <-> query fast again.

Open questions and tentative conclusions:

The massive knn-GIST index overhead for 1000-dimensional vectors makes
it hard to evaluate whether a fully-built index could be useful on large
datasets.

Perhaps some aspects of the knn-GIST support implicitly assume
low-dimensional (2-4) data, and large numbers of dimensions cause
pathological index sizes?

Or perhaps something specific to this dataset (all vectors on the unit
sphere) makes it a challenging case?

In the truncated 100d or 500d cases, the index can be built, but slows
rather than speeds nearest-neighbor queries that use the index.

The evaluated patch's definitions of distance_euclid and the other
support functions seem straightforward, and they give proper results on
simple test cases... so the problem, if any, would appear to lie in the
general distance-driven knn-GIST indexing and query planning.

If there is a class of cube-represented dimensional data for which this
indexing helps, it may involve far fewer than 100 dimensions, and/or far
more rows than the 100K-850K tried here.

If a knn-GIST implementor/expert has suggestions for tuning the index
overhead and behavior -- perhaps a different penalty or picksplit
function? -- I'd be happy to try those out and report back.

Otherwise, I hope this write-up will help evaluate the patch or save
time for others tempted to try similar knn-GIST indexing of
higher-dimensional data.

- Gordon Mohr
