optimizing a cpu-heavy query

From: Joel Reymont <joelr1(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: optimizing a cpu-heavy query
Date: 2011-04-26 15:16:19
Message-ID: FE56AC3F-6398-41FC-B9D0-B5BEF295A745@gmail.com
Lists: pgsql-performance

Folks,

I'm trying to optimize the following query, which computes the KL divergence [1]. As you can see, the distance function operates on vectors of 150 floats.
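
For reference, the per-row quantity that the function at the end of this message computes can be written out as a formula. The symbols p (the query vector, vec) and q (a stored document vector, topics) are shorthand introduced here, not names from the schema. The sum skips components where q_i is not positive, matching the WHERE clause in the function:

```latex
D(p \,\|\, q) = \sum_{i \,:\, q_i > 0} p_i \ln\frac{p_i}{q_i}, \qquad p, q \in \mathbb{R}^{150}
```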

The query takes 12 minutes to run on an idle (apart from pgsql) EC2 m1 large instance with 2 million documents in the docs table. The CPU is pegged at 100% during this time. I need to be able to both process concurrent distance queries and otherwise use the database.

I have the option of moving this distance calculation out of PG, but are there other options?

Am I doing anything clearly wrong here?

Would it speed things up to make the float array a custom data type backed by C code?

Thanks in advance, Joel

[1] http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

---

CREATE DOMAIN topics AS float[150];
CREATE DOMAIN doc_id AS varchar(64);

CREATE TABLE docs
(
    id serial,
    doc_id doc_id NOT NULL PRIMARY KEY,
    topics topics NOT NULL
);

CREATE OR REPLACE FUNCTION docs_within_distance(vec topics, threshold float)
RETURNS TABLE(id doc_id, distance float) AS $$
BEGIN
    RETURN QUERY
    SELECT *
    FROM (SELECT doc_id,
                 (SELECT sum(vec[i] * ln(vec[i] / topics[i]))
                  FROM generate_subscripts(topics, 1) AS i
                  WHERE topics[i] > 0) AS distance
          FROM docs) AS tab
    WHERE tab.distance <= threshold;
END;
$$ LANGUAGE plpgsql;

--------------------------------------------------------------------------
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
---------------------+------------+---------------------------------------
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
---------------------+------------+---------------------------------------
