big data - slow select (speech search)

From: Michal Fapso <michal(dot)fapso(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: big data - slow select (speech search)
Date: 2010-07-01 22:34:53
Message-ID: AANLkTilAMfcbnEpk_7eAr-ikWzFptS4VtvIT0s5RvWmH@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

I have quite a simple query but a lot of data and the SELECT query is
too slow. I will be really grateful for any advice on this.

--------------------------------------------------
The background info:

I work on a speech search engine which differs from text search in
having more words (hypotheses) on the same position and each
hypothesis has some weight (probability) of occurrence.

When a word 'hello' appears in a document 'lecture_1', there is a row
in the table hyps (see below) which contains an array of all positions
of word 'hello' in the document 'lecture_1' and for each position it
contains a weight as well.

I need the positions to be able to search for phrases. However, here I
simplified the query as much as I could without a significant
reduction in speed.

I know there is tsearch extension which could be more appropriate for
this but I didn't try that yet. The size of my data will be the same
which seems to be the issue in my case. But maybe I am wrong and with
tsearch it will be much faster. What do you think?

--------------------------------------------------
Preconditions:

First I cleared the disk cache:
sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Then run the postgresql deamon and with psql client I connected to my
database. The first thing I did then was executing the SELECT query
described below. It took about 4.5 seconds. If I rerun it, it takes
less than 2 miliseconds, but it is because of the cache. I need to
optimize the first-run.

--------------------------------------------------
Hardware:

laptop ASUS, CPU dual core T2300 1.66GHz, 1.5G RAM

--------------------------------------------------
Version:

PostgreSQL 8.4.4 on i686-pc-linux-gnu, compiled by GCC gcc (Ubuntu
4.4.1-4ubuntu9) 4.4.1, 32-bit

compiled from sources, only --prefix=... argument given to ./configure

--------------------------------------------------
Schema:

CREATE TABLE hyps (
docid INT,
wordid INT,
positions INT[],
weights REAL[],
length INT,
total_weight REAL
);
COPY hyps FROM '/home/miso/exp/speech_search/postgresql/sqlcopy/all_weights_clustered.sqlcopy';
CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
CREATE INDEX hyps_docid_index ON hyps USING hash (docid);

shared_buffers = 300MB ...this is the only thing I changed in the config

I tried that also with btree indices instead of hash and surprisingly
the SELECT query was a bit faster. I would expect hash to be faster.

The index on 'docid' column is there because I need to be able to
search also in a particular document or in a set of documents.
--------------------------------------------------
Table info:

- rows = 5490156
- average length of positions vectors = 19.5
- total number of items in positions vectors = 107444304
- positions and weights in one row have the same number of items, but
for each row the number may differ.
- table data are loaded only once (using COPY) and are not modified anymore
- there are 369 various docid and 161460 various wordid
- VACUUM was executed after COPY of data

--------------------------------------------------
Query:

EXPLAIN ANALYZE SELECT h1.docid
FROM hyps AS h1
WHERE h1.wordid=65658;

Bitmap Heap Scan on hyps h1 (cost=10.97..677.09 rows=171 width=4)
(actual time=62.106..4416.864 rows=343 loops=1)
Recheck Cond: (wordid = 65658)
-> Bitmap Index Scan on hyps_wordid_index (cost=0.00..10.92
rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
Index Cond: (wordid = 65658)
Total runtime: 4432.015 ms

The result has 343 rows and there are 9294 items in positions vectors in total.

--------------------------------------------------
Comparison with Lucene:

If I run the same query in Lucene search engine, it takes 0.105
seconds on the same data which is quite a huge difference.

--------------------------------------------------
Synthetic data set:

If you want to try it yourself, here is a script which generates the
data for COPY command. I don't know whether it is possible to send
attachments here, so I put the script inline. Just save it as
create_synthetic_data.pl and run it by 'perl
create_synthetic_data.pl'. With these synthetic data the SELECT query
times are around 2.5 seconds. You can try the SELECT query with
'wordid' equal 1, 2, 3, ...10000.

#!/usr/bin/perl
# Create synthetic data for PostgreSQL COPY.

$rows = 5490156;
$docs = 369;
$words = 161460;
$docid = 0;
$wordid = 0;

for ($row=0; $row<$rows; $row++) {

my $sep = "";
my $positions = "";
my $weights = "";
my $total_weight = 0;
my $items = int(rand(39))+1;

if ($row % int($rows/$docs) == 0) {
$docid++;
$wordid = 0;
}
$wordid++;

for ($i=0; $i<$items; $i++) {
$position = int(rand(20000));
$weight = rand(1);
$positions .= $sep.$position;
$weights .= $sep.sprintf("%.3f", $weight);
$total_weight += $weight;
$sep = ",";
}
print "$docid\t$wordid\t{$positions}\t{$weights}\t$items\t$total_weight\n";
}

If you need any other info, I will gladly provide it.

Thank You for Your time.
Miso Fapso

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Eliot Gable 2010-07-02 00:46:13 Highly Efficient Custom Sorting
Previous Message Rajesh Kumar Mallah 2010-07-01 17:59:20 Re: how to (temporarily) disable/minimize benefits of disk block cache or postgresql shared buffer