Re: Query advice request

From: Mladen Gogala <mladen(dot)gogala(at)vmsinfo(dot)com>
To: Francisco Leovey <fleovey(at)yahoo(dot)com>, "pgsql-novice(at)postgresql(dot)org" <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Query advice request
Date: 2010-10-09 23:07:40
Message-ID: 4CB0F5BC.5080308@vmsinfo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

Francisco Leovey wrote:
> Well, it is a good enough implementation for us, a very large legal
> document database.
> We are forced by government policy to only use free software.
> Full phrase search can be simulated easily with or's to extract the
> documents.
> And we have proximity search done by the application.
>
I opted for Sphinx which is also free software. It has connector for
PostgreSQL and works perfectly. It can index 55 million documents
without a problem:

Sphinx 1.10-id64-beta (r2420)
Copyright (c) 2001-2010, Andrew Aksyonoff
Copyright (c) 2008-2010, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/usr/local/etc/sphinx.conf'...
indexing index 'mover'...
collected 55616085 docs, 101706.4 MB
sorted 15298.1 Mhits, 100.0% done
total 55616085 docs, 101706364580 bytes
total 18132.419 sec, 5609089 bytes/sec, 3067.21 docs/sec
total 6705 reads, 647.896 sec, 6460.7 kb/call avg, 96.6 msec/call avg
total 23240 writes, 307.031 sec, 4144.8 kb/call avg, 13.2 msec/call avg
[Sat Oct 9 05:03:20.204 2010] [32745] using config file
'/usr/local/etc/sphinx.conf'...
[Sat Oct 9 05:03:20.230 2010] [32745] listening on all interfaces,
port=9312
Sphinx 1.10-id64-beta (r2420)

This index contains 55 million documents and takes 100GB of space. It's
not integrated with Postgres the way Tsearch2 is, but it has PHP and
Perl API's and all the capabilities I need. The biggest problem was not
with the index but with the database and optimizing the SQL accessing a
huge partitioned table. To that end, I even upgraded to Postgres 9.0,
which can resolve conditions like 'IS NOT NULL' by using indexes. The
3rd software I mentioned is Lucene, produced by the Apache foundation,
with the price equal to the price of Sphinx and Lucene. Here is my table:

psql (9.0.1)
Type "help" for help.

news=# set search_path=moreover;
SET
news=# select count(*) from moreover_documents;
count
----------
55892429
(1 row)

There is a text column, which is indexed, and I need all the usual
search options: phrases, proximity, quorum and alike.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com

Browse pgsql-novice by date

  From Date Subject
Next Message e-letter 2010-10-10 19:51:58 Re: permissions failure to copy csv data
Previous Message Mladen Gogala 2010-10-09 18:23:51 Re: Query advice request