Skip site navigation (1) Skip section navigation (2)

Re: Query advice request

From: Mladen Gogala <mladen(dot)gogala(at)vmsinfo(dot)com>
To: Francisco Leovey <fleovey(at)yahoo(dot)com>, "pgsql-novice(at)postgresql(dot)org" <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Query advice request
Date: 2010-10-09 23:07:40
Message-ID: 4CB0F5BC.5080308@vmsinfo.com (view raw or flat)
Thread:
Lists: pgsql-novice
Francisco Leovey wrote:
> Well, it is a good enough implementation for us, a very large legal 
> document database.
> We are forced by government policy to only use free software.
> Full phrase search can be simulated easily with or's to extract the 
> documents.
> And we have proximity search done by the application.
>
I opted for Sphinx which is also free software. It has connector for 
PostgreSQL and works perfectly. It can index 55 million documents 
without a problem:

Sphinx 1.10-id64-beta (r2420)
Copyright (c) 2001-2010, Andrew Aksyonoff
Copyright (c) 2008-2010, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/usr/local/etc/sphinx.conf'...
indexing index 'mover'...
collected 55616085 docs, 101706.4 MB
sorted 15298.1 Mhits, 100.0% done
total 55616085 docs, 101706364580 bytes
total 18132.419 sec, 5609089 bytes/sec, 3067.21 docs/sec
total 6705 reads, 647.896 sec, 6460.7 kb/call avg, 96.6 msec/call avg
total 23240 writes, 307.031 sec, 4144.8 kb/call avg, 13.2 msec/call avg
[Sat Oct  9 05:03:20.204 2010] [32745] using config file 
'/usr/local/etc/sphinx.conf'...
[Sat Oct  9 05:03:20.230 2010] [32745] listening on all interfaces, 
port=9312
Sphinx 1.10-id64-beta (r2420)

This index contains 55 million documents and takes 100GB of space. It's 
not integrated with Postgres the way Tsearch2 is, but it has PHP and 
Perl API's and all the capabilities I need. The biggest problem was not 
with the index but with the database and optimizing the SQL accessing a 
huge partitioned table. To that end, I even upgraded to Postgres 9.0, 
which can resolve conditions like 'IS NOT NULL'  by using indexes. The 
3rd software I mentioned is Lucene, produced by the Apache foundation, 
with the price equal to the price of Sphinx and Lucene. Here is my table:

psql (9.0.1)
Type "help" for help.

news=# set search_path=moreover;
SET
news=# select count(*) from moreover_documents;
  count  
----------
 55892429
(1 row)

There is a text column, which is indexed, and I need all the usual 
search options: phrases, proximity, quorum and alike.

-- 
Mladen Gogala 
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com 


pgsql-novice by date

Next:From: e-letterDate: 2010-10-10 19:51:58
Subject: Re: permissions failure to copy csv data
Previous:From: Mladen GogalaDate: 2010-10-09 18:23:51
Subject: Re: Query advice request

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group