Full Text Search ideas

From: Howard Rogers <hjr(at)diznix(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Full Text Search ideas
Date: 2010-07-17 23:28:32
Message-ID: AANLkTikGeqmxcmaGZ7rqsUMNaXNzAQQAqg47729jveD1@mail.gmail.com
Lists: pgsql-general

I asked recently about a performance problem I'd been having with some
full text queries, and got really useful help that pointed me to the
root issues. Currently, I'm trying to see if our document search
(running on Oracle Text) can be migrated to PostgreSQL, and the reason
I asked that earlier question points to a fundamental design issue
we'll have with PostgreSQL that doesn't affect us in Oracle (not, I
hasten to add, that this means Oracle is better or righter or whatever;
it's just different, but the difference will cause us a problem).

Consider the following example (which is just one of 40-odd I could
have picked).

Some of our documents are in panoramic format, for example. But not
many (say, 30,000 out of 10,000,000). We have a flag for 'panoramic',
called 'sb12'. It's either 'y' or 'n' for any document. So a search
for 'sb12n' (find me all documents which are not panoramic) is
logically the same as a search for 'not sb12y'. However, 95% or more
of documents will be an sb12n, because hardly any documents are
panoramic in the first place. So, although the numeric outcome of
'sb12n' and 'not sb12y' will always be the same, you would have to
check the entire table to find which ones are 'sb12n' (because most
documents are marked that way), whereas you'd only have to check the
5% or so of records marked 'sb12y', because so few are marked that way.
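To make the equivalence concrete, here is roughly what the two predicates look like in PostgreSQL terms, using the search_rm table and textsearch column from my tests further down (the sb12y/sb12n token scheme is just our own convention for embedding attributes in the indexed text):

```sql
-- These two predicates always return the same rows, because every
-- document carries exactly one of the tokens 'sb12y' or 'sb12n':
SELECT count(*) FROM search_rm
WHERE to_tsvector('english', textsearch) @@ to_tsquery('english', 'sb12n');

SELECT count(*) FROM search_rm
WHERE to_tsvector('english', textsearch) @@ to_tsquery('english', '!sb12y');
```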

But in Oracle Text, this doesn't seem to happen:

SQL> select count(*) from search_digital_rm where
contains(textsearch,'bat and sb12n')>0;

 COUNT(*)
----------
     3040

Elapsed: 00:00:00.10

SQL> select count(*) from search_digital_rm where
contains(textsearch,'bat not sb12y')>0;

 COUNT(*)
----------
     3040

Elapsed: 00:00:00.06

In both cases, the same number of records are returned. But, within a
margin of error, the time taken to do each test is about the same.
Even though the first test must be matching 'sb12n' for many millions
of records, it's taking not much longer than the search for 'sb12y',
which can only match about 90,000. It would seem (I can't tell from
the explain plan itself) as though what's happened is that the set of
'bat' records has been fetched first (about 8000 in all). For so few
records, whether you're looking for sb12y or sb12n then becomes mostly
irrelevant for timing purposes, and hence the duration equivalence of
both queries.

This is not what happens in PostgreSQL, however (as I now know, thanks
to the help on my question from a couple of days ago):

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n');
count
-------
 3849
(1 row)

Time: 408.962 ms

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & !sb12y');
count
-------
 3849
(1 row)

Time: 11.533 ms

Now, one test takes about 40 times longer than the other, though the
one taking just 11ms is as fast as Oracle can manage (impressive,
considering I've done absolutely nothing to tune this PostgreSQL
testbed as yet!). Logically equivalent the two tests may be, but
hunting through lots of sb12n records and working out which are
related to bats is apparently a lot slower than finding things the
other way around.
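For what it's worth, the two plans can be compared directly with EXPLAIN ANALYZE (standard PostgreSQL; table and column names are as in my tests above), which shows how many rows each index condition actually touches:

```sql
-- Compare the actual row counts and timings of the two forms:
EXPLAIN ANALYZE
SELECT count(*) FROM search_rm
WHERE to_tsvector('english', textsearch) @@ to_tsquery('english', 'bat & sb12n');

EXPLAIN ANALYZE
SELECT count(*) FROM search_rm
WHERE to_tsvector('english', textsearch) @@ to_tsquery('english', 'bat & !sb12y');
```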

I'm wondering firstly if there's any way I can configure PostgreSQL
FTS so that it produces the sort of results we've gotten used to from
Oracle, i.e., where search speeds do not go up wildly when a 'search
term' is applied that happens to be used by the vast majority of
document records. (For example, we currently allow searches for file
types, where 80% of documents would be Word documents, another 19%
would be PDFs, and the remaining 1% could be pretty much anything
else! We can't have people searching for "definitely want only Word
documents" if that means matching 8 million records and search speeds
shooting to the stratosphere as a result.)

Secondly, I'm open to any suggestions as to how you would organise
things or re-write the SQL so that the "attribute filter" is only
applied to the small subset of records which match the relevant "real
word" search term, if that's what's needed here. In other words, is my
best bet in the earlier examples to fetch *all* "bat" records, and
then nest that query within an outer query that adds a test for a
separate attribute column? Or is something else called for here?
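For instance, something along these lines is what I had in mind, though I don't know if it's the right approach (a sketch only; as I understand it a WITH query is planned separately in current PostgreSQL, so the attribute test should only run over the 'bat' subset, and the 'panoramic' column in the second variant is a hypothetical name for an attribute pulled out of the text):

```sql
-- Sketch: restrict to the 'bat' matches first, then apply the
-- attribute filter only to that small subset.
WITH bat_docs AS (
    SELECT textsearch
    FROM search_rm
    WHERE to_tsvector('english', textsearch) @@ to_tsquery('english', 'bat')
)
SELECT count(*)
FROM bat_docs
WHERE to_tsvector('english', textsearch) @@ to_tsquery('english', 'sb12n');

-- Or, going further: store the attribute as a real column instead of
-- a token inside the indexed text:
-- ALTER TABLE search_rm ADD COLUMN panoramic boolean;
-- SELECT count(*) FROM search_rm
-- WHERE to_tsvector('english', textsearch) @@ to_tsquery('english', 'bat')
--   AND NOT panoramic;
```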

Would appreciate any thoughts on the subject!

Regards
HJR
