Re: Full Text Search 101?

From: Aurynn Shaw <ashaw(at)commandprompt(dot)com>
To: Jonathan <jharahush(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Full Text Search 101?
Date: 2009-11-12 21:29:46
Message-ID: 4AFC7E4A.7080305@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Jonathan;

>
> I posted to this forum once before and was able to receive help.
> Thanks again!
>
> I'm trying to implement full text search capabilities. Basically, I
> have a very simple "data catalog" type of website (http://
> gis.drcog.org/datacatalog), where the user can type in a word or words
> to search for records that match the criteria. I have a table with a
> few fields, but I want to be able to search/index two fields -- the
> "name" of the dataset and the "description." (or more, if I can index
> fields from other tables too, that would be great). I'd like to be
> able to use a full text search to rank the results in terms of
> relevance.

You can set up an index per table, but I've found that having a single
vector table with bridge tables to the data I want to search works well.
This would be in the style of:

create table vectors (
id serial primary key,
vector tsvector not null
);

create table interesting_data (
id serial primary key,
textual text not null
);

CREATE TABLE vector_to_interesting (
v_id int not null references vectors(id),
i_id int not null references interesting_data(id)
);

And then

SELECT id.*
FROM interesting_data itd,
vectors v,
vector_to_interesting itv
WHERE v.id = itv.v_id
AND itv.i_id = itd.id
AND v.vector @@ to_tsquery('searchpattern');

Allowing for some procedures around that that return multiple row types,
to the client software.

>
> What is the best way to handle this? I've read through the PostgreSQL
> documentation and don't quite understand it although I'm trying to
> understand and am 'playing around' with this on a development server.
> Is the default text search configuration enough? I've tried setting my
> own configuration but get errors about not finding .dict dictionary
> files? I have a default install of PostgreSQL 8.4.
>
> Also, I've created a tsvector column and created indexes, but it
> didn't seem to delete stop words from the indexes. Should the stop
> words be indexed?

The default "english" configuration in PostgreSQL should have done this.
Building your own configuration tends towards being a bit more advanced,
and "english" should suit most needs.

>
> Also, I don't quite understand how to create indexes but rank certain
> words as being more important than others, for instance, maybe having
> words that come from the "name" column carrying more importance than
> words coming from the "description" column.

For part of this, there's ts_rank(tsvector, tsquery), as well as the
relative weighting system in the textsearch modules. Unfortunately, I
don't have much experience with the relative weighting, but, ts_rank()
should get you partly there.

I also found an interesting page on relative weighting @
http://www.postgresonline.com/special_feature.php?sf_name=postgresql83tsearch_cheatsheet&outputformat=html

that might be useful for you.

> Finally, I'd like
> "watersheds" to come up when someone searches for "water" so I don't
> think I have this configured properly, because this record is not
> returned.

Well, there's two ways to achieve this.
The first is to manually the stopwords, which is getting into the custom
configuration and a bit on the more advanced side.
The other is that in PG8.4, you can do

to_tsquery('water:*')

which will tell the search parser to do a partial match, which would
return "watershed", in this instance.
My testing has shown it to be a bit slower (30ms, vs 15ms for a
non-partial search), but not egregiously slow.

>
> Is there a good tutorial or maybe documentation that is a bit easier
> to understand? I know my database is far from complicated but I can't
> seem to find a configuration that works well. When I try ranking my
> results, most of the results end up with the same rank, so something
> must be wrong?
>
> Can someone point me in the right direction?
>
> Thanks for the help. I appreciate it.
>

Hope this helps. :)

Regards,
--
Aurynn Shaw

The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564 ext 103
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

ashaw(at)commandprompt(dot)com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Hans-Juergen Schoenig 2009-11-12 22:26:17 Re: [GENERAL] pgday.eu
Previous Message Michael Nacos 2009-11-12 21:02:39 Re: Full Text Search 101?