Re: Question regarding contrib/fulltextindexing

From: Andrew McMillan <andrew(at)catalyst(dot)net(dot)nz>
To: Derek Barrett <derekbarrett(at)graffiti(dot)net>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Question regarding contrib/fulltextindexing
Date: 2002-07-14 23:39:12
Message-ID: 1026689952.5041.26.camel@kant.mcmillan.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

On Mon, 2002-07-15 at 10:41, Derek Barrett wrote:
> Thanks Andrew, that solution seems to be easier in my mind.
>
>
> Okay, so I create a lookup table for the description field.
>
> CREATE TABLE lookup_description (
>
> pk integer
> id integer
> search_word varchar(50)
>
> )
>
> (pk is the primary key, id being a foreign key)
>
>
> (By the way, any recommendation in sizing the search_word field? Are there strings that are large enough that aren't worth indexing?)
>
> My user INSERTS the following string:
>
> "The quick brown fox jumped over the moon and another fox."
>
> In my code, I use a for loop, and take the text string and put it into an array, and fill up the lookup_description table. Of course, I will create a noisewords filter to remove words like the, a, an from this list. What about duplicate words? Should those be filtered out as well? In this example, fox is duplicated. I would assume that leaving in the duplicates, might be useful later if I decide to implement a relevency-type of search engine. (Rank the results based on how many times, "fox" is found).
>
> TABLE lookup_description
>
> pk id search_word
> -- -- -----------
> 1 1 quick
> 2 1 brown
> 3 1 fox
> 4 1 jumped
> 5 1 over
> 6 1 moon
> 7 1 another
> 8 1 fox
>
> Then I can create an index on the search_word column. Later when I do my SELECT query, I will join this lookup query to the main query.
>
> Is that the idea?

Yes, that's the basic idea. A few points though:
1) the index table doesn't need a special 'primary key' field - what
value does that add?

2) the search_word can be defined as 'text', so you don't need to
specify the maximum length.

3) Realistically there is little advantage searching for words longer
than (say) 15 characters, but they also appear very infrequently. You
have to ensure that your search functionality does the same sort of
transformations to searched for words as it did when inserting words
into the index table. This means that if your insert truncates, or
doesn't insert some words in a stop list (if, of, a ...) then your
search functionality needs to do the same.

4) I usually have a stop_list of words that don't get indexed too:
CREATE TABLE stop_list (
stop_word TEXT PRIMARY KEY
);
I usually load this into a hash in my perl script that is doing the
loading, but it can be more complex to deal with this appropriately in
the actual searching code - you need a first pass through the search
list to remove stop_list words, and a second pass to build the search
query.

5) you need to create an index on search_word:
CREATE INDEX lookup_description_srch
ON lookup_description ( search_word );

Hope this helps,
Andrew.
--
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201 MOB: +64(21)635-694 OFFICE: +64(4)499-2267
Are you enrolled at http://schoolreunions.co.nz/ yet?

In response to

Browse pgsql-novice by date

  From Date Subject
Next Message Derek Barrett 2002-07-15 03:07:42 Re: Question regarding contrib/fulltextindexing
Previous Message Derek Barrett 2002-07-14 22:41:02 Re: Question regarding contrib/fulltextindexing