Re: TSearch2 and optimisation ...

From: George Essig <george_essig(at)yahoo(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: TSearch2 and optimisation ...
Date: 2004-09-13 03:03:57
Message-ID: 20040913030357.66842.qmail@web53701.mail.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

--- Herv<inputPiedvache <herve(at)elma(dot)fr> wrote:

> George,
>
> I have well read many pages about this subject ... but I have not found any
> thing for the moment to really help me ...
> What can I do to optimize my PostgreSQL configuration for a special use of
> Tsearch2 ...
> I'm a little dispointed looking the Postgresql Russian search engine using
> Tsearch2 is really quick ... why I can't haev the same result with a
> bi-pentium III 933 and 1Gb of RAM with the text indexation of 1 500 000
> records ?
>
> Regards,
> --
> Herv<inputPiedvache
>
> Elma Ing<inputierie Informatique
> 6 rue du Faubourg Saint-Honor<input> F-75008 - Paris - France
> Pho. 33-144949901
> Fax. 33-144949902
>

Tsearch does not scale indefinitely. It was designed for fast online updates and to be integrated
into PostgreSQL. My understanding is that it uses a bloom filter together with bit string
signatures. Typically, full text searches use inverted indexes, scale better, but are slower to
update.

My understanding is that tsearch has a practical limit of 100,000 distinct word stems or lexemes.
Note that word stems are not words. Word stems are what are actually stored in a tsvector after
parsing and dictionary processing.

The key to making tsearch fast is to keep the number of word stems low. You decrease the number
of word stems by using stop words, various dictionaries, synonyms, and preprocessing text before
it gets to tsearch. You can find what word stems are stored in a tsvector column by using the
stat function. For examples of how to use the stat function, see:

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stat.html

Note that the stat function will take a long time to run on large tables.

Performance tuning must be done on a case by case basis. It can take some time to try different
things and see the change in performance. Each time you try something new, use the stat function
to see how the number of word stems has changed.

The largest project I used tsearch2 on contained 900,000 records. Without performance tuning,
there were 275,000 distinct word stems. After performance tuning, I got it down to 14,000
distinct word stems.

By using the stat function, I noticed some obvious stop words that were very frequent that nobody
would ever search for. For how to use stop words, see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stop_words.html

Also I noticed some strange patterns by looking through all of the word stems.

In one case, strings of 3 to 7 words were joined together with hyphens to indicate category
nesting. Tsearch would store these long hyphenated words intact and also store the stem of each
individual word. I made a judgment call that no one would ever search for the long hyphenated
words, so I preprocessed the text to remove the hyphens.

I also noticed that many of the word stems were alphanumeric IDs that were designed to be unique.
There were many of these IDs in the tsvector column although each ID would occur only once or
twice. I again preprocessed the text to remove these IDs, but created a btree index on a varchar
column representing the IDs. My search form allows users to either search full text using
tsearch2 or search IDs using 'LIKE' queries which use a btree index. For 'LIKE' queries, it was
another matter to get postgres to use the btree index and not use a sequential scan. For this,
see:

http://www.postgresql.org/docs/7.4/static/indexes-opclass.html

Last, I noticed that most users wanted to restrict the full text search to a subset determined by
another column in the table. As a result, I created a multicolumn gist index on an integer column
and a tsvector column. For how to setup a multicolumn gist index, see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/multi_column_index.html

There are no easy answers. Like I said, performance tuning must be done on a case by case basis.

Hope this helps,

George Essig

Browse pgsql-performance by date

  From Date Subject
Next Message Vijay Moses 2004-09-13 04:30:41 Four table join with million records - performance improvement?
Previous Message Christopher Browne 2004-09-13 02:29:00 Re: Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables