Full text searching, anyone interested?

From: mlw <markw(at)mohawksoft(dot)com>
To: Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject: Full text searching, anyone interested?
Date: 2001-06-02 16:53:15
Message-ID: 3B1919FB.4C7A2795@mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I frequently rant on this about full text searching.

The fulltextsearch package under contrib is an interesting approach. As anyone
rolling their eyes at my frequent posts, know that I am also working on a full
text search system.

The two packages offer two very different approaches.

"/contrib/fulltextsearch" uses a table of word/oid pairs to find words. Such
that a full text search query would look something like this:

select * from foo where fti_foo.string = 'word' and fti_foo.id = foo.oid;

(or something like that)

Mine has, as yet, has not been contributed. I am dubious that it will happen in
the near future for various reasons. However, it is designed around a
completely different strategy, it is more like a search engine.

Like the fti package, my system parses out a table into component words,
however, rather than storing them word/oid pairs, I use an external file format
which stores a searchable dictionary of words. Each word is associated with a
bitmap.

The bitmap contains a bit for each record. Multiple words in a query are parsed
and the bitmaps retrieved. The bitmaps are then combined with boolean
operations (and/or/not).

Each bit (0...count) is associated with an integer key. The integer key can be
an oid or some other unique table ID.

Being an external system, it is implemented using 'C' based extensions to
postgres. A query on my system looks something like:

select * from table, (select ftss_direct('all { word1 word2 }') as id) as
result, where table.id = result.id;

The advantages of fti is that it, by being tied closely to Postgres, updates
automatically on inserts and updates. The disadvantage is that it is relatively
slow to perform complex searches.

The advantage of my system is that it is very fast, a loaded system can search
the bitmap index of roughly 4 words within 4 million records in about 20ms~50ms
depending on the popularity of the words. (Performance lags at startup) The
disadvantage of my system is that it is not designed to be very dynamic, it
relies extensively on the virtual memory management of the system (and is thus
limited), and not being tied too closely to PostgreSQL, has been somewhat
frustrating to getting it working as well as I would like.

I would love to find a way to get a bitmap like index native to Postgres. I
would, in fact, love and expect to do an amount of it. The problem is to do it
"right" will require a lot of work at very low levels of Postgres.

Is anyone interested in pursuing this?
How should it look?
How deeply rooted in postgres does it need to be?

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Vince Vielhaber 2001-06-02 16:56:57 Re: Re: Interesting Atricle
Previous Message Bruce Momjian 2001-06-02 16:39:41 Re: Re: AW: [HACKERS] Re: Support for %TYPE in CREATE FUNCTION