Re: OK, does anyone have any better ideas?

From: mlw <markw(at)mohawksoft(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: OK, does anyone have any better ideas?
Date: 2000-12-09 00:48:03
Message-ID: 3A318143.ECB0A019@mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-novice

Tom Lane wrote:
>
> mlw <markw(at)mohawksoft(dot)com> writes:
> > I have a working version of a text search engine. I want to make it work
> > for Postgres (I will be releasing it GPL). It can literally find the
> > occurrence of a string of words within 5 million records in a few
> > milliseconds.
>
> Where are the records coming from? Are they inside the database?
> (If not, why do you care about integrating this with Postgres?)
>
> It seems like the right way to integrate this sort of functionality
> is to turn it into a kind of index, so that you can do
>
> SELECT * FROM mytable WHERE keyfield ~~~ 'search string';
>
> where ~~~ is the name of some operator that is associated with the
> index. The temporary-table approach you are taking seems inherently
> klugy, and would still be awkward even if we had functions returning
> recordsets...

OK, I get the misunderstanding, you are absolutely right it is VERY
kludgy.

It is sort of like a bitmap index, but it is more of a search engine. I
actually have it working on a commercial website. You run a program
periodically (cron job?) that executes a query, the query is then parsed
and an index of words, keys, ranks and phrase meta-data is created. You
also specify which fields in the query should be indexed and which field
will be the "key." (It is not ACID if I understand what they term
means.) The data for the text search need not even be in the database,
as long as the "key" being indexed is.

Then you call search with a string, such as "the long and winding road"
or "software OR hardware AND engineer NOT sales." A few milliseconds
later, a list of key/rank pairs are produced. This is FAR faster than
the '~~~' operator because it never does a full table scan. It is
assumed that the "key" field specified is properly indexed.

If I had a way of getting the key/rank result pair deeper into Postgres,
it would be an amazing platform to make some serious high speed search
applications. Think about a million resumes' online and searchable with
an arbitrary text string to get a list of candidates, powered by
Postgres, handling 100 queries a second.

Right now, the way I have it working is PHP makes the search call and
then executes a query with the first result (highest rank) and returns
the data. If I could get the key/rank pair into postgres as a table or
multiple searches into postgres as a set of tables, then you could do
some amazing queries really really fast.

Still, you said that "select foo from bar where key = textsearch('bla
bla',..)" could not be done, and my previous example was the only other
way I have been able to even prototype my idea.

--
http://www.mohawksoft.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2000-12-09 00:55:00 Re: Re: COPY BINARY file format proposal
Previous Message Tom Lane 2000-12-09 00:14:38 Re: 7.0.3(nofsync) vs 7.1

Browse pgsql-novice by date

  From Date Subject
Next Message mlw 2000-12-09 01:17:34 Re: OK, does anyone have any better ideas?
Previous Message Joel Burton 2000-12-08 19:51:52 Re: Buncho-Text during commands.