fulltext searching via a custom index type

From: Eric Ridge <ebr(at)tcdi(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: fulltext searching via a custom index type
Date: 2003-12-26 19:54:56
Message-ID: 61356C28-37DD-11D8-A406-000A95BB5944@tcdi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

(I started with this addressed to the -general list, but it just
doesn't seem like a general usage topic. If -hackers is the wrong
place, please point me in the right direction).

I've been working on a custom index type (access method) that allows
postgres to do fulltext searching (including phrase and proximity) that
I believe is more flexible and natural than "tsearch2" (no offense to
the tsearch2 guys... it's been a source of inspiration). I also plan
on providing "term browsing" and hit-position information.

I'm using Xapian (www.xapian.org) as the backend fulltext engine, and
I'm trying to learn more of the innards of postgres as I go. I've only
spent about 18-24 hours on this, so it's nowhere near complete, and I
really hesitate to even mention it in a public forum. But if in the
end it doesn't suck it might make a great contrib/ package. *shrug*
who knows?

In a nutshell, I've implemented all the necessary access method
functions to teach postgres about a new index type named "xapian", and
so far, everything is really great. It works just like any other index
type would:

create table test (stuff varchar(255), more_stuff text);
create index idxstuff on test using xapian (stuff);
create index idxmore_stuff on test using xapian (more_stuff);

insert into test (stuff, more_stuff) values ('this is stuff', 'this is
more stuff');
insert into test (stuff, more_stuff) values ('my name is eric ridge',
'i like to drink beer');

select * from test where stuff => 'stuff' or more_stuff => '"drink
beer"';
stuff | more_stuff
-----------------------+----------------------
this is stuff | this is more stuff
my name is eric ridge | i like to drink beer
(2 rows)

All this aside, I've got some questions related to indexes and query
plans.

1) Is it possible for an access method to receive some kind of "DROP
INDEX" notification? As far as I can tell, the answer is "no", but it
can't hurt to ask.

2) When does the query planner decide that it needs to "Filter"
results, and is there any way to turn this off or otherwise fool the
query planner into NOT doing Filters (even if only for certain
operators)?

For example, this query:
select * from test where stuff => 'stuff' AND NOT more_stuff =>
'"drink beer"';
has this plan:
Index Scan using idxstuff on test (cost=0.00..-0.98 rows=250
width=177)
Index Cond: ((stuff)::text => 'stuff'::text)
Filter: (NOT (more_stuff => '"drink beer"'::text))

In this case, postgres is forced to re-parse the contents of the
"more_stuff" field (via the defined procedure for the => operator) for
every row returned by the previous index scan, just so it can determine
if the field contains the phrase 'drink beer' or not. Since so much
overhead is involved here, it would be cool if postgres could somehow
do another index scan.

Maybe there's some way for the operator function to know not only the
Datum value, but the actual field (and ItemPointer)?

3) How does one get the $PGDATA directory? getenv() doesn't seem ideal
since PGDATA can be specified as a command-line argument.

4) Can a function be registered as part of a transaction, pre-commit --
so the function can have an opportunity to abort the transaction. I've
seen RegisterEOXactCallback, but it's not quite what I'm looking for.

5) are there discussions in the archives of this list (or other pgsql-
lists) that discuss fulltext searching that y'all think are worth
reading?

thanks in advance for your time and input!

eric

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Martin Marques 2003-12-26 20:07:01 Re: between
Previous Message Tom Lane 2003-12-26 19:26:37 Re: between

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Meskes 2003-12-26 20:06:28 Docu question
Previous Message markw 2003-12-26 19:19:21 DBT-2 PostgreSQL on STP w/ LVM2