Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )

From: Eric Ridge <ebr(at)tcdi(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )
Date: 2004-01-08 23:16:28
Message-ID: B021131D-4230-11D8-ADB4-000A95BB5944@tcdi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Thanks to everyone that provided input about this idea. After taking
everything into consideration and talking with Olly Betts from the
Xapian project, what I'm going to do is implement a btree that supports
multiple roots and an "tag" value of arbitrary length for each item in
the tree. Then implement a new Xapian backend that uses it. In the
end, it'll be much more efficient and much less dirty than this silly
filesystem thing I have now. And hopefully, concurrency and WAL will
be easier to deal with too. Having the existing nbtree AM as guide
will be very useful.

I have one issue that could potentially make all of this completely
useless, and it's related to Postgres deciding to use a Filter instead
of an Index Scan.

I asked this question last week, and Tom Lane responded with a
solution, but I don't think I explained myself very well. And now that
I've got a bit more experience with some of the PG internals, maybe I
can ask the question more intelligently.

Given this query:
select id from table where document_text ==> 'food' and title ==>
'water';

Postgres generates this plan:
Index Scan using idxtitle on table (cost=0.00..4.01 rows=1 width=8)
Index Cond: (title ==> 'water'::text)
Filter: (document_text ==> 'food'::text)

The problem is, the "document_text" column is *really* big. Average
value length is 171k.

With this query plan, my operator procedure is forced to re-parse the
document_text column from each row returned by the index scan against
the title column, and do a bunch of Xapian tricks for each row. The
overhead is huge.

The composite index solution doesn't seem ideal here because there's
absolutely no way I can anticipate every combination of fields a user
might choose to search.

Forgetting about composite indexes for a moment, is postgres even
capable of doing 2 index scans in this situation? Does it know how do
take the intersection of two scans?

My AM's cost estimate function literally sets the selectivity,
correlation, total cost, and startup cost values to zero (and I've
tried tons of combinations of really big, really small, and really
negative values). My thought behind setting them all to zero was that
if the cost function basically says, "There is no cost", I could fool
Postgres into wanting to use the index everywhere it can. Sadly, this
doesn't work out.

Now, I realize I can fix my cost estimate function to return better
costs for the title and document_text fields so that PG will instead
decide to index scan on document_text, but what I really want to do is
make PG do an index scan for each field.

Can PG even do this?

I'd appreciate any thoughts on this. Hopefully, I'm just missing
something really obvious.

thanks!

eric

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ron St-Pierre 2004-01-08 23:24:41 Re: 7.4, 'group by' default ordering?
Previous Message Eric Freeman 2004-01-08 23:13:19 Re: 7.3.3 drop table takes very long time

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2004-01-09 00:10:07 Re: RFC: bufmgr locking changes
Previous Message Shachar Shemesh 2004-01-08 20:04:56 OLE DB driver