Postgres + Xapian (was Re: fulltext searching via a custom index type )

From: Eric B(dot)Ridge <ebr(at)tcdi(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Postgres + Xapian (was Re: fulltext searching via a custom index type )
Date: 2004-01-02 04:19:07
Message-ID: CE6BCC23-3CDA-11D8-BF11-000A95D98B3E@tcdi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Dec 26, 2003, at 4:04 PM, Tom Lane wrote:
> Eric Ridge <ebr(at)tcdi(dot)com> writes:
>> Xapian has it's own storage subsystem, and that's what I'm using to
>> store the index... not using anything internal to postgres (although
>> this could change).
>
> I would say you have absolutely zero chance of making it work that way.

I still think this is one of the best quotes I've heard in awhile. :)

> It might be worth pointing out here than an index AM is not bound to
> use
> exactly the typical Postgres page layout.

Thanks again for this little bit of info. It was just enough to get me
thinking about how to make "it work".

Xapian is basically a big btree index, except it's 5 btree indexes.
One for terms, one for posts (terms with positions), one for positions,
one for arbitrary document values, and one for the documents
themselves. Each index is made up of 3 physical files on disk. All
told there's 17 files for a single Xapian index (15 db files, a
versioninfo file, and a lock file).

I couldn't think of a way to create a whole new database type for
Xapian that could deal with managing 5 btree indexes inside of Postgres
(other than using tables w/ standard postgres btree index on certain
fields), so instead, I dug into Xapian and abstracted out it's
filesystem i/o (open, read, write, etc).

(as an aside, I did spend some time pondering ways to adapt Postgres'
nbtree AM to handle this, but I just don't understand how it works)

Once I had Xapian's filesystem i/o encapsulated into a nice little C++
class, I embarked on creating a mini "filesystem" ontop of Postgres'
storage subsystem. In essence, I've now got a Postgres access method
that mirrors the basics of a filesystem, from creating/open files to
reading from and writing to them, in addition to truncation and
deletion.

After that, it was just a matter of the glue code to teach Xapian to
use this "filesystem" for all its filesystem i/o, and voila!, Xapian
works ontop of Postgres' storage subsystem and I didn't have to rewrite
Xapian from scratch. And surprisingly, despite the additional overhead
of this filesystem abstraction layer, it's still very fast... esp. once
Buffers get cached.

I've still got more work to do (like dealing with locking and general
concurrency issues, not to mention bugs I haven't found yet), but it's
working *really* well in a single-user environment.

So here's the important question: How stupid is this?

I've done some benchmarking against tsearch2. Attached are the queries
and execution times on my dual 2gig G5 w/ 2gig ram.

The table contains 51,160 records. It's every text file contained on
my computer (which includes multiple copies of all my java projects).
All told, it's 337,343,569 bytes of data, with an average file size of
6,594 bytes. The Xapian operator is "=>", and tsearch2's operator is
"@@". I ran each query 6 times, and just took the best execution time.

It's also worth noting that my document parser is much different than
tsearch2's. I'm splitting words on non-alphanumerics (and currently am
not using stopwords), and it seems that tsearch2 tries to do something
more intelligent, so the # of results returned vary widely between
tsearch2 and Xapian. I'm not offering an opinion on which way is
"better".

I've got a few more questions about transactions, locking, and a few
other things, but I just thought I'd throw this out as a status report
and to see if there's any kind of reaction.

thanks for your time.

eric

Attachment Content-Type Size
query_timings.txt text/plain 1.6 KB

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Marc G. Fournier 2004-01-02 04:28:31 Re: Mnogosearch (Was: Re: website doc search is ... )
Previous Message Tom Lane 2004-01-02 04:09:23 Re: Mnogosearch (Was: Re: website doc search is ... )

Browse pgsql-hackers by date

  From Date Subject
Next Message Christopher Kings-Lynne 2004-01-02 04:56:33 Re: time format
Previous Message Tom Lane 2004-01-02 03:32:24 Re: [HACKERS] Spinlock support for linux-hppa?