OK, lets step back a bit and see if there is a solution that fits what we
think we need and PostgreSQL.
Lets talk about FTSS, its something I can discuss easily. It is a two
stage system with an indexer and a server. Only the data to be indexed is
in the database, all the FTSS data structures are in external files.
The indexer creates a number of data structures.
A table of document references, one entry per document.
A table of words parsed, one word per entry
A table of compressed bitmaps, one (or more) bitmap(s) per word.
The relation of bits in the word bitmaps is one bit per document as
ordered by the document table, i.e. if bit 5 is set high, then the fith
document is selected.
(Let's not discuss phrase analysis at this point.)
When the indexer runs, it executes a query that produces a set of results.
Each result has a document reference which is stored in the FTSS document
The results are parsed out as discrete words, new words are added to the
word table, previously used word's reference counts are incremented.
A bitmap is created for each new word.
The bit of the current document is set to "1."
This procedure runs for each record in the query.
The server runs as follows:
accepts an HTTP request for search
Parses out the discrete words.
The word is found in the word table.
The word's bitmap is retrieved from the bitmap table.
A series of logical functions are performed on the retrieved bitmaps.
The resulting bitmap contains all the relevant documents in the form of
bits correlating to offsets into the document reference table.
The list of document references is returned to the database and found
using a "WHERE IN" clause.
Now, it occurs to me that if my document reference table can refer to
something other than an indexed primary key, I can save a lot of index
processing time in PostgreSQL if I can have a "safe" analogy to CTID.
I should be able to "know" more about a particular row (document) being
referenced, because I have already been through the table once.
I need to be able to "know" which rows are newer than my FTSS index so I
can search those rows more dynamically. I currently do this by saving the
highest value during indexing.
> pgsql(at)mohawksoft(dot)com writes:
>> A "persistent reference" index would be like almost any other index
>> that as new items are added to a table a new entry is added to the end
>> the index. When a row is updated, its CTID is updated in the table.
> This *does not work* under MVCC: it can't cope with referencing
> multiple simultaneously existing versions of a row. In general, any
> index design that includes the words "update in place" can be rejected
> out of hand.
> In any case I still fail to see the advantage compared to an ordinary
> serial primary key. You could have made your bitmaps reference the
> serial numbers directly, instead of an intermediate table. (Either way
> still fails to handle MVCC updates, unless the indexable attributes
> cannot be changed by updates; but the intermediate table isn't helping
> or hurting that.)
> A bitmap that indexes CTIDs directly could work, because it doesn't need
> to update any entries in-place when a table row is updated. I didn't
> care for the details of Victor's design because (a) the intermediate
> list of CTIDs looks bulky and (b) it seemed to require a lot of data
> shuffling to cope with growth of the table. But in principle it would
> regards, tom lane
In response to
pgsql-hackers by date
|Next:||From: Nicolai Tufar||Date: 2005-03-01 20:19:31|
|Subject: Re: [pgsql-hackers-win32] snprintf causes regression tests to fail|
|Previous:||From: Victor Y. Yegorov||Date: 2005-03-01 19:52:24|
|Subject: Re: bitmap AM design|