Re: OK, does anyone have any better ideas?

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: mlw <markw(at)mohawksoft(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: OK, does anyone have any better ideas?
Date: 2000-12-09 13:41:37
Message-ID: Pine.GSO.3.96.SK.1001209163702.4174p-100000@ra
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-novice

On Sat, 9 Dec 2000, mlw wrote:

> Date: Sat, 09 Dec 2000 08:18:10 -0500
> From: mlw <markw(at)mohawksoft(dot)com>
> To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
> Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>,
> Hackers List <pgsql-hackers(at)postgresql(dot)org>
> Subject: Re: [HACKERS] OK, does anyone have any better ideas?
>
> Oleg Bartunov wrote:
> >
> > We need multi-key B-tree like index for such problem.
> > Our full text search engine is based on arrays and we need to find quickly
> > is some number exists in array - some kind of index over int array.
> > We're currently testing GiST approach and seems will have some conclusions
> > soon. I think multi-key B-tree like index would be better in my
> > opinion, but this requires to much work and knowledge of postgres's internals.
> > Yesterday I read about UBTree, seems like it's good for index and query
> > sets. Currently postgres has no set specific methods.
>
> The way I do my search indexing is with bitmap objects and a word
> dictionary. One creates a searchable dictionary of words by scanning the
> selected records. So, in one query that results in 2 million records,
> the total aggregate number of words is about 60,000 depending on how you
> parse. For each word, you create a "bitmap object" (in one of a few
> forms) where bit '0' represents the first record, bit '1' represents the
> second, and so on, until you have 2 million bits.
>
> Set the correct bit in the bitmap for each document that contains that
> word. In the end you will have the equivalent 60,000 bitmaps or 2
> million bits.
>
> During search time, one creates an empty bitmap of 2 million bits as a
> work space. One parses the search term, and performs boolean operation
> on the workspace from the bitmap retrieved for each word.
>
> When you are done parsing, you have a bitmap with a bit set for each
> document that fits the search criteria. You then enumerate the bits by
> bit position, and you now have a list of document numbers.
>
> If only I could get the list of document numbers back into
> postgres....... It would be great.

Gotcha. It's impossible to return a set from a function, so the only
way to use perl to parse your bitmap. We did (in one project) external
search using suffix arrays which incredibly fast and use postgres to
return results to perl for processing.

Regards,

Oleg

> --
> http://www.mohawksoft.com
>

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message mlw 2000-12-09 14:28:37 Re: OK, does anyone have any better ideas?
Previous Message Hiroshi Inoue 2000-12-09 13:23:22 RE: pgsql/src/backend/commands (command.c vacuum.c)

Browse pgsql-novice by date

  From Date Subject
Next Message mlw 2000-12-09 14:28:37 Re: OK, does anyone have any better ideas?
Previous Message mlw 2000-12-09 13:18:10 Re: OK, does anyone have any better ideas?