From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Robert John Shepherd <robert(at)reviewer(dot)co(dot)uk>
Cc: openfts discussion <openfts-general(at)lists(dot)sourceforge(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:
Date: 2005-01-25 20:21:30
Message-ID: Pine.GSO.4.62.0501252217060.6363@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 24 Jan 2005, Robert John Shepherd wrote:

> Anyone else had problems with the performance of Tsearch2 when you reduce
> the words significantly in the stopword list? Tsearch1 was the same.
>
> Not really understanding Tsearch beyond installing and using it, I can't
> really be sure if this is simply because of the amount of memory required
> for indexes etc goes up a large amount if the stopwords are reduced.
>
> But it is a horrendous problem, and is either created (or made much worse)
> with every update/insert into tables that are full text indexed. :/
>

May I ask you (my reader) to correct my wording and arrange as HTML
document, so I could put it on tsearch2 page - I'm afraid I'll be able to
explain tsearch2 internals again :)
I'll answer your questions and hope we'll get nice document.
Illustrations are welcome !

Full text search index is lossy, i.e., tuples returned by search should
be checked for 'false drops'. Probability of 'false drops' is depends on
several factors and the number of unique words is one of them.
Document represented as a bit string with '1' in positions to which words are
hashed. Because of hashing there is a chance some words hashed to the
same position. These bit strings are stored in RD-tree (Russian Doll tree),
invented by Hellerstein, where parent is 'OR'-ed bit-strings of all
children. It's clear, that parents tend to be full of '1' (degenerates)
and become quite
useless because of it's little selectivity. Searching performs as a bit comparison
of bit string represents query and RD-tree entry. If all '1' of both
bit strings are in the same position we say that this branch *probably*
contains query, but even if there is one discrepance we could *definitly*
reject this branch. That's where we get saving and where we again
get probablity of false drop. So we have two reasons, why we need to mark
index as lossy and check search results for 'false drops'. The nature
of these reasons are different - first is exists because of limited
length of bit string represent document (the number of unique words influence
on the 'quality' of bit string) , while second - because of the way
we store bit strings (the number of documents influence on the 'quality'
of index). Of course, implicitly, index 'quality' depends on 'quality'
of each bit string, so both reasons have influence on the performance
of search. Checking for 'false drops' could be very costly operations, because
we should read tuple from disk and check if it satisfies query which
performs as a direct comparison of query and tsvector (this operation
is fast because tsvector is a sort of ordered array of stems)

Moral of my story: keep the number of unique words small.

Where is a solution ? Next time I could post a proposals for
scalable full text search in PostgreSQL. Just ask me :)
Seriously, we've done a prototype of tsearch2 daemon and are looking
for sponsorships.

>
>
> Yours Unwhettedly,
> Robert John Shepherd.
>
> Editor
> DVD REVIEWER
> The UK's BIGGEST Online DVD Magazine
> http://www.dvd.reviewer.co.uk3~3~3~3~3~3~3~3~3~
>
> For a copy of my Public PGP key, email: pgp(at)robertsworld(dot)org(dot)uk
>
>
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: IntelliVIEW -- Interactive Reporting
> Tool for open source databases. Create drag-&-drop reports. Save time
> by over 75%! Publish reports on the web. Export to DOC, XLS, RTF, etc.
> Download a FREE copy at http://www.intelliview.com/go/osdn_nl
> _______________________________________________
> OpenFTS-general mailing list
> OpenFTS-general(at)lists(dot)sourceforge(dot)net
> https://lists.sourceforge.net/lists/listinfo/openfts-general
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Browse pgsql-hackers by date

  From Date Subject
Next Message elein 2005-01-25 20:34:57 Vacuum Looping 7.4
Previous Message Merlin Moncure 2005-01-25 20:06:33 Re: [HACKERS] userlock changes for 8.1/8.2