Re: [GENERAL] Incorrect FTS result with GIN index

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [GENERAL] Incorrect FTS result with GIN index
Date: 2010-07-29 11:03:32
Message-ID: Pine.LNX.4.64.1007291459270.32129@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Tom,

we're not able to work on this right now, so go ahead if you have time.
I also wonder why did I get "right" result :) Just repeated the query:

test=# select count(*) from search_tab where (to_tsvector('german', keywords ) @@ to_tsquery('german', 'ee:* & dd:*'));
count
-------
123
(1 row)

Time: 26.185 ms

Oleg
On Wed, 28 Jul 2010, Tom Lane wrote:

> Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> writes:
>> you can download dump http://mira.sai.msu.su/~megera/tmp/search_tab.dump
>
> Hmm ... I'm not sure why you're failing to reproduce it, because it's
> falling over pretty easily for me. After poking at it for awhile,
> I am of the opinion that scanGetItem's handling of multiple keys is
> fundamentally broken and needs to be rewritten completely. The
> particular case I'm seeing here is that one key returns this sequence of
> TIDs/lossy flags:
>
> ...
> 1085/4 0
> 1086/65535 1
> 1087/4 0
> ...
>
> while the other one returns this:
>
> ...
> 1083/11 0
> 1086/6 0
> 1086/10 0
> 1087/10 0
> ...
>
> and what comes out of scanGetItem is just
>
> ...
> 1086/6 1
> ...
>
> because after returning that, on the next call it advances both input
> keystreams. So 1086/10 should be visited and is not.
>
> I think that depending on the previous entryRes state to determine what
> to do is basically unworkable, and what should probably be done instead
> is to remember the last-returned TID and advance keystreams with TIDs <=
> that. I haven't quite thought through how that should interact with
> lossy-page TIDs but it seems more robust than what we've got.
>
> I'm also noticing that the ANDing behavior for the "ee:* & dd:*" query
> style seems very much stupider than it needs to be --- it's returning
> lossy pages that very obviously don't need to be examined because the
> other keystream has no match at all on that page. But I haven't had
> time to probe into the reason why.
>
> I'm out of time for today, do you want to work on it?
>
> regards, tom lane
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ned Lilly 2010-07-29 12:10:12 Re: Which CMS/Ecommerce/Shopping cart ?
Previous Message Oleg Bartunov 2010-07-29 10:57:30 Re: Need help with full text index configuration

Browse pgsql-hackers by date

  From Date Subject
Next Message Boszormenyi Zoltan 2010-07-29 11:55:38 Re: lock_timeout GUC patch - Review
Previous Message Henk Enting 2010-07-29 10:57:19 patch for check constraints using multiple inheritance