Re: [GENERAL] Fragments in tsearch2 headline

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-18 01:16:24
Message-ID: 1216343784.10058.5.camel@dragflick
Lists: pgsql-general pgsql-hackers

Fixed some off-by-one errors pointed out by Oleg, and errors in excluding
overlapping fragments.

Also adding test queries and updating regression tests.

Let me know of any other changes that are needed.

-Sushant.

On Thu, 2008-07-17 at 03:28 +0400, Oleg Bartunov wrote:
> On Wed, 16 Jul 2008, Sushant Sinha wrote:
>
> > I will add test queries and their results for the corner cases in a
> > separate file. I guess the only thing I am confused about is what should
> > be the behavior of headline generation when Query items have words of
> > size less than ShortWord. I guess the answer is to ignore ShortWord
> > parameter but let me know if the answer is any different.
> >
>
> ShortWord is about headline text; it doesn't affect words in the query,
> so you can't discard them from the query.
>
> > -Sushant.
> >
> > On Thu, 2008-07-17 at 02:53 +0400, Oleg Bartunov wrote:
> >> Sushant,
> >>
> >> first, please provide simple test queries that demonstrate correct
> >> behavior in the corner cases. This will help reviewers test your patch and
> >> help you make sure your new version is ok. For example:
> >>
> >> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery);
> >> ts_headline
> >> ------------------------------------------------------
> >> <b>1</b> 2 <b>3</b> 4 5 <b>1</b> 2 <b>3</b> <b>1</b>
> >>
> >> This select breaks your code:
> >>
> >> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2');
> >> ts_headline
> >> --------------
> >> ... 2 ...
> >>
> >> and so on ....
> >>
> >>
> >> Oleg
> >> On Tue, 15 Jul 2008, Sushant Sinha wrote:
> >>
> >>> Attached a new patch that:
> >>>
> >>> 1. fixes previous bug
> >>> 2. better handles the case when cover size is greater than the MaxWords.
> >>> Basically it divides a cover greater than MaxWords into fragments of
> >>> MaxWords, resizes each such fragment so that each end of the fragment
> >>> contains a query word and then evaluates best fragments based on number of
> >>> query words in each fragment. In case of tie it picks up the smaller
> >>> fragment. This allows more query words to be shown with multiple fragments
> >>> in case a single cover is larger than the MaxWords.
> >>>
> >>> The resizing of a fragment such that each end is a query word provides room
> >>> for stretching both sides of the fragment. This (hopefully) better presents
> >>> the context in which query words appear in the document. If a cover is
> >>> smaller than MaxWords then the cover is treated as a fragment.
> >>>
> >>> Let me know if you have any more suggestions or anything is not clear.
> >>>
> >>> I have not yet added the regression tests. The regression test suite seemed
> >>> to be only ensuring that the function works. How many tests should I be
> >>> adding? Is there any other place that I need to add different test cases for
> >>> the function?
> >>>
> >>> -Sushant.
> >>>
> >>>
> >>>> Nice. But it would be good to resolve the following issues:
> >>>> 1) Patch contains mistakes, I didn't investigate or carefully read it. Get
> >>>> http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and load it into a db.
> >>>>
> >>>> Queries
> >>>> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> >>>> from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
> >>>>
> >>>> and
> >>>>
> >>>> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> >>>> from apod;
> >>>>
> >>>> crash postgresql :(
> >>>>
> >>>> 2) Please include documentation and regression tests in your patch.
> >>>>
> >>>>
> >>>>> Another change that I was thinking:
> >>>>>
> >>>>> Right now if cover size > max_words then I just cut the trailing words.
> >>>>> Instead I was thinking that we should split the cover into more
> >>>>> fragments such that each fragment contains a few query words. Then each
> >>>>> fragment will not contain all query words but will show more occurrences
> >>>>> of query words in the headline. I would like to know what your opinion
> >>>>> on this is.
> >>>>>
> >>>>
> >>>> Agreed.
> >>>>
> >>>>
> >>>> --
> >>>> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> >>>> WWW: http://www.sigaev.ru/
> >>>>
> >>>
> >>
> >> Regards,
> >> Oleg
> >> _____________________________________________________________
> >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> >> Sternberg Astronomical Institute, Moscow University, Russia
> >> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> >> phone: +007(495)939-16-83, +007(495)939-23-83
> >
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
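
The fragment selection described in the quoted 15 Jul message (split an oversized cover into MaxWords-sized pieces, trim each piece so both ends are query words, rank by number of query words, break ties in favour of the shorter piece) can be sketched roughly as follows. This is only a Python illustration under stated assumptions, not the C code from the attached patch: the function names split_cover, pick_fragments and render are hypothetical, the cover boundaries are passed in by hand, and highlighting, ShortWord handling and the stretching of fragments are left out.

```python
def split_cover(words, query, start, end, max_words):
    """Cut the cover words[start..end] into fragments of at most max_words
    words, trimming each one so that it begins and ends on a query word.
    Returns a list of (first_index, last_index, query_word_count)."""
    if end - start + 1 <= max_words:
        chunks = [(start, end)]  # a cover that fits is a single fragment
    else:
        chunks = [(s, min(s + max_words - 1, end))
                  for s in range(start, end + 1, max_words)]
    fragments = []
    for lo, hi in chunks:
        # Shrink from both sides until each end sits on a query word.
        while lo <= hi and words[lo] not in query:
            lo += 1
        while hi >= lo and words[hi] not in query:
            hi -= 1
        if lo <= hi:  # drop chunks that contain no query word at all
            hits = sum(1 for w in words[lo:hi + 1] if w in query)
            fragments.append((lo, hi, hits))
    return fragments


def pick_fragments(fragments, max_fragments):
    """Keep the fragments with the most query words; on a tie prefer the
    shorter one. The result is returned in document order."""
    ranked = sorted(fragments, key=lambda f: (-f[2], f[1] - f[0]))
    return sorted(ranked[:max_fragments])


def render(words, fragments):
    """Join the chosen fragments with an ellipsis between them."""
    return " ... ".join(" ".join(words[lo:hi + 1]) for lo, hi, _ in fragments)


# Oleg's corner-case document and query; max_words=4 is an assumed value.
words = "1 2 3 4 5 1 2 3 1".split()
query = {"1", "3"}
frags = split_cover(words, query, 0, len(words) - 1, max_words=4)
print(render(words, pick_fragments(frags, max_fragments=2)))
```

With these assumed settings the sketch prints "1 2 3 ... 1 2 3". That is the output of the sketch only; it says nothing about what ts_headline itself returns with or without the patch.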

Attachment                   Content-Type  Size
headlines_v0.8.patch         text/x-patch  12.7 KB
headlines_test.txt           text/x-vhdl   8.3 KB
headlines_regressv0.2.patch  text/x-patch  4.5 KB
