Re: [GENERAL] Fragments in tsearch2 headline

From: "Sushant Sinha" <sushant354(at)gmail(dot)com>
To: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Cc: "Pierre-Yves Strub" <pierre(dot)yves(dot)strub(at)gmail(dot)com>, "Pgsql Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-15 04:50:29
Message-ID: 9fb559330807142150m75fa325fv52f161e6857a712d@mail.gmail.com
Lists: pgsql-general pgsql-hackers

Attached is a new patch that:

1. fixes the previous bug
2. better handles the case when the cover size is greater than MaxWords.
It divides a cover larger than MaxWords into fragments of MaxWords words,
shrinks each fragment so that both ends of the fragment are query words,
and then ranks the fragments by the number of query words in each. In case
of a tie it picks the smaller fragment. This allows more query words to be
shown across multiple fragments when a single cover is larger than
MaxWords.

Shrinking a fragment so that each end is a query word leaves room to
stretch the fragment on both sides. This (hopefully) better presents the
context in which the query words appear in the document. If a cover is
smaller than MaxWords, the cover itself is treated as a fragment.
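
The selection described above could be sketched roughly as follows. This is
an illustrative Python sketch, not the code in the attached C patch: the
names split_cover and best_fragments, the is_query flag list, and the
inclusive (start, end, count) tuples are all assumptions made for the
example, and the final stretching step is left out.

```python
from typing import List, Tuple

# (start, end, query_word_count); start and end are inclusive word positions.
Fragment = Tuple[int, int, int]


def split_cover(is_query: List[bool], start: int, end: int,
                max_words: int) -> List[Fragment]:
    """Split the cover [start, end] into fragments of at most max_words words."""
    if end - start + 1 <= max_words:
        # A cover that already fits is treated as a single fragment.
        chunks = [(start, end)]
    else:
        chunks = [(s, min(s + max_words - 1, end))
                  for s in range(start, end + 1, max_words)]

    fragments = []
    for s, e in chunks:
        # Shrink the chunk so that both ends land on a query word.
        while s <= e and not is_query[s]:
            s += 1
        while e >= s and not is_query[e]:
            e -= 1
        if s <= e:  # skip chunks that contain no query word at all
            fragments.append((s, e, sum(is_query[s:e + 1])))
    return fragments


def best_fragments(fragments: List[Fragment],
                   max_fragments: int) -> List[Fragment]:
    """Prefer more query words; on a tie prefer the shorter fragment."""
    ranked = sorted(fragments, key=lambda f: (-f[2], f[1] - f[0]))
    # Return the winners in document order.
    return sorted(ranked[:max_fragments])


words = "a black x x hole y y y black hole z".split()
is_query = [w in {"black", "hole"} for w in words]
# The cover runs from the first "black" (1) to the last "hole" (9).
fragments = split_cover(is_query, 1, 9, max_words=4)
chosen = best_fragments(fragments, max_fragments=2)
```

In this sketch a chunk with no query word is simply dropped; whether the
patch does the same is not stated above.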

Let me know if you have any more suggestions or if anything is unclear.

I have not yet added the regression tests. The existing regression test
suite seems only to check that the function works. How many tests should I
add? Is there any other place where I need to add test cases for the
function?

-Sushant.

> Nice. But it will be good to resolve following issues:
> 1) Patch contains mistakes, I didn't investigate or carefully read it. Get
> http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and load in db.
>
> Queries
> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
>
> and
>
> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> from apod;
>
> crash postgresql :(
>
> 2) pls, include in your patch documentation and regression tests.
>
>
>> Another change that I was thinking:
>>
>> Right now if cover size > max_words then I just cut the trailing words.
>> Instead I was thinking that we should split the cover into more
>> fragments such that each fragment contains a few query words. Then each
>> fragment will not contain all query words but will show more occurrences
>> of query words in the headline. I would like to know what your opinion
>> on this is.
>>
>
> Agreed.
>
>
> --
> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> WWW:
> http://www.sigaev.ru/
>

Attachment Content-Type Size
headlines_v0.6.patch text/x-diff 12.6 KB
