Re: [GENERAL] Fragments in tsearch2 headline

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-06-21 14:00:53
Message-ID: 1214056853.8689.10.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

I have an attached an updated patch with following changes:

1. Respects ShortWord and MinWords
2. Uses hlCover instead of Cover
3. Does not store norm (or lexeme) for headline marking
4. Removes ts_rank.h
5. Earlier it was counting even NONWORDTOKEN in the headline. Now it
only counts the actual words and excludes spaces etc.

I have also changed NumFragments option to MaxFragments as there may not
be enough covers to display NumFragments.

Another change that I was thinking:

Right now if cover size > max_words then I just cut the trailing words.
Instead I was thinking that we should split the cover into more
fragments such that each fragment contains a few query words. Then each
fragment will not contain all query words but will show more occurrences
of query words in the headline. I would like to know what your opinion
on this is.

-Sushant.

On Thu, 2008-06-05 at 20:21 +0400, Teodor Sigaev wrote:
> > A couple of caveats:
> >
> > 1. ts_headline testing was done with current cvs head where as
> > headline_with_fragments was done with postgres 8.3.1.
> > 2. For headline_with_fragments, TSVector for the document was obtained
> > by joining with another table.
> > Are these differences understandable?
>
> That is possible situation because ts_headline has several criterias of 'best'
> covers - length, number of words from query, good words at the begin and at the
> end of headline while your fragment's algorithm takes care only on total number
> of words in all covers. It's not very good, but it's acceptable, I think.
> Headline (and ranking too) hasn't any formal rules to define is it good or bad?
> Just a people's opinions.
>
> Next possible reason: original algorithm had a look on all covers trying to find
> the best one while your algorithm tries to find just the shortest covers to fill
> a headline.
>
> But it's very desirable to use ShortWord - it's not very comfortable for user if
> one option produces unobvious side effect with another one.
> `
>
> > If you think these caveats are the reasons or there is something I am
> > missing, then I can repeat the entire experiments with exactly the same
> > conditions.
>
> Interesting for me test is a comparing hlCover with Cover in your patch, i.e.
> develop a patch which uses hlCover instead of Cover and compare old patch with
> new one.

Attachment Content-Type Size
headlines_v0.5.patch text/x-patch 11.0 KB

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2008-06-21 16:56:13 Re: System in Recovery Mode But No Activity
Previous Message kevin kempter 2008-06-21 13:06:35 function question

Browse pgsql-hackers by date

  From Date Subject
Next Message Stefan Kaltenbrunner 2008-06-21 14:08:43 Re: -head build error report
Previous Message Joshua D. Drake 2008-06-21 13:30:10 Re: -head build error report