Re: [GENERAL] Fragments in tsearch2 headline

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-06-03 03:13:49
Message-ID: 1212462829.8047.38.camel@dragflick
Lists: pgsql-general pgsql-hackers

Efficiency: I realized that we do not need to store all norms. We only
need to store the norms that are in the query. So I moved the addition
of norms from addHLParsedLex to hlfinditem. This should add very little
memory overhead to existing headline generation.

If this is still not acceptable for default headline generation, then I
can push it into mark_hl_fragments. But I think any headline marking
function will benefit by having the norms corresponding to the query.

Why do we need norms?

hlCover does exactly what Cover in tsrank does, which is to find the
cover that contains the query. However, hlCover has to go through
words that do not match the query. Cover, on the other hand, operates on
position indexes for just the query words, so it should be faster.
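
To illustrate why scanning only query-word positions is cheap, here is a
minimal sketch. It is not the code in the patch or in tsrank: the TermPos
layout, MAX_TERMS and next_cover are names made up for this example, and
it assumes a plain AND query whose terms are numbered 0 .. nterms-1.

```c
#include <stddef.h>

#define MAX_TERMS 32            /* illustrative limit, not from the patch */

/* One occurrence of a query term in the document (hypothetical layout). */
typedef struct
{
    int pos;    /* word position in the document */
    int term;   /* index of the query term, 0 .. nterms-1 */
} TermPos;

/*
 * Scan entries (sorted by pos) from index 'start' and find the first
 * window that contains every one of the nterms query terms.  On success
 * store the window's first and last document positions in *p and *q and
 * return 1; return 0 if no such window exists.
 */
int
next_cover(const TermPos *entries, size_t n, size_t start,
           int nterms, int *p, int *q)
{
    int     last[MAX_TERMS];    /* last entry index seen for each term */
    int     seen = 0;           /* number of distinct terms seen so far */
    size_t  i;
    int     t;

    if (nterms <= 0 || nterms > MAX_TERMS)
        return 0;
    for (t = 0; t < nterms; t++)
        last[t] = -1;

    for (i = start; i < n; i++)
    {
        t = entries[i].term;
        if (t < 0 || t >= nterms)
            continue;           /* ignore entries with an invalid term id */
        if (last[t] < 0)
            seen++;
        last[t] = (int) i;

        if (seen == nterms)
        {
            /* right edge is entry i; left edge is the oldest "last seen" */
            int lo = (int) i;

            for (t = 0; t < nterms; t++)
                if (last[t] < lo)
                    lo = last[t];
            *p = entries[lo].pos;
            *q = entries[i].pos;
            return 1;
        }
    }
    return 0;
}
```

The loop touches only the occurrences of query words, never the
non-matching words in between, which is the difference from hlCover
described above.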

The main reason I would like it to be fast is that I want to
generate all covers for a given query. Then choose the covers with the
smallest length, as they will be the ones that best explain the relation
of a query to a document. Finally, stretch those covers to the specified
size.
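
The selection and stretching steps could look roughly like the sketch
below. This is only an illustration under my own assumptions: CoverSpan,
pick_fragments and stretch_cover are hypothetical names, positions are
zero-based word indexes, and the overlap rule is the one from my earlier
mail (covers overlapping an already chosen cover are skipped).

```c
#include <stdlib.h>

typedef struct
{
    int start;  /* first word position, inclusive */
    int end;    /* last word position, inclusive */
} CoverSpan;

/* Order covers by length, shortest first; ties broken by start position. */
static int
cmp_cover_len(const void *a, const void *b)
{
    const CoverSpan *ca = a;
    const CoverSpan *cb = b;
    int la = ca->end - ca->start;
    int lb = cb->end - cb->start;

    if (la != lb)
        return (la < lb) ? -1 : 1;
    return (ca->start > cb->start) - (ca->start < cb->start);
}

/*
 * Sort covers in place and copy up to num_fragments of the shortest,
 * mutually non-overlapping ones into out.  Returns how many were chosen.
 */
int
pick_fragments(CoverSpan *covers, int ncovers, int num_fragments,
               CoverSpan *out)
{
    int nout = 0;
    int i, j;

    if (ncovers <= 0 || num_fragments <= 0)
        return 0;
    qsort(covers, (size_t) ncovers, sizeof(CoverSpan), cmp_cover_len);

    for (i = 0; i < ncovers && nout < num_fragments; i++)
    {
        int overlaps = 0;

        for (j = 0; j < nout; j++)
            if (covers[i].start <= out[j].end &&
                out[j].start <= covers[i].end)
            {
                overlaps = 1;
                break;
            }
        if (!overlaps)
            out[nout++] = covers[i];
    }
    return nout;
}

/*
 * Widen a cover to at most max_words words.  Half of the extra words go
 * to the left; whatever the left side cannot take is offered to the
 * right.  Both sides are clamped to the document, [0, doc_len - 1].
 */
void
stretch_cover(CoverSpan *c, int max_words, int doc_len)
{
    int extra = max_words - (c->end - c->start + 1);
    int left, right;

    if (extra <= 0 || doc_len <= 0)
        return;
    left = extra / 2;
    if (left > c->start)
        left = c->start;
    right = extra - left;
    if (right > doc_len - 1 - c->end)
        right = doc_len - 1 - c->end;
    c->start -= left;
    c->end += right;
}
```

Note that this sketch checks overlap before stretching, so two stretched
fragments could still touch each other; a real implementation would have
to decide how to handle that.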

In my understanding, the current headline generation tries to find the
biggest cover for display in the headline. I personally think that such
a cover does not explain the context of a query in a document. We may
differ on this, and that's why we may need both options.

Let me know what you think on this patch and I will update the patch to
respect other options like MinWords and ShortWord.

NumFragments < 2:
I wanted people to use the new headline marker if they specify
NumFragments >= 1. If they do not specify NumFragments or set it to
0, then the default marker will be used. This makes it a somewhat tricky
parameter, so please suggest any ideas on how to trigger the new marker.

On another note, I found that make_tsvector crashes if it receives a
ParsedText with curwords = 0. Specifically, uniqueWORD returns curwords
as 1 even when it gets 0 words. I am not sure if this is the desired
behavior.
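
To show the kind of off-by-one I mean, here is a simplified stand-in
that works on plain ints. It is not the actual uniqueWORD source;
unique_ints is a made-up name, and I am assuming the real routine
computes its result the same way, as the distance from the start of the
array to the last kept element plus one.

```c
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

/*
 * Sort a[0 .. l-1] and squeeze out duplicates in place.
 * Returns the number of distinct values.
 */
int
unique_ints(int *a, int l)
{
    int *res;
    int *ptr;

    /*
     * Without this guard the loop below is skipped for l == 0 and the
     * final expression evaluates to 1, i.e. one "word" is reported for
     * an empty input.
     */
    if (l <= 0)
        return 0;
    if (l == 1)
        return 1;

    qsort(a, (size_t) l, sizeof(int), cmp_int);

    res = a;            /* last distinct element kept so far */
    ptr = a + 1;        /* next element to examine */
    while (ptr - a < l)
    {
        if (*ptr != *res)
            *(++res) = *ptr;
        ptr++;
    }
    return (int) (res + 1 - a);
}
```

An explicit check for zero input, either here or in the caller, would
avoid handing a bogus count of 1 to the code that builds the tsvector.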

-Sushant.

On Mon, 2008-06-02 at 18:10 +0400, Teodor Sigaev wrote:
> > I have attached a new patch with respect to the current cvs head. This
> > produces headline in a document for a given query. Basically it
> > identifies fragments of text that contain the query and displays them.
> New variant is much better, but...
>
> > HeadlineParsedText contains an array of actual words but not
> > information about the norms. We need an indexed position vector for each
> > norm so that we can quickly evaluate a number of possible fragments.
> > Something that tsvector provides.
>
> Why do you need to store norms? The single purpose of norms is identifying words
> from query - but it's already done by hlfinditem. It sets
> HeadlineWordEntry->item to corresponding QueryOperand in tsquery.
> Look, headline function is rather expensive and your patch adds a lot of extra
> work - at least in memory usage. And if user calls with NumFragments=0 then that
> work is unneeded.
>
> > This approach does not change any other interface and fits nicely with
> > the overall framework.
> Yeah, it's a really big step forward. Thank you. You are very close to
> committing except: Did you find the hlCover() function, which produces a cover from
> the original HeadlineParsedText representation? Is there any reason not to use it?
>
> >
> > The norms are converted into tsvector and a number of covers are
> > generated. The best covers are then chosen to be in the headline. The
> > covers are separated using a hardcoded coversep. Let me know if you want
> > to expose this as an option.
>
>
> >
> > Covers that overlap with already chosen covers are excluded.
> >
> > Some options like ShortWord and MinWords are not taken care of right
> > now. MaxWords are used as maxcoversize. Let me know if you would like to
> > see other options for fragment generation as well.
> ShortWord, MinWords and MaxWords should store their meaning, but for each
> fragment, not for the whole headline.
>
>
> >
> > Let me know any more changes you would like to see.
>
> if (num_fragments == 0)
>     /* call the default headline generator */
>     mark_hl_words(prs, query, highlight, shortword, min_words, max_words);
> else
>     mark_hl_fragments(prs, query, highlight, num_fragments, max_words);
>
>
> Suppose, num_fragments < 2?
>

Attachment Content-Type Size
headlines_v0.3.patch text/x-patch 18.3 KB
