Re: [GENERAL] Fragments in tsearch2 headline

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-05-31 23:58:41
Message-ID: 1212278321.5891.24.camel@dragflick
Lists: pgsql-general pgsql-hackers

I have attached a new patch against the current CVS head. It produces a
headline for a document given a query: it identifies fragments of text
that match the query and displays them.

DESCRIPTION

HeadlineParsedText contains an array of the actual words but no
information about the norms. We need an indexed position vector for each
norm so that we can quickly evaluate a number of possible fragments,
which is what tsvector provides.

So this patch changes HeadlineParsedText to contain the norms
(ParsedText). This field is updated while parsing in hlparsetext. The
position information of the norms corresponds to the position of words
in HeadlineParsedText (not to the norm positions, as is the case in
tsvector). This works correctly with the current parser. If you think
there may be issues with other parsers, please let me know.
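
To make the position convention concrete, here is a rough Python sketch
(not code from the patch, which is C). It assumes a hypothetical
toy_normalize() in place of the real dictionaries and records, for each
norm, the index of the word in the parsed array rather than a tsvector
style word count.

```python
from collections import defaultdict


def build_norm_positions(words, normalize):
    """Map each norm to the array indexes of the words that produced it."""
    positions = defaultdict(list)
    for index, word in enumerate(words):
        norm = normalize(word)
        if norm is not None:  # spaces and punctuation yield no norm
            positions[norm].append(index)
    return dict(positions)


def toy_normalize(token):
    # Hypothetical stand-in for a dictionary: lowercase, strip a plural "s".
    if not token.isalpha():
        return None
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


# The parsed array keeps every token, including the spaces.
words = ["Fast", " ", "cats", " ", "chase", " ", "a", " ", "cat"]
index = build_norm_positions(words, toy_normalize)
print(index["cat"])  # array indexes of "cats" and "cat"
```

Because the spaces occupy array slots, "cat" ends up at indexes 2 and 8
here, whereas a tsvector would number only the indexed words.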

This approach does not change any other interface and fits nicely with
the overall framework.

The norms are converted into tsvector and a number of covers are
generated. The best covers are then chosen to be in the headline. The
covers are separated using a hardcoded coversep. Let me know if you want
to expose this as an option.

Covers that overlap with already chosen covers are excluded.
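
The selection step can be sketched roughly as follows. This is an
illustration in Python, not the patch's C code, and it makes several
assumptions: the query is a plain AND of norms, "best" means shortest,
and the names find_covers, choose_covers, render and the " ... "
separator are all made up for the example.

```python
def find_covers(positions, query_norms, max_cover_size):
    """Return (start, end) word-index spans that contain every query norm."""
    needed = set(query_norms)
    # Sorted (position, norm) pairs for the query norms only.
    events = sorted(
        (pos, norm) for norm in needed for pos in positions.get(norm, [])
    )
    covers = []
    for i, (start, _) in enumerate(events):
        seen = set()
        for pos, norm in events[i:]:
            if pos - start + 1 > max_cover_size:
                break  # span would exceed the size limit
            seen.add(norm)
            if seen == needed:
                covers.append((start, pos))
                break
    return covers


def choose_covers(covers, num_fragments):
    """Greedily keep the shortest covers, skipping any that overlap a chosen one."""
    chosen = []
    for start, end in sorted(covers, key=lambda c: (c[1] - c[0], c[0])):
        if all(end < s or start > e for s, e in chosen):
            chosen.append((start, end))
        if len(chosen) == num_fragments:
            break
    return sorted(chosen)


def render(words, chosen, coversep=" ... "):
    """Join the chosen spans with the cover separator."""
    return coversep.join(" ".join(words[s:e + 1]) for s, e in chosen)


words = ("the cat saw a dog then another cat "
         "chased the big dog home").split()
positions = {"cat": [1, 7], "dog": [4, 11]}
covers = find_covers(positions, ["cat", "dog"], max_cover_size=5)
chosen = choose_covers(covers, num_fragments=2)
headline = render(words, chosen)
print(headline)
```

In this toy run the span (4, 7) is dropped because it shares word 4 with
the already chosen span (1, 4).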

Some options, such as ShortWord and MinWords, are not handled yet.
MaxWords is used as maxcoversize. Let me know if you would like to
see other options for fragment generation as well.
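
As a rough illustration of that mapping (Python, not the patch's C code):
the sketch below parses a ts_headline style options string, uses MaxWords
as the maximum cover size, and only reports ShortWord and MinWords as
ignored. The function names and the fallback value of 35 are assumptions
made for the example.

```python
def parse_headline_options(options):
    """Split 'Key=Value, Key=Value' into a dict with lowercased keys."""
    parsed = {}
    for item in options.split(","):
        if "=" not in item:
            continue  # skip empty or malformed entries
        key, value = item.split("=", 1)
        parsed[key.strip().lower()] = value.strip()
    return parsed


def cover_settings(options, default_max_words=35):
    """Derive the cover size limit; list options the sketch does not honor."""
    parsed = parse_headline_options(options)
    return {
        "max_cover_size": int(parsed.get("maxwords", default_max_words)),
        "ignored": sorted(k for k in parsed if k in ("shortword", "minwords")),
    }


settings = cover_settings("MaxWords=20, MinWords=5, ShortWord=3")
print(settings)
```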

Let me know of any other changes you would like to see.

-Sushant.

On Tue, 2008-05-27 at 13:30 +0400, Teodor Sigaev wrote:
> Hi!
>
> > 1. Why is hlparsetext used to parse the document rather than the
> > parsetext function? Since words to be included in the headline will be
> > marked afterwards, it seems more reasonable to just use the parsetext
> > function.
> > The main difference I see is the use of hlfinditem and marking whether
> > some word is repeated.
> hlparsetext preserves every kind of lexeme - non-indexed ones, spaces, etc.
> parsetext doesn't.
> hlparsetext preserves the original form of lexemes. parsetext doesn't.
>
> >
> > The reason this is important is that hlparsetext does not seem to be
> > storing word positions which parsetext does. The word positions are
> > important for generating headline with fragments.
> Not needed - hlparsetext preserves the whole text, so the position is the
> index in the array.
>
> >
> > 2.
> >> I would prefer the signature ts_headline( [regconfig,] text, tsquery
> >> [,text] ), and the function should accept 'NumFragments=>N' for the
> >> default parser. Other parsers may use other options.
> >
> > Does this mean we want a unified function ts_headline and we trigger the
> > fragments if NumFragments is specified?
>
> The trigger should be inside the parser-specific function
> (pg_ts_parser.prsheadline). Other parsers might not recognize that option.
>
> > It seems that introducing a new
> > function which can take configuration OID, or name is complex as there
> > are so many functions handling these issues in wparser.c.
> No, of course - ts_headline takes care of finding the configuration and
> calling the correct parser.
>
> >
> > If this is true then we need to just add marking of headline words in
> > prsd_headline. Otherwise we will need another prsd_headline_with_covers
> > function.
> Yeah, pg_ts_parser.prsheadline should mark the lexemes too. It can even
> change the array in HeadlineParsedText.
>
> >
> > 3. In many cases people may already have TSVector for a given document
> > (for search operation). Would it be faster to pass TSVector to headline
> > function when compared to computing TSVector each time? If that is the
> > case then should we have an option to pass TSVector to headline
> > function?
> As I mentioned above, tsvector doesn't contain the whole information about the text.
>

Attachment Content-Type Size
headlines_v0.2.patch text/x-patch 18.2 KB
