Re: text search: restricting the number of parsed words in headline generation

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: text search: restricting the number of parsed words in headline generation
Date: 2012-08-15 17:39:18
Message-ID: 1345052358.2737.0.camel@dragflick
Lists: pgsql-hackers

I will do the profiling and present the results.

On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > Is this a TODO?
>
> AFAIR nothing's been done about the speed issue, so yes. I didn't
> like the idea of creating a user-visible knob when the speed issue
> might be fixable with internal algorithm improvements, but we never
> followed up on this in either fashion.
>
> regards, tom lane
>
> > ---------------------------------------------------------------------------
>
> > On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> >> Sushant Sinha <sushant354(at)gmail(dot)com> writes:
> >>> Doesn't this force the headline to be taken from the first N words of
> >>> the document, independent of where the match was? That seems rather
> >>> unworkable, or at least unhelpful.
> >>
> >>> In headline generation function, we don't have any index or knowledge of
> >>> where the match is. We discover the matches by first tokenizing and then
> >>> comparing the matches with the query tokens. So it is hard to do
> >>> anything better than first N words.
> >>
> >> After looking at the code in wparser_def.c a bit more, I wonder whether
> >> this patch is doing what you think it is. Did you do any profiling to
> >> confirm that tokenization is where the cost is? Because it looks to me
> >> like the match searching in hlCover() is at least O(N^2) in the number
> >> of tokens in the document, which means it's probably the dominant cost
> >> for any long document. I suspect that your patch helps not so much
> >> because it saves tokenization costs as because it bounds the amount of
> >> effort spent in hlCover().
> >>
> >> I haven't tried to do anything about this, but I wonder whether it
> >> wouldn't be possible to eliminate the quadratic blowup by saving more
> >> state across the repeated calls to hlCover(). At the very least, it
> >> shouldn't be necessary to find the last query-token occurrence in the
> >> document from scratch on each and every call.
> >>
> >> Actually, this code seems probably flat-out wrong: won't every
> >> successful call of hlCover() on a given document return exactly the same
> >> q value (end position), namely the last token occurrence in the
> >> document? How is that helpful?
> >>
> >> regards, tom lane
> >>
> >> --
> >> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> >> To make changes to your subscription:
> >> http://www.postgresql.org/mailpref/pgsql-hackers
>
> > --
> > Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
> > EnterpriseDB http://enterprisedb.com
>
> > + It's impossible for everything to be true. +
>
