From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | sushant354(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
Subject: | Re: text search: restricting the number of parsed words in headline generation |
Date: | 2012-08-15 16:19:58 |
Message-ID: | 20120815161958.GK25473@momjian.us |
Lists: | pgsql-hackers |
Is this a TODO?
---------------------------------------------------------------------------
On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> Sushant Sinha <sushant354(at)gmail(dot)com> writes:
> >> Doesn't this force the headline to be taken from the first N words of
> >> the document, independent of where the match was? That seems rather
> >> unworkable, or at least unhelpful.
>
> > In headline generation function, we don't have any index or knowledge of
> > where the match is. We discover the matches by first tokenizing and then
> > comparing the matches with the query tokens. So it is hard to do
> > anything better than first N words.
>
> After looking at the code in wparser_def.c a bit more, I wonder whether
> this patch is doing what you think it is. Did you do any profiling to
> confirm that tokenization is where the cost is? Because it looks to me
> like the match searching in hlCover() is at least O(N^2) in the number
> of tokens in the document, which means it's probably the dominant cost
> for any long document. I suspect that your patch helps not so much
> because it saves tokenization costs as because it bounds the amount of
> effort spent in hlCover().
>
> I haven't tried to do anything about this, but I wonder whether it
> wouldn't be possible to eliminate the quadratic blowup by saving more
> state across the repeated calls to hlCover(). At the very least, it
> shouldn't be necessary to find the last query-token occurrence in the
> document from scratch on each and every call.
>
> Actually, this code seems probably flat-out wrong: won't every
> successful call of hlCover() on a given document return exactly the same
> q value (end position), namely the last token occurrence in the
> document? How is that helpful?
>
> regards, tom lane
>
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
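
For readers following the complexity argument quoted above, here is a minimal sketch of the general idea of "saving more state across repeated calls". It is a simplified model in Python, not the code in wparser_def.c and not the actual logic of hlCover(). It assumes a "cover" is the shortest span that starts at a query-word occurrence and contains every query word; the function names `covers_rescan` and `covers_incremental` are hypothetical. The first function rescans the document from each start position, which is quadratic in the worst case; the second carries the window end and word counts forward, so each token is visited a bounded number of times.

```python
from collections import Counter


def covers_rescan(tokens, query):
    """Naive model: for every start position, scan forward from scratch.

    Worst case O(N^2) in the number of tokens.
    """
    query = set(query)
    out = []
    for start, tok in enumerate(tokens):
        if tok not in query:
            continue
        seen = set()
        for end in range(start, len(tokens)):
            if tokens[end] in query:
                seen.add(tokens[end])
                if seen == query:
                    out.append((start, end))
                    break
    return out


def covers_incremental(tokens, query):
    """Same result, but state (window end, counts) survives between starts.

    The window end only ever moves forward, so the total work is O(N).
    """
    query = set(query)
    counts = Counter()      # occurrences of each query word in the window
    missing = len(query)    # query words not yet present in the window
    end = -1                # window is tokens[start..end]
    out = []
    for start, tok in enumerate(tokens):
        # Extend the window until it contains every query word.
        while missing > 0 and end + 1 < len(tokens):
            end += 1
            t = tokens[end]
            if t in query:
                if counts[t] == 0:
                    missing -= 1
                counts[t] += 1
        if missing > 0:
            break  # no later start position can be covered either
        if tok in query:
            out.append((start, end))
            # Drop tokens[start] from the window before advancing.
            counts[tok] -= 1
            if counts[tok] == 0:
                missing += 1
    return out
```

Both functions return the same list of (start, end) pairs for the same input; only the amount of rescanning differs. Whether a comparable restructuring is possible in hlCover() itself depends on how it evaluates the query, which this sketch does not model.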