|From:||Andres Freund <andres(at)anarazel(dot)de>|
|Cc:||"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, greg(at)2ndquadrant(dot)com, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru|
|Subject:||Re: tsearch parser inefficiency if text includes urls or emails - new version|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
On Thursday 10 December 2009 19:10:24 Kevin Grittner wrote:
> I wrote:
> > Thanks for the sample which shows the difference.
> Ah, once I got on the right track, there is no problem seeing
> dramatic improvements with the patch. It changes some nasty O(N^2)
> cases to O(N). In particular, the fixes affect parsing of large
> strings encoded with multi-byte character encodings and containing
> email addresses or URLs with a non-IP-address host component. It
> strikes me as odd that URLs without a slash following the host
> portion, or with an IP address, are treated so differently in the
> parser, but if we want to address that, it's a matter for another
Same here. Generally I do have to say that I dont find that parser very nice -
and actually I think its not very efficient as well because it searches a big
part of the search space all the time.
I think a generated parser would be more efficient and way much easier to
> I'm inclined to think that the minimal differences found in some of
> my tests probably have more to do with happenstance of code
> alignment than the particulars of the patch.
Yes, I think that as well.
> I did find one significant (although easily solved) problem. In the
> patch, the recursive setup of usewide, pgwstr, and wstr are not
> conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
> version. Unless there's a good reason for that, the #ifdef should
> be added.
Uh. Sorry. Fixed.
> Less critical, but worth fixing one way or the other, TParserClose
> does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
> TParserCopyClose does. I think this should be consistent.
Actually there was *some* thinking behind that: The end of parsing is quite
obvious (the call returns, and its so verbose you will never do more than one
call anyway) - but knowing where the copy ends is quite important to
understand the parse flow.
I added a log there because in the end that line is not going to make any
> Sorry for the delay in review. I hope there's still time to get
> this committed in this CF.
Thanks for your reviewing!
Actually I dont mind very much if it gets delayed or not. Its trivial enough
that it shouldnt cause much work/conflicts/whatever next round and I am running
patched versions anyway, so ...
|Next Message||Kevin Grittner||2009-12-10 19:07:37||Re: explain output infelicity in psql|
|Previous Message||Kevin Grittner||2009-12-10 19:05:13||Re: tsearch parser inefficiency if text includes urls or emails - new version|