Re: tsearch parser inefficiency if text includes urls or emails - new version

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, greg(at)2ndquadrant(dot)com, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-10 19:05:15
Message-ID: 200912102005.16560.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Thursday 10 December 2009 19:10:24 Kevin Grittner wrote:
> I wrote:
> > Thanks for the sample which shows the difference.
>
> Ah, once I got on the right track, there is no problem seeing
> dramatic improvements with the patch. It changes some nasty O(N^2)
> cases to O(N). In particular, the fixes affect parsing of large
> strings encoded with multi-byte character encodings and containing
> email addresses or URLs with a non-IP-address host component. It
> strikes me as odd that URLs without a slash following the host
> portion, or with an IP address, are treated so differently in the
> parser, but if we want to address that, it's a matter for another
> patch.
Same here. Generally I do have to say that I dont find that parser very nice -
and actually I think its not very efficient as well because it searches a big
part of the search space all the time.
I think a generated parser would be more efficient and way much easier to
read...

> I'm inclined to think that the minimal differences found in some of
> my tests probably have more to do with happenstance of code
> alignment than the particulars of the patch.
Yes, I think that as well.

> I did find one significant (although easily solved) problem. In the
> patch, the recursive setup of usewide, pgwstr, and wstr are not
> conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
> version. Unless there's a good reason for that, the #ifdef should
> be added.
Uh. Sorry. Fixed.

> Less critical, but worth fixing one way or the other, TParserClose
> does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
> TParserCopyClose does. I think this should be consistent.
Actually there was *some* thinking behind that: The end of parsing is quite
obvious (the call returns, and its so verbose you will never do more than one
call anyway) - but knowing where the copy ends is quite important to
understand the parse flow.
I added a log there because in the end that line is not going to make any
difference ;-)

> Sorry for the delay in review. I hope there's still time to get
> this committed in this CF.
Thanks for your reviewing!

Actually I dont mind very much if it gets delayed or not. Its trivial enough
that it shouldnt cause much work/conflicts/whatever next round and I am running
patched versions anyway, so ...

Andres

Attachment Content-Type Size
0001-Fix-TSearch-inefficiency-because-of-repeated-copying.patch text/x-patch 3.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2009-12-10 19:07:37 Re: explain output infelicity in psql
Previous Message Kevin Grittner 2009-12-10 19:05:13 Re: tsearch parser inefficiency if text includes urls or emails - new version