Skip site navigation (1) Skip section navigation (2)

Re: tsearch parser inefficiency if text includes urls or emails - new version

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, greg(at)2ndquadrant(dot)com, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-10 19:05:15
Message-ID: 200912102005.16560.andres@anarazel.de (view raw or flat)
Thread:
Lists: pgsql-hackers
On Thursday 10 December 2009 19:10:24 Kevin Grittner wrote:
> I wrote:
> > Thanks for the sample which shows the difference.
> 
> Ah, once I got on the right track, there is no problem seeing
> dramatic improvements with the patch.  It changes some nasty O(N^2)
> cases to O(N).  In particular, the fixes affect parsing of large
> strings encoded with multi-byte character encodings and containing
> email addresses or URLs with a non-IP-address host component.  It
> strikes me as odd that URLs without a slash following the host
> portion, or with an IP address, are treated so differently in the
> parser, but if we want to address that, it's a matter for another
> patch.
Same here. Generally I do have to say that I dont find that parser very nice - 
and actually I think its not very efficient as well because it searches a big 
part of the search space all the time.
I think a generated parser would be more efficient and way much easier to 
read...

> I'm inclined to think that the minimal differences found in some of
> my tests probably have more to do with happenstance of code
> alignment than the particulars of the patch.
Yes, I think that as well.

> I did find one significant (although easily solved) problem.  In the
> patch, the recursive setup of usewide, pgwstr, and wstr are not
> conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
> version.  Unless there's a good reason for that, the #ifdef should
> be added.
Uh. Sorry. Fixed.

> Less critical, but worth fixing one way or the other, TParserClose
> does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
> TParserCopyClose does.  I think this should be consistent.
Actually there was *some* thinking behind that: The end of parsing is quite 
obvious (the call returns, and its so verbose you will never do more than one 
call anyway) - but knowing where the copy ends is quite important to 
understand the parse flow.
I added a log there because in the end that line is not going to make any 
difference ;-)

> Sorry for the delay in review.  I hope there's still time to get
> this committed in this CF.
Thanks for your reviewing!

Actually I dont mind very much if it gets delayed or not. Its trivial enough 
that it shouldnt cause much work/conflicts/whatever next round and I am running 
patched versions anyway, so ...



Andres

Attachment: 0001-Fix-TSearch-inefficiency-because-of-repeated-copying.patch
Description: text/x-patch (3.9 KB)

In response to

Responses

pgsql-hackers by date

Next:From: Kevin GrittnerDate: 2009-12-10 19:07:37
Subject: Re: explain output infelicity in psql
Previous:From: Kevin GrittnerDate: 2009-12-10 19:05:13
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group