Re: tsearch parser inefficiency if text includes urls or emails - new version

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andres Freund" <andres(at)anarazel(dot)de>, <pgsql-hackers(at)postgresql(dot)org>
Cc: <greg(at)2ndquadrant(dot)com>,<oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-10 18:10:24
Message-ID: 4B20E530020000250002D305@gw.wicourts.gov
Lists: pgsql-hackers

I wrote:

> Thanks for the sample which shows the difference.

Ah, once I got on the right track, there is no problem seeing
dramatic improvements with the patch. It changes some nasty O(N^2)
cases to O(N). In particular, the fixes affect parsing of large
strings encoded with multi-byte character encodings and containing
email addresses or URLs with a non-IP-address host component. It
strikes me as odd that URLs without a slash following the host
portion, or with an IP address, are treated so differently in the
parser, but if we want to address that, it's a matter for another
patch.
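To make the O(N^2) versus O(N) point concrete, here is a rough cost model
rather than anything taken from the patch. It assumes (my reading, not
something verified against the code here) that the slow path comes from each
nested parser re-converting the rest of the text to wide characters, while
the fast path shares one converted buffer. The function names and the
`stride` parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical cost model, not PostgreSQL code: count the bytes converted
 * to wide characters while scanning a text of n bytes that contains a
 * URL-like or email-like token every `stride` bytes.
 */

/* Assumed slow path: each nested parser re-converts the remaining text. */
size_t
cost_reconvert(size_t n, size_t stride)
{
    size_t total = n;           /* initial conversion of the whole text */
    size_t pos;

    assert(stride > 0);
    for (pos = 0; pos < n; pos += stride)
        total += n - pos;       /* nested parser converts the rest again */
    return total;
}

/* Assumed fast path: nested parsers reuse the already-converted buffer. */
size_t
cost_shared(size_t n, size_t stride)
{
    assert(stride > 0);
    return n;                   /* one conversion, independent of tokens */
}
```

In this model, doubling n from 1000 to 2000 with a token every 100 bytes
takes the re-converting cost from 6500 to 23000 bytes, while the shared cost
only goes from 1000 to 2000.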

I'm inclined to think that the minimal differences found in some of
my tests probably have more to do with happenstance of code
alignment than the particulars of the patch.

I did find one significant (although easily solved) problem. In the
patch, the recursive setup of usewide, pgwstr, and wstr is not
conditioned by #ifdef USE_WIDE_UPPER_LOWER, as it is in the
non-patched version. Unless there's a good reason for that, the
#ifdef should be added.
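As an illustration of the kind of guard I mean, here is a minimal sketch.
Only the names usewide, pgwstr, wstr, and USE_WIDE_UPPER_LOWER come from the
code under review; the DemoParser struct, its field types, and
demo_copy_init are hypothetical stand-ins, not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser state; field types are assumed. */
typedef struct DemoParser
{
    const char   *str;      /* input text in the server encoding */
    size_t        lenstr;   /* length of str in bytes */
    bool          usewide;  /* true if a wide-character copy is in use */
    wchar_t      *wstr;     /* wide copy, shared with nested parsers */
    unsigned int *pgwstr;   /* stand-in for a pg_wchar buffer */
} DemoParser;

/*
 * Set up a nested parser that starts bytepos bytes (charpos characters)
 * into the parent's text, touching the wide-character fields only when
 * USE_WIDE_UPPER_LOWER is defined.
 */
void
demo_copy_init(DemoParser *dst, const DemoParser *src,
               size_t bytepos, size_t charpos)
{
    assert(bytepos <= src->lenstr);

    dst->str = src->str + bytepos;
    dst->lenstr = src->lenstr - bytepos;
    dst->usewide = false;
    dst->wstr = NULL;
    dst->pgwstr = NULL;

#ifdef USE_WIDE_UPPER_LOWER
    /* Share the parent's converted buffers instead of converting again. */
    dst->usewide = src->usewide;
    if (src->wstr != NULL)
        dst->wstr = src->wstr + charpos;
    if (src->pgwstr != NULL)
        dst->pgwstr = src->pgwstr + charpos;
#else
    (void) charpos;             /* unused without wide-character support */
#endif
}
```

Without the macro defined, the nested parser never sees wide-character
state, which matches how I would expect a build without that support to
behave.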

Less critical, but worth fixing one way or the other: TParserClose
does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
TParserCopyClose does. I think the two should be consistent.

Finally, there's that spelling error in the comment for
TParserCopyInit. Please fix.

If a patch is produced with fixes for these three things, I'd say
it'll be ready for committer. Until then, I'm marking it as Waiting
on Author.

Sorry for the delay in review. I hope there's still time to get
this committed in this CF.

-Kevin
