Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	pgsql-hackers(at)postgresql(dot)org, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc:	teodor(at)sigaev(dot)ru
Subject:	Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
Date:	2009-11-08 16:00:53
Message-ID:	200911081700.53726.andres@anarazel.de
Lists:	pgsql-hackers

On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
> While playing around/evaluating tsearch I noticed that to_tsvector is
> obscenely slow for some files. After some profiling I found that this is
> due to using a separate TParser in p_ishost/p_isURLPath in wparser_def.c. If
> a multibyte encoding is in use, TParserInit copies the whole remaining
> input and converts it to wchar_t or pg_wchar - for every email or
> protocol-prefixed URL in the document. Which obviously is bad.
>
> I solved the issue by having a separate TParserCopyInit/TParserCopyClose
> which reuses the already converted strings of the original TParser -
> only at different offsets.
>
> Another approach would be to get rid of the separate parser invocations -
> requiring a bunch of additional states. This seemed more complex to me, so
> I wanted to get some feedback first.
>
> Without patch:
> andres=# SELECT to_tsvector('english', document) FROM document WHERE
> filename = '/usr/share/doc/libdrm-nouveau1/changelog';
>
> ──────────────────────────────────────────────────────────────────────────
> ─────────────────────────── ...
> (1 row)
>
> Time: 5835.676 ms
>
> With patch:
> andres=# SELECT to_tsvector('english', document) FROM document WHERE
> filename = '/usr/share/doc/libdrm-nouveau1/changelog';
>
> ──────────────────────────────────────────────────────────────────────────
> ─────────────────────────── ...
> (1 row)
>
> Time: 395.341 ms
>
> I'll clean up the patch if it seems like a sensible solution...
As nobody commented, here is a corrected (stupid thinko) and cleaned-up
version. Does anyone care to comment on whether I am the only one who thinks
this is an issue?
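
To make the idea from the quoted message concrete, here is a minimal sketch in
C. It is not the attached patch: MiniParser, mini_parser_init,
mini_parser_copy_init, mini_parser_copy_close and conversion_count are
hypothetical names made up for illustration, and the "conversion" simply widens
each byte (an ASCII-only assumption) instead of doing a real multibyte
conversion. It only shows the principle: the sub-parser borrows the already
converted buffer at an offset instead of converting the remaining input again.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical, heavily simplified stand-in for the parser state. */
typedef struct MiniParser
{
    const char *str;      /* input bytes, not owned by the parser */
    size_t      lenstr;   /* input length in bytes */
    wchar_t    *wstr;     /* wide-character version of the input */
    size_t      lenwstr;  /* length of wstr in wide characters */
    size_t      posbyte;  /* current position in str */
    size_t      poschar;  /* current position in wstr */
} MiniParser;

/* Counts how often the expensive copy-and-convert step ran. */
static int conversion_count = 0;

/* Full init: allocates and converts the whole input (the costly path). */
MiniParser *
mini_parser_init(const char *str, size_t len)
{
    MiniParser *p = calloc(1, sizeof(MiniParser));

    if (p == NULL)
        return NULL;
    p->wstr = malloc((len + 1) * sizeof(wchar_t));
    if (p->wstr == NULL)
    {
        free(p);
        return NULL;
    }
    /* Assumption: ASCII input, so one byte maps to one wide character. */
    for (size_t i = 0; i < len; i++)
        p->wstr[i] = (wchar_t) (unsigned char) str[i];
    p->wstr[len] = L'\0';
    p->str = str;
    p->lenstr = len;
    p->lenwstr = len;
    conversion_count++;
    return p;
}

/* Copy init: shares the original's buffers, starting at its current position. */
MiniParser *
mini_parser_copy_init(const MiniParser *orig)
{
    MiniParser *p;

    assert(orig->posbyte <= orig->lenstr && orig->poschar <= orig->lenwstr);
    p = calloc(1, sizeof(MiniParser));
    if (p == NULL)
        return NULL;
    p->str = orig->str + orig->posbyte;
    p->lenstr = orig->lenstr - orig->posbyte;
    p->wstr = orig->wstr + orig->poschar;   /* borrowed, no new conversion */
    p->lenwstr = orig->lenwstr - orig->poschar;
    return p;
}

/* Copy close: frees only the struct; wstr still belongs to the original. */
void
mini_parser_copy_close(MiniParser *p)
{
    free(p);
}

/* Full close: frees the converted buffer and the struct. */
void
mini_parser_close(MiniParser *p)
{
    if (p == NULL)
        return;
    free(p->wstr);
    free(p);
}
```

With this layout, checking a host or URL path candidate costs one small
allocation per sub-parser, whereas calling the full init each time would copy
and convert the rest of the document for every email or URL it contains. The
price is that a copy must not outlive the parser it was made from.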

Andres

Attachment	Content-Type	Size
0001-Fix-TSearch-inefficiency-because-of-repeated-copying.patch	text/x-patch	3.2 KB

In response to

[PATCH] tsearch parser inefficiency if text includes urls or emails at 2009-11-01 15:19:43 from Andres Freund

Responses

Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version at 2009-11-08 16:41:15 from Kenneth Marshall
Re: tsearch parser inefficiency if text includes urls or emails - new version at 2009-11-14 00:03:33 from Kevin Grittner
