Re: tsearch parser inefficiency if text includes urls or emails - new version

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, greg(at)2ndquadrant(dot)com, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-09 18:06:45
Message-ID: 200912091906.46404.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote:
> Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Could you show your testcase?
>
> OK. I was going to try to check other platforms first, and package
> up the information better, but here goes.
>
> I created 10000 lines with random IP-based URLs for a test. The
> first few lines are:
>
> create table t1 (c1 int not null primary key, c2 text);
> insert into t1 values (2,
> 'http://255.102.51.212/*/quick/brown/fox?jumps&over&*&lazy&dog.html
> http://204.56.222.143/*/quick/brown/fox?jumps&over&*&lazy&dog.html
> http://138.183.168.227/*/quick/brown/fox?jumps&over&*&lazy&dog.html
I think you see no real benefit, because your strings are rather short - the
documents I scanned when noticing the issue where rather long.
If your strings are short, they and the copy will fit into cpu cache anyway, so
copying them around/converting them to some other string format is not that
expensive compared to the rest of the work done.

Also after each copying step for large strings the complete cache is filled
with unrelated information (namely the end of the string). So every charwise
access will need to wait for a memory access.

A rather extreme/contrived example:

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres(at)anarazel(dot)de http://www.postgresql.org/'::text FROM generate_series(1,
1) g(i)), ' - '));
?column?
----------
1
(1 row)

Time: 3.740 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres(at)anarazel(dot)de http://www.postgresql.org/'::text FROM generate_series(1,
1000) g(i)), ' - '));
?column?
----------
1
(1 row)

Time: 115.027 ms

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres(at)anarazel(dot)de http://www.postgresql.org/'::text FROM generate_series(1,
10000) g(i)), ' - '));
?column?
----------
1
(1 row)

Time: 24355.339 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres(at)anarazel(dot)de http://www.postgresql.org/'::text FROM generate_series(1,
20000) g(i)), ' - '));
?column?
----------
1
(1 row)

Time: 47276.739 ms

One easily can see the quadratic complexity here. The quadratic complexity
lies in the length/amount of emails/urls of the strings, not in the number of
to_tsvector calls!

After the patch:

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres(at)anarazel(dot)de http://www.postgresql.org/'::text FROM generate_series(1,
20000) g(i)), ' - '));
?column?
----------
1
(1 row)

Time: 168.384 ms

I could not reproduce the slowdown you mentioned:

Without patch:
postgres=# SELECT to_tsvector('andres(at)anarazel(dot)de
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i)
ORDER BY g.i LIMIT 1;
to_tsvector
-------------------------------------------------------------------------------
'/1':4 'andres(at)anarazel(dot)de':1 'www.postgresql.org':3
'www.postgresql.org/1':2
(1 row)
Time: 1109.833 ms

With patch:
postgres=# SELECT to_tsvector('andres(at)anarazel(dot)de
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i)
ORDER BY g.i LIMIT 1;
to_tsvector
-------------------------------------------------------------------------------
'/1':4 'andres(at)anarazel(dot)de':1 'www.postgresql.org':3
'www.postgresql.org/1':2
(1 row)

Time: 1036.689 ms

So on the hardware I tried its even a little bit faster for small strings
(Older Xeon32bit, Core2 Duo, 64bit, Nehalem based Xeon 64bit).

I could not reproduce any difference with strings not involving urls or emails:

Without patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text)
FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;
to_tsvector
--------------------------------------------------------
'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6
(1 row)

Time: 988.426 ms

With patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text)
FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;
to_tsvector
--------------------------------------------------------
'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6
(1 row)

Time: 975.339 ms

So at least in my testing I do see no danger in the patch ;-)

Andres

PS: I averaged all the time result over multiple runs where it was relevant.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Smith 2009-12-09 18:18:01 Re: XLogInsert
Previous Message Peter Eisentraut 2009-12-09 18:00:22 Re: Need a mentor, and a project.