Re: procost for to_tsvector

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: procost for to_tsvector
Date: 2015-03-11 21:54:48
Message-ID: 87h9trs0zm.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> Nyet ... at least not without you actually making that argument,
Tom> with numbers, rather than just handwaving. We use 100 for plpgsql
Tom> and suchlike functions. I'd be OK with making it 10 just on
Tom> general principles, but claiming that it's as expensive as a
Tom> plpgsql function requires evidence.

[TL/DR: 10 isn't enough, even 100 may be too low]

On a text corpus consisting of ~18 thousand blog comments + ~5% of dead
rows, median length 302 bytes, only about 3% long enough to be toasted,
and selecting a common word (~22% of the table):

explain analyze select * from comments
where to_tsvector('english',message) @@ '''one'''::tsquery;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on comments (cost=0.00..2406.18 rows=4140 width=792) (actual time=0.601..3946.589 rows=4056 loops=1)
Filter: (to_tsvector('english'::regconfig, message) @@ '''one'''::tsquery)
Rows Removed by Filter: 14310
Planning time: 0.270 ms
Execution time: 3954.745 ms
(5 rows)

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on comments (cost=204.09..2404.30 rows=4140 width=792) (actual time=2.401..11.564 rows=4056 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, message) @@ '''one'''::tsquery)
Heap Blocks: exact=1911
-> Bitmap Index Scan on comments_to_tsvector_idx (cost=0.00..203.05 rows=4140 width=0) (actual time=1.974..1.974 rows=4313 loops=1)
Index Cond: (to_tsvector('english'::regconfig, message) @@ '''one'''::tsquery)
Planning time: 0.278 ms
Execution time: 17.640 ms
(7 rows)

(strangely, the seqscan plan is picked despite having a cost more than a
point higher? what's up with that?)

So for two plans with virtually identical cost, we have an execution
time difference on the order of 200x.

We can rule out the performance of the @@ by using a precalculated
tsvector:

explain analyze select * from comments where tsv @@ '''one'''::tsquery;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on comments (cost=0.00..2359.31 rows=4140 width=792) (actual time=0.023..47.746 rows=4056 loops=1)
Filter: (tsv @@ '''one'''::tsquery)
Rows Removed by Filter: 14310
Planning time: 0.262 ms
Execution time: 54.220 ms
(5 rows)

So we're looking at an execution time for to_tsvector on the order of
200us, which is a seriously big deal when looking at a potential
seqscan. That's not just _as_ expensive as a plpgsql function, but more
than 50 times as expensive as a simple one like this:

create function f1(text) returns integer language plpgsql
as $f$ begin return length($1); end; $f$;

select sum(length(message)) from comments; -- 89ms
select sum(f1(message)) from comments; -- 155ms

66ms difference divided by 18366 rows = 3.6us per call

Now, obviously the default cost for plpgsql functions is assuming that
the function is a whole lot more complex than that, so one wouldn't
argue that to_tsvector should cost 5000. But there's a strong case for
arguing that it should cost a whole lot more than 100, because even at
that value the relative costs for the first two plans in this post only
differ by 2x, compared to a 200x runtime difference. A value of 10
would be inadequate in many cases; in this example it leaves the slower
plan with a cost only ~15% higher, which is way too close to be
comfortable.

(As another example, a function with a simple query in it, such as
obj_description, can have runtimes on the order of 40us, still 5x faster
than to_tsvector.)

--
Andrew (irc:RhodiumToad)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Haribabu Kommi 2015-03-11 22:14:05 Re: Parallel Seq Scan
Previous Message Jim Nasby 2015-03-11 21:50:09 Re: proposal: searching in array function - array_position