
Re: tsearch2 headline and postgresql.conf

From: pgsql-performance(at)nullmx(dot)com
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-performance(at)postgresql(dot)org
Subject: Re: tsearch2 headline and postgresql.conf
Date: 2006-01-22 20:29:01
Message-ID: 43D3EB0D.5020109@nullmx.com
Lists: pgsql-performance
Oleg Bartunov wrote:

> You didn't provide us any query with explain analyze.
> Just to make sure you're fine.
>
>     Oleg
> On Sun, 22 Jan 2006, pgsql-performance(at)nullmx(dot)com wrote:
>
>> Hi folks,
>>
>> I'm not sure if this is the right place for this, but I thought I'd 
>> ask.  I'm relatively new to Postgres, having only used it on 3 
>> projects, and am just delving into the setup and admin for the second 
>> time.
>>
>> I decided to try tsearch2 for this project's search requirements but 
>> am having trouble attaining adequate performance.  I think I've 
>> nailed it down to trouble with the headline() function in tsearch2. 
>> In short, there is a crawler that grabs HTML docs and places them in 
>> a database.  The search is done using tsearch2 pretty much installed 
>> according to instructions.  I have read a couple of online guides 
>> suggested by this list for tuning the postgresql.conf file.  I only 
>> made modest adjustments because I'm not working with top-end hardware 
>> and am still uncertain of the actual impact of the different 
>> parameters.
>>
>> I've been learning 'explain', and over the course of reading I have 
>> done enough query tweaking to discover that the source of my headache 
>> seems to be headline().
>>
>> On a query over 429 documents, where the average stored size of a 
>> stripped-down document is 21KB and the max is 518KB (an anomaly), 
>> tsearch2 performs exceptionally well, returning most queries in about 
>> 100ms.
>>
>> On the other hand, following the tsearch2 guide, which suggests 
>> returning that first portion as a subquery and then generating the 
>> headline() from those results, I see the query time increase to 4 seconds!
>>
>> This seems to be directly related to document size.  If I filter out 
>> that 518KB doc along with some 100KB docs by returning "substring( 
>> stripped_text FROM 0 FOR 50000) AS stripped_text" I decrease the time 
>> to 1.4 seconds, but increase the risk of not getting a headline.
>>
>> Seeing as how this problem is directly tied to document size, I'm 
>> wondering if there are any specific settings in postgresql.conf that 
>> may help, or is this just a fact of life for the headline() 
>> function?  Or, does anyone know what the problem is and how to 
>> overcome it?
>>
>
>     Regards,
>         Oleg
> _____________________________________________________________


------------------------

Hi Oleg,

Thanks for taking time to look at this.  Pardon my omission, I was 
writing that email rather late at night.

The following results from 'explain analyze' are from my
development machine, which is a dual PIII 600MHz with 512 MB RAM
running Debian Linux and Postgres 8.1.2.  The production machine
yields similar results, but it is a virtual server, so the
resources are rather unpredictable.  It is a quad processor and
has a larger result set in its DB.


The original query is:
explain analyze
SELECT url, title, headline(stripped_text,q,
                      'MaxWords=75, MinWords=25, 
StartSel=!!!REPLACE_ME!!!,StopSel=!!!/REPLACE_ME!!!'),
    rank, to_char(timezone('CST', date_last_changed), 'DD Mon YYYY') AS 
date_last_changed
FROM
( SELECT url_id, url, title, stripped_text, date_last_changed, q, 
rank(index_text, q) AS rank
  FROM (web_page w LEFT JOIN url u USING (url_id)), 
to_tsquery('big&search') AS q
  WHERE (index_text <> '') AND (index_text @@ q) AND (w.url_id NOT IN (1,2))
  AND (url NOT LIKE '%badurl.com%')
  ORDER BY rank DESC, date_last_changed DESC
  LIMIT 10 OFFSET 0
) AS useless
;


...and the resultant output of EXPLAIN ANALYZE is:

 Subquery Scan useless  (cost=8.02..8.04 rows=1 width=624) (actual 
time=769.131..2769.320 rows=10 loops=1)
   ->  Limit  (cost=8.02..8.02 rows=1 width=282) (actual 
time=566.798..566.932 rows=10 loops=1)
         ->  Sort  (cost=8.02..8.02 rows=1 width=282) (actual 
time=566.792..566.870 rows=10 loops=1)
               Sort Key: rank(w.index_text, q.q), w.date_last_changed
               ->  Nested Loop  (cost=2.00..8.01 rows=1 width=282) 
(actual time=4.068..563.128 rows=178 loops=1)
                     ->  Nested Loop  (cost=2.00..4.96 rows=1 width=221) 
(actual time=3.179..388.610 rows=179 loops=1)
                           ->  Function Scan on q  (cost=0.00..0.01 
rows=1 width=32) (actual time=0.025..0.028 rows=1 loops=1)
                           ->  Bitmap Heap Scan on web_page w  
(cost=2.00..4.94 rows=1 width=189) (actual time=3.123..387.547 rows=179 
loops=1)
                                 Filter: ((w.index_text <> ''::tsvector) 
AND (w.url_id <> 1) AND (w.url_id <> 2) AND (w.index_text @@ "outer".q))
                                 ->  Bitmap Index Scan on 
idx_index_text  (cost=0.00..2.00 rows=1 width=0) (actual 
time=1.173..1.173 rows=277 loops=1)
                                       Index Cond: (w.index_text @@ 
"outer".q)
                     ->  Index Scan using pk_url on url u  
(cost=0.00..3.03 rows=1 width=65) (actual time=0.044..0.049 rows=1 
loops=179)
                           Index Cond: ("outer".url_id = u.url_id)
                           Filter: (url !~~ '%badurl.com%'::text)
 Total runtime: 2771.023 ms
(15 rows)
-----
Maybe someone can help me with interpreting the ratio of cost to time, 
too.  Do the ratios look appropriate?

To give some further data, I've stripped down the query to only return 
what's necessary for examination here.


explain analyze
SELECT headline(stripped_text,q)
FROM
( SELECT stripped_text, q
  FROM (web_page w LEFT JOIN url u USING (url_id)), 
to_tsquery('big&search') AS q
  WHERE (index_text <> '') AND (index_text @@ q) AND (w.url_id NOT IN (1,2))
  LIMIT 10 OFFSET 0
) AS useless
;
 Subquery Scan useless  (cost=43.25..75.50 rows=1 width=64) (actual 
time=383.720..2066.962 rows=10 loops=1)
   ->  Limit  (cost=43.25..75.49 rows=1 width=129) (actual 
time=236.814..258.150 rows=10 loops=1)
         ->  Nested Loop  (cost=43.25..75.49 rows=1 width=129) (actual 
time=236.807..258.070 rows=10 loops=1)
               Join Filter: ("inner".index_text @@ "outer".q)
               ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) 
(actual time=0.033..0.033 rows=1 loops=1)
               ->  Merge Right Join  (cost=43.25..70.37 rows=409 
width=151) (actual time=235.603..237.283 rows=31 loops=1)
                     Merge Cond: ("outer".url_id = "inner".url_id)
                     ->  Index Scan using pk_url on url u  
(cost=0.00..806.68 rows=30418 width=4) (actual time=0.029..0.731 rows=39 
loops=1)
                     ->  Sort  (cost=43.25..44.27 rows=409 width=155) 
(actual time=235.523..235.654 rows=31 loops=1)
                           Sort Key: w.url_id
                           ->  Seq Scan on web_page w  (cost=0.00..25.51 
rows=409 width=155) (actual time=0.037..230.577 rows=409 loops=1)
                                 Filter: ((index_text <> ''::tsvector) 
AND (url_id <> 1) AND (url_id <> 2))
 Total runtime: 2081.569 ms
(13 rows)

As a demonstration of the effect of document size on headline speed, 
I've returned only the first 20,000 characters of each document in the 
subquery shown below.  20K is the average document size in this result 
set, with 518K being the max (an anomaly), plus a few 100K documents.

explain analyze
SELECT headline(stripped_text,q)
FROM
( SELECT substring(stripped_text FROM 0 FOR 20000) AS stripped_text, q
  FROM (web_page w LEFT JOIN url u USING (url_id)), 
to_tsquery('big&search') AS q
  WHERE (index_text <> '') AND (index_text @@ q) AND (w.url_id NOT IN (1,2))
  LIMIT 10 OFFSET 0
) AS useless
;

 Subquery Scan useless  (cost=43.25..75.51 rows=1 width=64) (actual 
time=316.049..906.045 rows=10 loops=1)
   ->  Limit  (cost=43.25..75.49 rows=1 width=129) (actual 
time=239.831..295.151 rows=10 loops=1)
         ->  Nested Loop  (cost=43.25..75.49 rows=1 width=129) (actual 
time=239.825..295.068 rows=10 loops=1)
               Join Filter: ("inner".index_text @@ "outer".q)
               ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) 
(actual time=0.021..0.021 rows=1 loops=1)
               ->  Merge Right Join  (cost=43.25..70.37 rows=409 
width=151) (actual time=234.711..236.265 rows=31 loops=1)
                     Merge Cond: ("outer".url_id = "inner".url_id)
                     ->  Index Scan using pk_url on url u  
(cost=0.00..806.68 rows=30418 width=4) (actual time=0.046..0.715 rows=39 
loops=1)
                     ->  Sort  (cost=43.25..44.27 rows=409 width=155) 
(actual time=234.613..234.731 rows=31 loops=1)
                           Sort Key: w.url_id
                           ->  Seq Scan on web_page w  (cost=0.00..25.51 
rows=409 width=155) (actual time=0.030..229.788 rows=409 loops=1)
                                 Filter: ((index_text <> ''::tsvector) 
AND (url_id <> 1) AND (url_id <> 2))
 Total runtime: 907.397 ms
(13 rows)

And finally, returning the whole document without the intervention of 
headline:

explain analyze
SELECT stripped_text
FROM
( SELECT stripped_text, q
  FROM (web_page w LEFT JOIN url u USING (url_id)), 
to_tsquery('big&search') AS q
  WHERE (index_text <> '') AND (index_text @@ q) AND (w.url_id NOT IN (1,2))
  LIMIT 10 OFFSET 0
) AS useless
;

 Subquery Scan useless  (cost=43.25..75.50 rows=1 width=32) (actual 
time=235.218..253.048 rows=10 loops=1)
   ->  Limit  (cost=43.25..75.49 rows=1 width=129) (actual 
time=235.210..252.994 rows=10 loops=1)
         ->  Nested Loop  (cost=43.25..75.49 rows=1 width=129) (actual 
time=235.204..252.953 rows=10 loops=1)
               Join Filter: ("inner".index_text @@ "outer".q)
               ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) 
(actual time=0.024..0.024 rows=1 loops=1)
               ->  Merge Right Join  (cost=43.25..70.37 rows=409 
width=151) (actual time=234.008..234.766 rows=31 loops=1)
                     Merge Cond: ("outer".url_id = "inner".url_id)
                     ->  Index Scan using pk_url on url u  
(cost=0.00..806.68 rows=30418 width=4) (actual time=0.058..0.344 rows=39 
loops=1)
                     ->  Sort  (cost=43.25..44.27 rows=409 width=155) 
(actual time=233.896..233.952 rows=31 loops=1)
                           Sort Key: w.url_id
                           ->  Seq Scan on web_page w  (cost=0.00..25.51 
rows=409 width=155) (actual time=0.031..229.057 rows=409 loops=1)
                                 Filter: ((index_text <> ''::tsvector) 
AND (url_id <> 1) AND (url_id <> 2))
 Total runtime: 254.057 ms
(13 rows)
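Putting the numbers together: the same subquery returns in 254 ms without headline() and 2082 ms with it, so headline() itself accounts for roughly 1.8 seconds, and truncating to 20K characters cuts that to well under a second. Until something better turns up, the best workaround I have is applying the substring() trick to the full production query. A rough sketch (the 50,000-character cutoff is arbitrary, and a match past that point will simply yield no headline for that row):

```sql
-- Sketch only: bound headline() cost by truncating each document
-- before headline() has to parse it.  The cutoff of 50000 characters
-- is an arbitrary choice, not a recommendation.
SELECT url, title,
       headline(substring(stripped_text FROM 0 FOR 50000), q,
                'MaxWords=75, MinWords=25') AS excerpt,
       rank
FROM
( SELECT url, title, stripped_text, q, rank(index_text, q) AS rank
  FROM (web_page w LEFT JOIN url u USING (url_id)),
       to_tsquery('big&search') AS q
  WHERE (index_text <> '') AND (index_text @@ q)
        AND (w.url_id NOT IN (1,2))
  ORDER BY rank DESC
  LIMIT 10 OFFSET 0
) AS useless;
```

This keeps headline() inside the outer query, so it only runs on the 10 rows that survive the LIMIT, and never sees more than 50K characters per row.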


Again, I really appreciate any help you folks can give with this.



