Re: [GENERAL] Fragments in tsearch2 headline

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-06-04 00:14:32
Message-ID: 1212538472.5848.29.camel@dragflick
Lists: pgsql-general, pgsql-hackers

My main argument for using Cover instead of hlCover was that Cover will
be faster. I tested the default headline generation that uses hlCover
with the current patch that uses Cover. There was not much difference.
So I think you are right in that we do not need norms and we can just
use hlCover.
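
For illustration only, the difference between the two approaches can be sketched as below. This is a Python toy, not the PostgreSQL C code: the function names, the dictionary standing in for tsvector positions, and the sample sentence are all hypothetical, and the query is assumed to reduce to the two lexemes 'freedom' and 'speech' combined with AND. The first helper visits every document word, as hlCover has to; the second merges only the position lists of the query terms, as Cover does. Both feed the same minimal-cover search.

```python
from collections import Counter


def hits_by_full_scan(words, terms):
    """hlCover-style input: visit every document word, matching or not."""
    return [(pos, w) for pos, w in enumerate(words, start=1) if w in terms]


def toy_positions(words):
    """Hypothetical stand-in for a tsvector: lexeme -> list of 1-based positions."""
    positions = {}
    for pos, w in enumerate(words, start=1):
        positions.setdefault(w, []).append(pos)
    return positions


def hits_by_positions(positions, terms):
    """Cover-style input: merge the position lists of the query terms only."""
    return sorted((pos, t) for t in terms for pos in positions.get(t, []))


def minimal_covers(hits, terms):
    """Return every (start, end) window that contains all terms and has no
    proper sub-window that also contains all terms (AND queries only)."""
    counts = Counter()
    covers = []
    left = 0
    for pos, term in hits:
        counts[term] += 1
        # Drop leading hits whose term occurs again later in the window.
        while counts[hits[left][1]] > 1:
            counts[hits[left][1]] -= 1
            left += 1
        # The window is minimal only if its last term is not duplicated inside it.
        if all(counts[t] > 0 for t in terms) and counts[term] == 1:
            covers.append((hits[left][0], pos))
    return covers


words = "freedom of speech is not freedom from criticism of speech".split()
terms = {"freedom", "speech"}

scan = minimal_covers(hits_by_full_scan(words, terms), terms)
merged = minimal_covers(hits_by_positions(toy_positions(words), terms), terms)
print(scan)
print(merged)
```

Both calls return the same covers; the second one only looks at four hits instead of ten words. Whether that saving is visible in practice depends on the real C implementation and on how the positions are obtained, which this sketch does not model.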

I also compared the performance of ts_headline with my first patch for
headline generation (the one that was a separate function and took a
tsvector as input). The performance was dramatically different, roughly
a threefold gap in both cases. For one query ts_headline took roughly
200 ms while headline_with_fragments took just 70 ms. On another query
ts_headline took 76 ms while headline_with_fragments took 24 ms. You can
find 'explain analyze' for the first query at the bottom of this message.

These queries were run multiple times to ensure that I never hit the
disk. This is a machine with a 2.0 GHz Pentium 4 CPU and 512 MB RAM
running Linux 2.6.22-gentoo-r8.

A couple of caveats: 

1. ts_headline testing was done with current CVS HEAD, whereas
headline_with_fragments was tested with PostgreSQL 8.3.1.

2. For headline_with_fragments, the tsvector for the document was
obtained by joining with another table.

Are these differences understandable?

If you think these caveats are the reason, or there is something I am
missing, then I can repeat the entire experiment under exactly the same
conditions.

-Sushant.


Here is 'explain analyze' for both functions:


ts_headline
------------

lawdb=# explain analyze SELECT ts_headline('english', doc, q, '')
            FROM    docraw, plainto_tsquery('english', 'freedom of speech') as q
            WHERE   docraw.tid = 125596;
                              QUERY PLAN

 Nested Loop  (cost=0.00..8.31 rows=1 width=497) (actual time=199.692..200.207 rows=1 loops=1)
   ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1 width=465) (actual time=0.041..0.065 rows=1 loops=1)
         Index Cond: (tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.010..0.014 rows=1 loops=1)
 Total runtime: 200.311 ms


headline_with_fragments
-----------------------

lawdb=# explain analyze SELECT headline_with_fragments('english', docvector, doc, q, 'MaxWords=40')
            FROM    docraw, docmeta, plainto_tsquery('english', 'freedom of speech') as q
            WHERE   docraw.tid = 125596 and docmeta.tid=125596;
                              QUERY PLAN
----------------------
 Nested Loop  (cost=0.00..16.61 rows=1 width=883) (actual time=70.564..70.949 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..16.59 rows=1 width=851) (actual time=0.064..0.094 rows=1 loops=1)
         ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1 width=454) (actual time=0.040..0.044 rows=1 loops=1)
               Index Cond: (tid = 125596)
         ->  Index Scan using docmeta_pkey on docmeta  (cost=0.00..8.29 rows=1 width=397) (actual time=0.017..0.040 rows=1 loops=1)
               Index Cond: (docmeta.tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.012..0.016 rows=1 loops=1)
 Total runtime: 71.076 ms
(8 rows)


On Tue, 2008-06-03 at 22:53 +0400, Teodor Sigaev wrote:
> > Why we need norms?
> 
> We don't need norms at all - all matched HeadlineWordEntry already marked by 
> HeadlineWordEntry->item! If it equals to NULL then this word isn't contained in 
> tsquery.
> 
> > hlCover does the exact thing that Cover in tsrank does which is to find
> > the  cover that contains the query. However hlcover has to go through
> > words that do not match the query. Cover on the other hand operates on
> > position indexes for just the query words and so it should be faster. 
> Cover, by definition, is a minimal continuous text's piece matched by query. May 
> be a several covers in text and hlCover will find all of them. Next, 
> prsd_headline() (for now) tries to define the best one. "Best" means: cover 
> contains a lot of words from query, not less that MinWords, not greater than 
> MaxWords, hasn't words shorter that ShortWord on the begin and end of cover etc.
> > 
> > The main reason why I would I like it to be fast is that I want to
> > generate all covers for a given query. Then choose covers with smallest
> hlCover generates all covers.
> 
> > Let me know what you think on this patch and I will update the patch to
> > respect other options like MinWords and ShortWord. 
> 
> As I understand, you very wish to call Cover() function instead of hlCover() - 
> by design, they should be identical, but accepts different document's 
> representation. So, the best way is generalize them: develop a new one which can 
> be called with some kind of callback or/and opaque structure to use it in both 
> rank and headline.
> 
> > 
> > NumFragments < 2:
> > I wanted people to use the new headline marker if they specify
> > NumFragments >= 1. If they do not specify the NumFragments or put it to
> Ok, but if you unify cover generation and NumFragments == 1 then result for old 
> and new algorithms should be the same...
> 
> 
> > On an another note I found that make_tsvector crashes if it receives a
> > ParsedText with curwords = 0. Specifically uniqueWORD returns curwords
> > as 1 even when it gets 0 words. I am not sure if this is the desired
> > behavior.
> In all places there is a check before call of make_tsvector.
> 

