Re: For full text indexing, which is better, tsearch2 or

From: Steve Atkins <steve(at)blighty(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: For full text indexing, which is better, tsearch2 or
Date: 2003-11-28 05:04:17
Message-ID: 20031128050417.GA14227@gp.word-to-the-wise.com
Lists: pgsql-performance

On Wed, Nov 26, 2003 at 09:12:30PM -0800, Steve Atkins wrote:
> On Thu, Nov 27, 2003 at 12:41:59PM +0800, Christopher Kings-Lynne wrote:
> > >Does anyone have any metrics on how fast tsearch2 actually is?
> > >
> > >I tried it on a synthetic dataset of a million documents of a hundred
> > >words each and while insertions were impressively fast I gave up on
> > >the search after 10 minutes.
> > >
> > >Broken? Unusably slow? This was on the last 7.4 release candidate.
> >
> > I just created a 1.1million row dataset by copying one of our 30000 row
> > production tables and just taking out the txtidx column. Then I
> > inserted it into itself until it had 1.1 million rows.
> >
> > Then I created the GiST index - THAT took forever - seriously like 20
> > mins or half an hour or something.
> >
> > Now, to find a word:
> >
> > select * from tsearchtest where ftiidx ## 'curry';
> > Time: 9760.75 ms
>
> > So, I have no idea why you think it's slow? Perhaps you forgot the
> > 'create index using gist' step?
>
> No, it was indexed.
>
> Thanks, that was the datapoint I was looking for. It _can_ run fast, so
> I just need to work out what's going on. (It's hard to diagnose a slow
> query when you've no idea whether it's really 'slow').

Looking at it further, something is very broken, possibly with GiST
indices, possibly with tsearch2's use of them.

This is on a newly built 7.4 installation, built with 64-bit
datetimes, but completely stock other than that. Stock gcc 3.3.2,
Linux, somewhat elderly 2.4.18 kernel. Running on a 1.5GHz
single-processor Athlon with half a gig of RAM. Configuration set to
use 20% of RAM as shared buffers (amongst other settings; this was the
last of a range I tried while looking for any variation).

Software RAID0 across two 7200RPM SCSI drives, reiserfs (it's a
development box, not a production system). System completely idle
apart from postgresql.

269000 rows, each row having 400 words. Analyzed.

Running the select query given below appears to completely stall a
process trying to insert into the table (locking issue? I/O bandwidth?).

top shows the select below consuming <2% of CPU and iostat shows it reading
~2800 blocks/second from each of the two RAID drives.

Physical size of the database is under 3 gigs, including toast and index
tables.

The select query takes around 6 minutes (consistently, even when the
identical query is repeated).
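
As a rough cross-check of the iostat figure against the runtime, the arithmetic can be sketched as below. This is only a back-of-the-envelope estimate: the 512-byte block size is an assumption about how iostat reports blocks on this kernel, and the 6-minute duration is the rounded figure from the text, not a measured value.

```python
# Back-of-the-envelope I/O arithmetic from the figures quoted in this message.
# ASSUMPTION: iostat "blocks" are 512-byte sectors; this is not stated in the post.
BLOCK_BYTES = 512
BLOCKS_PER_SEC_PER_DRIVE = 2800   # "~2800 blocks/second from each" drive
DRIVES = 2                        # two-drive software RAID0
QUERY_SECONDS = 6 * 60            # "around 6 minutes"


def read_rate_bytes_per_sec(blocks_per_sec, drives, block_bytes=BLOCK_BYTES):
    """Aggregate read rate across all drives, in bytes per second."""
    return blocks_per_sec * drives * block_bytes


rate = read_rate_bytes_per_sec(BLOCKS_PER_SEC_PER_DRIVE, DRIVES)
total = rate * QUERY_SECONDS

print(f"aggregate read rate: {rate / 1e6:.2f} MB/s")
print(f"read during query:   {total / 1e9:.2f} GB")
```

Under those assumptions the query reads on the order of 1 GB at under 3 MB/s in aggregate, against a database of under 3 gigs. That is a very low rate for two drives, which would be consistent with scattered rather than sequential reads, though the numbers alone do not prove it.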

For entertainment, I turned off indexscan and the query takes 1
minute with a simple seqscan.

Any thoughts?

Cheers,
Steve

=> select count(*) from ftstest;
 count
--------
 269000
(1 row)

=> \d ftstest
                        Table "public.ftstest"
 Column |   Type   |                        Modifiers
--------+----------+----------------------------------------------------------
 idx    | integer  | not null default nextval('public.ftstest_idx_seq'::text)
 words  | text     | not null
 idxfti | tsvector | not null
Indexes:
    "ftstest_idx" gist (idxfti)

=> explain analyze select idx from ftstest where idxfti @@ 'dominican'::tsquery;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ftstest_idx on ftstest  (cost=0.00..515.90 rows=271 width=4) (actual time=219.694..376042.428 rows=4796 loops=1)
   Index Cond: (idxfti @@ '\'dominican\''::tsquery)
   Filter: (idxfti @@ '\'dominican\''::tsquery)
 Total runtime: 376061.541 ms
(4 rows)

(After running: set enable_indexscan = false;)

=> explain analyze select idx from ftstest where idxfti @@ 'dominican'::tsquery;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on ftstest  (cost=0.00..5765.88 rows=271 width=4) (actual time=42.589..62158.285 rows=4796 loops=1)
   Filter: (idxfti @@ '\'dominican\''::tsquery)
 Total runtime: 62182.277 ms
(3 rows)
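
To put the two plans side by side, here is a small sketch that derives a few ratios from the EXPLAIN ANALYZE output above. The numbers are copied from the plans and the row count; the variable names are just illustrative, and the ratios describe this one run only.

```python
# Ratios derived from the two EXPLAIN ANALYZE runs in this message.
INDEX_SCAN_MS = 376061.541   # total runtime, GiST index scan
SEQ_SCAN_MS = 62182.277      # total runtime, seq scan (enable_indexscan off)
ROWS_RETURNED = 4796         # actual rows in both plans
ROWS_ESTIMATED = 271         # planner estimate in both plans
TABLE_ROWS = 269000          # select count(*) from ftstest

slowdown = INDEX_SCAN_MS / SEQ_SCAN_MS          # index scan vs. seq scan
ms_per_row = INDEX_SCAN_MS / ROWS_RETURNED      # index scan cost per matching row
misestimate = ROWS_RETURNED / ROWS_ESTIMATED    # actual rows vs. planner estimate
selectivity = ROWS_RETURNED / TABLE_ROWS        # fraction of the table matching

print(f"index scan is {slowdown:.1f}x slower than the seq scan")
print(f"index scan spends {ms_per_row:.1f} ms per returned row")
print(f"planner underestimates matching rows by {misestimate:.1f}x")
print(f"query matches {selectivity:.2%} of the table")
```

So the index scan comes out about six times slower than reading the whole table, at roughly 78 ms per matching row, for a query that matches under 2% of the rows. The planner also expected far fewer matches than it got, which may or may not be related to the slowness.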
