Re: bitmap heap scan way cheaper than seq scan on the same amount of tuples (fts-search).

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: pgsql-performance(at)postgresql(dot)org, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: bitmap heap scan way cheaper than seq scan on the same amount of tuples (fts-search).
Date: 2009-10-29 15:22:54
Message-ID: 5947.1256829774@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Jesper Krogh <jesper(at)krogh(dot)cc> writes:
> I'm currently trying to figure out why the tsearch performance seems to
> vary a lot between different queryplans. I have created a sample dataset
> that sort of resembles the data I have to work on.

> The script that builds the dataset is at:
> http://krogh.cc/~jesper/build-test.pl
> and http://krogh.cc/~jesper/words.txt is needed for it to run.

I got around to looking at this example finally, and I can reproduce
your results pretty closely. I think there are two things going on:

1. The cost estimates for to_tsquery and ts_match_vq don't reflect the
actually-rather-high costs of those functions. Since the seqscan plan
executes these functions many more times than the indexscan plan, that
results in a relative cost error. There's already been some discussion
of changing the default costs for the tsearch functions, but nothing's
been done yet. However, that seems to be a relatively small problem
compared to...

2. The planner is estimating that most of the GIN index has to be
examined --- specifically, it estimates (pretty accurately) that
40188 out of 50000 table rows will match, and the default assumption
is that that means 40188/50000 of the index blocks will have to be
read. On my machine the index amounts to 39076 blocks, so we
estimate 31407 index blocks have to be read, and that's why the cost
estimate for the indexscan is so huge. The *actual* number of index
blocks read for the query, according to the stats collector, is 66.

So it appears that genericcostestimate() is just completely
inappropriate for GIN indexes, at least when considering common terms.
I guess that's not so astonishing when you remember that GIN isn't built
around per-heap-tuple entries the way the other index types are.
Oleg, Teodor, can you suggest any better cost metric to use for GIN?

regards, tom lane

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Bob Lunney 2009-10-29 15:24:17 Re: sub-select in IN clause results in sequential scan
Previous Message Peter Meszaros 2009-10-29 14:44:05 database size growing continously