tsvector pg_stats seems quite a bit off.

From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: pgsql-hackers(at)postgresql(dot)org
Subject: tsvector pg_stats seems quite a bit off.
Date: 2010-05-19 19:01:18
Message-ID: 4BF4357E.6000505@krogh.cc
Lists: pgsql-hackers

Hi.

I am working on getting full-text search to work and have
come across something that looks a bit strange.

The document base is around 350,000 documents, and
I have set the statistics target on the tsvector column
to 1000, since the default of 100 seems way off.
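
For reference, the per-column target was set roughly like this (exact
DDL from memory):

# ALTER TABLE reference ALTER COLUMN document_tsvector SET STATISTICS 1000;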

# ANALYZE verbose reference (document_tsvector);
INFO:  analyzing "reference"
INFO:  "reference": scanned 14486 of 14486 pages, containing 350174 live rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
ANALYZE

OK, so analyze examined almost all rows. Looking into
"most_common_freqs" I find:
# select count(unnest) from (select unnest(most_common_freqs) from pg_stats where attname = 'document_tsvector') as foo;
 count
-------
  2810
(1 row)

But the distribution is very "flat" at the end: the last 128 values are
exactly 1.00189e-05, which means that any term sitting outside the array
(the planner assumes half of the minimum stored frequency for a term it
has not seen) would get an estimate of

1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
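
The flat tail is easy to inspect directly; something along these lines
(same pg_stats row as above) lists the 128 smallest stored frequencies:

# select f from (select unnest(most_common_freqs) as f from pg_stats
    where attname = 'document_tsvector') as s order by f limit 128;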

So far I have no idea if this is bad or good, so here are a couple of
sample runs with a term that sits outside the "most_common_vals" array:

# explain analyze select id from efam.reference where document_tsvector @@ to_tsquery('searchterm') order by id limit 2000;
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35.99..35.99 rows=2 width=4) (actual time=20.717..28.135 rows=1612 loops=1)
   ->  Sort  (cost=35.99..35.99 rows=2 width=4) (actual time=20.709..23.190 rows=1612 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 124kB
         ->  Bitmap Heap Scan on reference  (cost=28.02..35.98 rows=2 width=4) (actual time=3.522..17.238 rows=1612 loops=1)
               Recheck Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..28.02 rows=2 width=0) (actual time=3.378..3.378 rows=1613 loops=1)
                     Index Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
 Total runtime: 30.743 ms
(9 rows)

OK, the query planner estimates that there are 2 rows, exactly as
predicted above, so the mechanism works as expected; but in fact there
are 1612 rows that match.

So analyze has sampled roughly 6 out of every 7 rows in the table, and
this term occurs in 1612/350174 rows ~ freq: 0.0046, which is way higher
than the lower bound of 1.00189e-05; it should have been sitting
somewhere around the middle of the 2810 collected values.
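
To double-check, the real frequency can be computed directly
(1612/350174 ~ 0.0046):

# select count(*)::float8 / 350174 from efam.reference
    where document_tsvector @@ to_tsquery('searchterm');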

So "most_common_vals" seems to contain a lot of values that should never
have been kept, at the expense of other values that are more common.

In practice, just cranking the statistics target up high enough seems to
solve the problem, but doesn't it look like something is wrong in how
the statistics are collected?

# select version();
                                                           version
---------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.0beta1 on x86_64-unknown-linux-gnu, compiled by GCC gcc-4.2.real (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64-bit

--
Jesper
