Re: tsvector pg_stats seems quite a bit off.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-30 22:07:28
Message-ID: 20441.1275257248@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> Here's a patch against recent git, but should apply to 8.4 sources as
> well. It would be interesting to measure the memory and time needed to
> analyse the table after applying it, because we will be now using a lot
> bigger bucket size and I haven't done any performance impact testing on
> it.

I did a little bit of testing using a dataset I had handy (a couple
hundred thousand publication titles) and found that ANALYZE seems to be
noticeably but far from intolerably slower --- it's almost the same
speed at statistics targets up to 100, and even at the max setting of
10000 it's only maybe 25% slower. However I'm not sure if this result
will scale to very large document sets, so more testing would be a good
idea.

I committed the attached revised version of the patch. Revisions are
mostly minor but I did make two substantive changes:

* The patch changed the target number of mcelems from 10 *
statistics_target to just statistics_target. I reverted that since
I don't think it was intended; at least we hadn't discussed it.

* I modified the final processing to avoid one qsort step if there are
fewer than num_mcelems hashtable entries that pass the cutoff frequency
filter, and in any case to sort only those entries that pass it rather
than all of them. With the significantly larger number of hashtable
entries that will now be used, it seemed like a good thing to try to
cut the qsort overhead.
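
To illustrate the two points above, here is a minimal standalone sketch of that final step. It is not the committed code: TrackItem, compute_num_mcelems, select_mcelems and the plain-array layout are hypothetical stand-ins for the real hashtable entries, and the "strictly greater than the cutoff" test is an assumption made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for one hashtable entry: a lexeme and its count. */
typedef struct
{
    const char *lexeme;
    int         frequency;
} TrackItem;

/* Target number of most-common elements: 10 * statistics_target. */
static int
compute_num_mcelems(int statistics_target)
{
    return 10 * statistics_target;
}

/* qsort comparator: order items by descending frequency. */
static int
cmp_freq_desc(const void *a, const void *b)
{
    const TrackItem *ta = (const TrackItem *) a;
    const TrackItem *tb = (const TrackItem *) b;

    return (tb->frequency > ta->frequency) - (tb->frequency < ta->frequency);
}

/*
 * Copy the entries whose frequency exceeds cutoff_freq into out[], which
 * must have room for nitems entries.  Only the surviving entries are
 * sorted, and the sort is skipped entirely when no more than num_mcelems
 * of them passed the filter, since then all of them are kept anyway.
 * Returns the number of entries kept in out[].
 */
static int
select_mcelems(const TrackItem *items, int nitems,
               int cutoff_freq, int num_mcelems, TrackItem *out)
{
    int npassed = 0;
    int i;

    assert(nitems >= 0 && num_mcelems >= 0);

    /* Filter first, so that any later sort touches only the survivors. */
    for (i = 0; i < nitems; i++)
    {
        if (items[i].frequency > cutoff_freq)
            out[npassed++] = items[i];
    }

    /* Sort by frequency only when we actually have to drop something. */
    if (npassed > num_mcelems)
    {
        qsort(out, (size_t) npassed, sizeof(TrackItem), cmp_freq_desc);
        npassed = num_mcelems;
    }

    return npassed;
}
```

In this sketch, when few entries survive the filter they come back in their original (unsorted) order, which is where the saved qsort step comes from.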

regards, tom lane

Attachment: ts-typanalyze-fix-2.patch (text/x-patch, 11.0 KB)