Re: Stats target increase vs compute_tsvector_stats()

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: Stats target increase vs compute_tsvector_stats()
Date: 2008-12-14 10:58:53
Message-ID: 4944E6ED.4070800@students.mimuw.edu.pl
Lists: pgsql-hackers

Tom Lane wrote:
> I started making the changes to increase the default and maximum stats
> targets 10X, as I believe was agreed to in this thread:
> http://archives.postgresql.org/pgsql-hackers/2008-12/msg00386.php
>
> I came across this bit in ts_typanalyze.c:
>
> /* We want statistic_target * 100 lexemes in the MCELEM array */
> num_mcelem = stats->attr->attstattarget * 100;
>
> I wonder whether the multiplier here should be changed? This code is
> new for 8.4, so we have zero field experience about what desirable
> lexeme counts are; but the prospect of up to a million lexemes in
> a pg_statistic entry doesn't seem quite right. I'm tempted to cut the
> multiplier to 10 so that the effective range of MCELEM sizes remains
> the same as what Jan had in mind when he wrote the code.
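
To make the arithmetic in the quoted text concrete, here is a minimal
sketch. It assumes the 10X bump takes the default target from 10 to 100
and the maximum from 1000 to 10000; mcelem_target() and
print_mcelem_range() are hypothetical helpers for illustration, not
functions from ts_typanalyze.c.

```c
#include <stdio.h>

/*
 * Hypothetical helper mirroring the quoted line:
 *     num_mcelem = stats->attr->attstattarget * multiplier;
 */
static long
mcelem_target(int attstattarget, int multiplier)
{
    return (long) attstattarget * multiplier;
}

/* Print the MCELEM size at the default and at the maximum stats target. */
static void
print_mcelem_range(const char *label, int default_target,
                   int max_target, int multiplier)
{
    printf("%-24s default=%ld max=%ld\n", label,
           mcelem_target(default_target, multiplier),
           mcelem_target(max_target, multiplier));
}
```

Calling print_mcelem_range("old targets, x100", 10, 1000, 100),
print_mcelem_range("new targets, x100", 100, 10000, 100) and
print_mcelem_range("new targets, x10", 100, 10000, 10) shows that the
third case covers the same range as the first, while the second is the
one that reaches a million lexemes.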

The origin of that bit is this post:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00556.php
and the following few downthread ones.

If we bump the default statistics target by 10X, then changing the
multiplier to 10 seems the right thing to do. The only thing that needs
caution is the frequency of pruning in the Lossy Counting algorithm,
which IIRC is correlated with the desired target length of the MCELEM
array.
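
For readers who have not seen Lossy Counting, here is a minimal sketch of
why the pruning frequency matters. It is a textbook version of the
algorithm, not the code in ts_typanalyze.c: the fixed-size table, the
Lc* names, and lc_bucket_width_for_target() with its factor argument are
all assumptions made for illustration. The point is only that pruning
runs once every bucket_width elements, so if the bucket width is derived
from the MCELEM target, changing the multiplier also changes how often
pruning happens.

```c
#include <string.h>

#define LC_MAX_TRACKED 1024     /* arbitrary cap for this sketch */

typedef struct
{
    char key[32];               /* lexeme text, truncated if longer */
    int  freq;                  /* occurrences counted since insertion */
    int  delta;                 /* maximum possible undercount */
} LcEntry;

typedef struct
{
    LcEntry entries[LC_MAX_TRACKED];
    int     nentries;
    int     bucket_width;       /* prune once per this many elements */
    int     bucket_current;     /* 1-based number of the current bucket */
    long    seen;               /* elements processed so far */
} LcState;

/* Hypothetical link between the MCELEM target and the pruning interval. */
static int
lc_bucket_width_for_target(int num_mcelem, int factor)
{
    return num_mcelem * factor;
}

static void
lc_init(LcState *st, int bucket_width)
{
    st->nentries = 0;
    st->bucket_width = bucket_width;
    st->bucket_current = 1;
    st->seen = 0;
}

/* Drop every entry whose freq + delta does not exceed the bucket number. */
static void
lc_prune(LcState *st)
{
    int kept = 0;

    for (int i = 0; i < st->nentries; i++)
        if (st->entries[i].freq + st->entries[i].delta > st->bucket_current)
            st->entries[kept++] = st->entries[i];
    st->nentries = kept;
}

/* Count one lexeme; returns -1 (and counts nothing) if the table is full. */
static int
lc_add(LcState *st, const char *lexeme)
{
    int i;

    for (i = 0; i < st->nentries; i++)
        if (strcmp(st->entries[i].key, lexeme) == 0)
            break;

    if (i < st->nentries)
        st->entries[i].freq++;
    else
    {
        LcEntry *e;

        if (st->nentries == LC_MAX_TRACKED)
            return -1;
        e = &st->entries[st->nentries++];
        strncpy(e->key, lexeme, sizeof(e->key) - 1);
        e->key[sizeof(e->key) - 1] = '\0';
        e->freq = 1;
        e->delta = st->bucket_current - 1;
    }

    /* At each bucket boundary, prune and move on to the next bucket. */
    st->seen++;
    if (st->seen % st->bucket_width == 0)
    {
        lc_prune(st);
        st->bucket_current++;
    }
    return 0;
}

/* Return the tracked count for a lexeme, or 0 if it was never kept. */
static int
lc_freq(const LcState *st, const char *lexeme)
{
    for (int i = 0; i < st->nentries; i++)
        if (strcmp(st->entries[i].key, lexeme) == 0)
            return st->entries[i].freq;
    return 0;
}
```

A smaller bucket width prunes more often and keeps the table small at
the cost of a larger counting error; a larger one does the opposite.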

BTW: I've been occupied with other things and might have missed some
discussions, but at some point using Lossy Counting to gather statistics
from regular columns, not only tsvectors, was considered. Wouldn't that
reduce the performance hit ANALYZE takes from raising
default_statistics_target?

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
