Re: [GENERAL] how to get accurate values in pg_statistic

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Mary Edie Meredith <maryedie(at)osdl(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>, osdldbt-general <osdldbt-general(at)lists(dot)sourceforge(dot)net>
Subject: Re: [GENERAL] how to get accurate values in pg_statistic
Date: 2003-09-07 16:22:55
Message-ID: 200309071622.h87GMt909846@candle.pha.pa.us
Lists: pgsql-performance


I have learned you can use:

SET seed = 0;

to force identical statistics every time you run ANALYZE.
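
For example, a minimal sketch (the table name t1 is hypothetical; the
"seed" parameter sets the backend's random-number-generator seed, the
same value setseed() takes):

    -- reseed, then sample; repeated runs should store identical stats
    SET seed = 0;
    ANALYZE t1;

    -- inspect what ANALYZE stored for the table
    SELECT staattnum, stanullfrac, stawidth, stadistinct
    FROM pg_statistic
    WHERE starelid = 't1'::regclass;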

---------------------------------------------------------------------------

Mary Edie Meredith wrote:
> I certainly don't claim that it is appropriate to force customers into a
> full analysis, particularly if random sampling versus a full scan of the
> data reveals little to no performance difference in the plans. Being
> able to sample accurately is _very_ nice for large tables.
>
> For our testing purposes, however, consistent results are extremely
> important. We have observed that a small difference in the plan for one
> of the 22 queries can cause a difference in the DBT-3 results. If this
> happens, a small change in performance runs between two Linux kernels
> may appear to be due to the kernels, when in fact it is due to the plan
> change.
>
> We know that the plans are _exactly_ the same if the data in the
> pg_statistic table is the same from run to run (all other things being
> equal). So what we need are identical optimizer inputs (the contents
> of pg_statistic) for the same table data on each run.
>
> I feel certain that the pg_statistic table will be identical from run
> to run if ANALYZE looks at every row, hence our hope to find a way to
> force that.
>
> We did runs overnight. We can confirm that VACUUM FULL ANALYZE does
> not produce the same pg_statistic contents from run to run. With the
> default default_statistics_target (10), the plans also differ.
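>
> (For reference, this is the command we looped overnight:)
>
>     VACUUM FULL ANALYZE;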
>
> We ran additional tests with default_statistics_target set to 1000 (the
> max, I believe). The plans are the same across the different runs, but
> the pg_statistic table still holds different values. The performance
> results of the runs are consistent (we would expect this with identical
> plans), and the resulting performance metrics are similar to the best
> plans we see using the default histogram size (good news).
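>
> (Roughly what we ran, sketched against one of the DBT-3 tables; the
> table name is just an example:)
>
>     -- raise the statistics target for this session, then re-gather
>     SET default_statistics_target = 1000;
>     ANALYZE lineitem;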
>
> However, we worry that one day the costs will change enough, for
> whatever reason, to cause a plan change, especially at a larger
> database scale factor (database size/row size).
>
> I know we appear to be an isolated case, but customers also do testing
> and may have the same consistency issues we have. I can also imagine
> cases where customers want to guarantee that plans stay the same
> (between replicated sites, for example). If two developers are
> analyzing changes to the optimizer, don't you want the costs used for
> testing on their two systems to be identical for comparison purposes?
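>
> (A sketch of how the two systems could be compared: dump the
> planner-visible statistics on each and diff the output:)
>
>     -- pg_stats is the human-readable view over pg_statistic
>     SELECT tablename, attname, null_frac, avg_width, n_distinct,
>            most_common_vals, histogram_bounds
>     FROM pg_stats
>     WHERE schemaname = 'public'
>     ORDER BY tablename, attname;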
>
> Anyway, IMHO an option such as ANALYZE FULL ("sampling" all rows) would
> be valuable. Any other ideas for how to force this without a code
> change are very welcome.
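>
> (And for completeness, the per-column knob Stephan pointed us at,
> with placeholder table/column names:)
>
>     -- per-column override; setting it to -1 reverts to the default
>     ALTER TABLE lineitem ALTER COLUMN l_shipdate SET STATISTICS 1000;
>     ANALYZE lineitem;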
>
> Thanks for your info!
>
>
>
> On Thu, 2003-09-04 at 16:16, Tom Lane wrote:
> > Mary Edie Meredith <maryedie(at)osdl(dot)org> writes:
> > > Stephan Szabo kindly responded to our earlier queries suggesting we look
> > > at default_statistics_target and ALTER TABLE ALTER COLUMN SET
> > > STATISTICS.
> >
> > > These determine the number of bins in the histogram for a given column.
> > > But for a large number of rows (for example 6 million) the maximum value
> > > (1000) does not guarantee that ANALYZE will do a full scan of the table.
> > > We do not see a way to guarantee the same statistics run to run without
> > > forcing ANALYZE to examine every row of every table.
> >
> > Do you actually still have a problem with the plans changing when the
> > stats target is above 100 or so? I think the notion of "force ANALYZE
> > to do a full scan" is inherently wrongheaded ... it certainly would not
> > produce numbers that have anything to do with ordinary practice.
> >
> > If you have data statistics that are so bizarre that the planner still
> > gets things wrong with a target of 1000, then I'd like to know more
> > about why.
> >
> > regards, tom lane
> >
> --
> Mary Edie Meredith <maryedie(at)osdl(dot)org>
> Open Source Development Lab
>
>
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
