Re: Idea for the statistics collector

From: Doug Fields <dfields(at)pexicom(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Idea for the statistics collector
Date: 2002-06-21 15:07:09
Message-ID: 5.1.0.14.2.20020621105749.02b99b78@pop.pexicom.com
Lists: pgsql-general pgsql-hackers


>Tom Lane wrote:
> > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > > Martijn van Oosterhout wrote:
> > >> Firstly, I was only thinking of going for the basic nodes (Index Scan,
> > >> Seq Scan, Distinct). Other types have far more variables. Secondly,
> > >> even if you only count, it's useful. For example, if it tells you that
> > >> the planner is off by a factor of 10 more than 75% of the time, that's
> > >> useful information independent of what the actual variables are.
> >
> > And reduce the usefulness even more ;-). As a pure stats-gathering
> > exercise it might be worth doing, but not if you only log the failure
> > cases. How will you know how well you are doing if you take a
> > biased-by-design sample?

Personally, given that at least once or twice a day someone asks about
performance or "why isn't my index being used?", I think doing this would be
a great idea.

Perhaps not necessarily in the full-fledged way, but by creating a sort of
"ANALYZE log," in which the optimizer's estimate for a query and the actual
result of that query are recorded, for every query. This could, of course,
be enabled or disabled on a per-connection basis, a per-table basis (like
OIDs), or whatever other basis makes life easiest for the developers.

Then, when the next ANALYZE is run, it could do its usual analysis, and
apply some additional heuristics based upon what it learns from the
"ANALYZE log," possibly to do several things:

1) Automatically increase/decrease the SET STATISTICS target used by the
analysis, for example increasing it as a table grows larger and its
"randomness" grows less than linearly with size (e.g., if you have only 50
or 60 distinct groups in a 1,000,000-row table, that certainly calls for a
higher SET STATISTICS, and I set it on my own tables).
2) Add a value to the statistics table, call it "index_heuristic" or
"random_page_adjustment_heuristic," which does nothing when it is 1, but
otherwise scales the estimated cost of an index scan versus a sequential
scan by that factor - and never change it by more than a few percent at
each ANALYZE.
3) Flag really bad query estimates in a second log (or maybe the regular
log) - let it analyze the queries and flag anything two or three standard
deviations out.
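Roughly, items 2 and 3 could work like the sketch below. Everything here is
illustrative, not an existing PostgreSQL facility: the "ANALYZE log" is
just a hypothetical list of (estimated rows, actual rows) pairs, and the
"few percent" clamp on the cost adjustment is an assumed value.

```python
import math
from statistics import mean, stdev

# Hypothetical "ANALYZE log": (estimated_rows, actual_rows) per query.
analyze_log = [
    (100, 110), (200, 190), (50, 48), (1000, 950),
    (10, 4000),          # one badly misestimated query
    (300, 320), (80, 75),
]

# Work with log10 of the estimate/actual ratio, so over- and
# under-estimates by the same factor are treated symmetrically.
errors = [math.log10(est / act) for est, act in analyze_log]

mu, sigma = mean(errors), stdev(errors)

# Item 3: flag anything more than two standard deviations out.
flagged = [pair for pair, e in zip(analyze_log, errors)
           if abs(e - mu) > 2 * sigma]

# Item 2: nudge a cost-adjustment factor toward the average bias,
# but never by more than a few percent per ANALYZE run (3% assumed).
adjustment = 1.0
step = max(-0.03, min(0.03, mu))   # clamp to +/- 3% per run
adjustment *= 1.0 + step

print(flagged)                 # the (10, 4000) outlier
print(round(adjustment, 4))
```

Because the per-run step is clamped, a single pathological batch of queries
can only drag the cost factor a little; it takes a persistent bias over many
ANALYZE runs to move it substantially.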

Now, I suggest all this stuff in the name of usability and
self-maintainability. Unfortunately, I don't have the wherewithal to
actually assist in development.

Another possibility is to default "enable_seqscan" (if I remember the
parameter name correctly - I did my optimizing a while ago) to OFF, so that
if there's an index, it will be used, regardless - as this seems to be what
the great majority of people expect to happen. And/or add this to a FAQ,
and let us all reply "see http://.../indexfaq.html." :)

Cheers,

Doug
