Re: Thoughts on statistics for continuously advancing columns

From: Chetan Suttraway <chetan(dot)suttraway(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Nathan Boley <npboley(at)gmail(dot)com>
Subject: Re: Thoughts on statistics for continuously advancing columns
Date: 2010-01-05 12:49:02
Message-ID: d26e86811001050449y4bd30e4m1df9b16588d00ea1@mail.gmail.com
Lists: pgsql-hackers

Hi,

My suggestion is to keep two sets of histograms: one generated by running
ANALYZE, and a second generated dynamically from the entries that are logged
for insert/update/delete operations.
I am not sure how difficult it is to read such record details from the logs.
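
To give an idea of what I mean by the dynamically generated side, here is a
minimal sketch (illustrative only; all names are made up and it is not tied to
any existing backend hook) of keeping a bounded reservoir sample of newly
inserted values for one column, from which the temporary histogram could later
be built:

/*
 * Illustrative only: a bounded reservoir sample (Algorithm R) of newly
 * inserted values for one column.  All names are hypothetical, and a
 * real version would deal in Datums rather than doubles.
 */
#include <stdlib.h>

typedef struct NewValueReservoir
{
    double     *values;      /* sampled values (scalar stand-in for Datum) */
    int         capacity;    /* maximum sample size */
    long        seen;        /* number of new rows offered so far */
} NewValueReservoir;

static void
reservoir_init(NewValueReservoir *r, int capacity)
{
    r->values = malloc(capacity * sizeof(double));
    r->capacity = capacity;
    r->seen = 0;
}

/* Offer one newly inserted value to the sample. */
static void
reservoir_add(NewValueReservoir *r, double value)
{
    r->seen++;
    if (r->seen <= r->capacity)
        r->values[r->seen - 1] = value;    /* still filling the reservoir */
    else
    {
        long    slot = rand() % r->seen;   /* keep with probability capacity/seen */

        if (slot < r->capacity)
            r->values[slot] = value;
    }
}

This is essentially the same sampling idea ANALYZE already relies on, just fed
continuously from the logged inserts instead of a full pass over the table.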

From the details mentioned here, my understanding is that the table data (the
timestamp column) grows incrementally; that is, the existing data is not
modified to any great extent and new data is merely appended to the old data.
In this case, the only work needed for analyze/statistics generation is to
merge the histogram of the newly added records into the old histogram.

If we can treat this case as similar to merging the histograms of two tables
being joined, where we generate a histogram for the cartesian (result) node,
then all we need to do is somehow build a temporary histogram for the new set
of records and merge it with the old one.
The interesting columns of the new records could be read from the logged
entries mentioned above.
We presumably already have the histogram-merging part, and I hope making these
calls will not be costly enough to affect the planner.
(Further, my opinion is to calculate the cost of this histogram generation and
feed it into costing in some way.)
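
To make the merging idea concrete, here is a minimal sketch (not PostgreSQL
code; it assumes scalar values representable as doubles, whereas a real
implementation would work with Datums and type-specific comparators) of
folding a temporary equi-depth histogram into an existing one, weighting each
side by the number of rows it represents:

/*
 * Sketch only: merge two equi-depth histograms over a scalar column.
 * old_bounds[0..n_old] describe n_old buckets covering old_rows rows;
 * new_bounds[0..n_new] describe the temporary histogram covering the
 * new_rows appended rows.  n_out + 1 merged bounds are written into
 * out_bounds.
 */
#include <stdlib.h>

typedef struct
{
    double      val;        /* upper bound of a source bucket */
    double      rows;       /* rows attributed to that bucket */
} WeightedBound;

static int
cmp_wb(const void *a, const void *b)
{
    double      da = ((const WeightedBound *) a)->val;
    double      db = ((const WeightedBound *) b)->val;

    return (da > db) - (da < db);
}

static void
merge_histograms(const double *old_bounds, int n_old, double old_rows,
                 const double *new_bounds, int n_new, double new_rows,
                 double *out_bounds, int n_out)
{
    int         npts = n_old + n_new;
    WeightedBound *pts = malloc(npts * sizeof(WeightedBound));
    double      per_bucket = (old_rows + new_rows) / n_out;
    double      acc = 0.0;
    int         i,
                k = 1;

    /* Each source bucket contributes its upper bound, weighted by the
     * rows that bucket represents (the buckets are equi-depth). */
    for (i = 0; i < n_old; i++)
    {
        pts[i].val = old_bounds[i + 1];
        pts[i].rows = old_rows / n_old;
    }
    for (i = 0; i < n_new; i++)
    {
        pts[n_old + i].val = new_bounds[i + 1];
        pts[n_old + i].rows = new_rows / n_new;
    }
    qsort(pts, npts, sizeof(WeightedBound), cmp_wb);

    /* Lowest and highest merged bounds come straight from the inputs. */
    out_bounds[0] = (old_bounds[0] < new_bounds[0]) ?
        old_bounds[0] : new_bounds[0];
    out_bounds[n_out] = (old_bounds[n_old] > new_bounds[n_new]) ?
        old_bounds[n_old] : new_bounds[n_new];

    /* Walk the sorted bounds, emitting a merged bound each time the
     * accumulated row count crosses another per_bucket quantum. */
    for (i = 0; i < npts && k < n_out; i++)
    {
        acc += pts[i].rows;
        while (k < n_out && acc >= k * per_bucket)
            out_bounds[k++] = pts[i].val;
    }
    /* Guard against round-off leaving trailing slots unset. */
    while (k < n_out)
        out_bounds[k++] = pts[npts - 1].val;

    free(pts);
}

Each source bucket is treated as uniformly filled, which is the same
assumption the planner already makes within a histogram bucket, so the merged
histogram should be no worse than its inputs in that respect.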

Further, we can set a threshold to make this merge happen automatically: only
once the temporary histogram reaches the configured threshold would it be
merged into the older histogram.
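
As a rough analogy to how autovacuum decides when to analyze (a fixed
threshold plus a scale factor times the table size), the trigger could look
something like the following; the parameters here are hypothetical, not
existing settings:

#include <stdbool.h>

/*
 * Hypothetical check: merge the temporary histogram into the stored one
 * only once the rows it covers exceed a threshold.  merge_threshold and
 * merge_scale_factor are made-up knobs in the spirit of the autovacuum
 * analyze threshold.
 */
static bool
temp_histogram_needs_merge(double temp_rows, double rel_tuples,
                           double merge_threshold, double merge_scale_factor)
{
    return temp_rows >= merge_threshold + merge_scale_factor * rel_tuples;
}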

Please share your inputs.

Regards,
Chetan

On Wed, Dec 30, 2009 at 8:38 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > My thoughts on dealing with this intelligently without a major change to
> > statstics gathering went along these lines:
>
> > 1. add columns to pg_statistic to hold estimates of upper and lower
> > bounds growth between analyzes.
>
> This seems like a fundamentally broken approach, first because "time
> between analyzes" is not even approximately a constant, and second
> because it assumes that we have a distance metric for all datatypes.
> (Note that convert_to_scalar does not assume that it can measure
> arbitrary distances, but only fractions *within* a histogram bucket;
> and even that is pretty shaky.)
>
> I don't have a better idea at the moment :-(
>
> regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
