Re: Thoughts on statistics for continuously advancing columns

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Nathan Boley <npboley(at)gmail(dot)com>
Subject: Re: Thoughts on statistics for continuously advancing columns
Date: 2009-12-30 19:55:20
Message-ID: 6650.1262202920@sss.pgh.pa.us
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On Tue, 2009-12-29 at 22:08 -0500, Tom Lane wrote:
>> This seems like a fundamentally broken approach, first because "time
>> between analyzes" is not even approximately a constant, and second
>> because it assumes that we have a distance metric for all datatypes.

> Maybe you could compute a correlation between the column values and the
> transaction numbers to recognize a continuously advancing column. It
> wouldn't tell you much about how fast they are advancing, but at least
> the typical use cases of serial and current timestamp columns should
> clearly stick out. And then instead of assuming that a value beyond the
> histogram bound doesn't exist, you assume for example the average
> frequency, which should be pretty good for the serial and timestamp
> cases. (Next step: Fourier analysis ;-) )
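
For concreteness, a toy standalone sketch of that detection step might look
like the following --- nothing like what the backend actually does; the
sample data and the 0.95 cutoff are invented. It just asks whether the
sampled column values track the originating transaction numbers closely
enough to call the column "continuously advancing":

#include <math.h>
#include <stdio.h>

/*
 * Pearson correlation between two arrays --- here, the transaction ids that
 * created a sample of rows and the column values in those rows.  A result
 * near +1 is the "continuously advancing" signature (serial, now(), etc.).
 */
static double
pearson_r(const double *x, const double *y, int n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    int i;

    for (i = 0; i < n; i++)
    {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    return (sxy - sx * sy / n) /
        sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
}

int
main(void)
{
    /* invented (xmin, column value) pairs from an analyze-style sample */
    double xid[] = {1001, 1002, 1003, 1004, 1005, 1006};
    double val[] = {42, 43, 45, 46, 48, 49};
    double r = pearson_r(xid, val, 6);

    printf("r = %.3f -> %s\n", r,
           r > 0.95 ? "looks continuously advancing" : "no clear trend");
    return 0;
}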

Actually, the histogram hasn't got much of anything to do with estimates
of the number of occurrences of a single value.

Josh hasn't shown us his specific problem query, but I would bet that
it's roughly like WHERE update_time > now() - interval 'something',
that is, the estimate that's problematic is an inequality not an
equality. When the range being asked for is outside the histogram
bounds, it really is rather difficult to come up with a reasonable
estimate --- you'd need a specific idea of how far outside the upper
bound it is, how fast the upper bound has been advancing, and how long
it's been since the last analyze. (I find the last bit particularly
nasty, because it will mean that plans change even when "nothing is
changing" in the database.)

[ thinks for awhile ... ]

Actually, in the problematic cases, it's interesting to consider the
following strategy: when scalarineqsel notices that it's being asked for
a range estimate that's outside the current histogram bounds, first try
to obtain the actual current max() or min() of the column value --- this
is something we can get fairly cheaply if there's a btree index on the
column. If we can get it, plug it into the histogram, replacing the
high or low bin boundary. Then estimate as we currently do. This would
work reasonably well as long as re-analyzes happen at a time scale such
that the histogram doesn't move much overall, ie, the number of
insertions between analyzes isn't a lot compared to the number of rows
per bin. We'd have some linear-in-the-bin-size estimation error because
the modified last or first bin actually contains more rows than other
bins, but it would certainly work a lot better than it does now.
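
In code terms the tweak amounts to something like this --- again just a
sketch, not backend code, with the cheap btree-endpoint lookup stubbed out
as a hypothetical get_live_max(): if the probe constant is beyond the stored
upper bound, refresh that bound from the live maximum, then estimate exactly
as before.

#include <stdio.h>

/* hypothetical stand-in for an index-assisted "SELECT max(col)" */
static double
get_live_max(void)
{
    return 1500;    /* rows have kept arriving since the last analyze */
}

static double
frac_greater_than(double *bounds, int nbounds, double c)
{
    int nbins = nbounds - 1;
    int i;

    /* the proposed tweak: refresh a stale upper boundary before estimating */
    if (c > bounds[nbounds - 1])
    {
        double live_max = get_live_max();

        if (live_max > bounds[nbounds - 1])
            bounds[nbounds - 1] = live_max;
    }

    if (c < bounds[0])
        return 1.0;
    if (c >= bounds[nbounds - 1])
        return 0.0;

    for (i = 0; i < nbins; i++)
    {
        if (c < bounds[i + 1])
        {
            double into = (c - bounds[i]) / (bounds[i + 1] - bounds[i]);

            return (nbins - i - into) / nbins;
        }
    }
    return 0.0;
}

int
main(void)
{
    double bounds[] = {0, 250, 500, 750, 1000};

    /* "col > 1100" now falls inside the refreshed last bin */
    printf("col > 1100: %.3f\n", frac_greater_than(bounds, 5, 1100));
    return 0;
}

That query now gets roughly 0.13 instead of zero.  The residual error is
exactly the linear-in-the-bin-size one above --- the stretched last bin
really holds more rows than the others --- but that's a far smaller mistake
than assuming the rows don't exist at all.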

regards, tom lane
