Re: Thoughts on statistics for continuously advancing columns

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Nathan Boley <npboley(at)gmail(dot)com>
Subject: Re: Thoughts on statistics for continuously advancing columns
Date: 2010-01-04 02:56:09
Message-ID: 20469.1262573769@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Actually, in the problematic cases, it's interesting to consider the
> following strategy: when scalarineqsel notices that it's being asked for
> a range estimate that's outside the current histogram bounds, first try
> to obtain the actual current max() or min() of the column value --- this
> is something we can get fairly cheaply if there's a btree index on the
> column. If we can get it, plug it into the histogram, replacing the
> high or low bin boundary. Then estimate as we currently do. This would
> work reasonably well as long as re-analyzes happen at a time scale such
> that the histogram doesn't move much overall, ie, the number of
> insertions between analyzes isn't a lot compared to the number of rows
> per bin. We'd have some linear-in-the-bin-size estimation error because
> the modified last or first bin actually contains more rows than other
> bins, but it would certainly work a lot better than it does now.

I've applied a patch to HEAD that does the above. Can you test it to
see how well it fixes your problem?
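
For anyone who wants to see the mechanics without reading selfuncs.c,
here is a standalone toy model of that strategy --- my own
simplification for illustration, not the committed code, with made-up
names throughout.  The histogram is an array of sorted bin boundaries
from the last ANALYZE; patch_bounds() plugs in a freshly probed
min/max, as a btree index can supply cheaply, and estimate_lt() then
interpolates the way scalarineqsel does:

#include <stdio.h>

#define NBOUNDS 5               /* 4 equal-population bins */

static void
patch_bounds(double *bounds, int n, double actual_min, double actual_max)
{
    /*
     * Plug the probed extremes into the stale histogram.  (Simplified:
     * the real code only bothers when the requested range falls outside
     * the recorded bounds.)
     */
    if (actual_min < bounds[0])
        bounds[0] = actual_min;
    if (actual_max > bounds[n - 1])
        bounds[n - 1] = actual_max;
}

static double
estimate_lt(const double *bounds, int n, double c)
{
    int     nbins = n - 1;
    int     i;

    if (c <= bounds[0])
        return 0.0;
    if (c >= bounds[n - 1])
        return 1.0;
    for (i = 0; i < nbins; i++)
    {
        if (c < bounds[i + 1])
        {
            /* each bin holds ~1/nbins of the rows; interpolate within */
            double  frac = (c - bounds[i]) / (bounds[i + 1] - bounds[i]);

            return (i + frac) / nbins;
        }
    }
    return 1.0;                 /* not reached */
}

int
main(void)
{
    /* boundaries from the last ANALYZE; the column has since advanced */
    double  bounds[NBOUNDS] = {1, 250, 500, 750, 1000};

    printf("stale:   x > 1200 -> %.3f\n",
           1.0 - estimate_lt(bounds, NBOUNDS, 1200));
    patch_bounds(bounds, NBOUNDS, 1.0, 2000.0);     /* cheap index probe */
    printf("patched: x > 1200 -> %.3f\n",
           1.0 - estimate_lt(bounds, NBOUNDS, 1200));
    return 0;
}

With the stale bounds the "x > 1200" estimate is zero; after patching
it comes out near 0.16 of the table, subject only to the
linear-in-the-bin-size error described above.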

Looking at the current uses of the histogram stats, there is another
place that could possibly benefit from this type of on-demand index
search: mergejoinscansel. That code attempts to estimate how much of a
column actually needs to be scanned by a merge join, recognizing that a
merge will stop reading either input once the other is exhausted.
Having an accurate idea of the join keys' upper bounds is fairly
important here, since the supposed cost reduction from not scanning all
of a table disappears if there's even one large-valued key on the other
side of the join. On the other hand, using index searches in
mergejoinscansel would put them into the standard, very heavily used
join planning path, rather than the corner case they occupy right now.
So I'm fairly nervous about the potential hit on join planning time.
Is there anyone around who can do planner timing
measurements on complex join queries involving large tables? If so,
please try CVS HEAD as-is and after enabling this bit in selfuncs.c:

/*
* XXX It's very tempting to try to use the actual column min and max,
* if we can get them relatively-cheaply with an index probe. However,
* since this function is called many times during join planning,
* that could have unpleasant effects on planning speed. Need more
* investigation before enabling this.
*/
#ifdef NOT_USED
if (get_actual_variable_range(root, vardata, sortop, min, max))
return true;
#endif
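
(To enable it, delete the #ifdef NOT_USED and #endif lines so the call
compiles.)  For anyone not familiar with mergejoinscansel, here is a
toy model of the estimate at stake --- again a made-up illustration,
not the real planner code.  A merge join reads both inputs in key
order and stops reading one side as soon as the other is exhausted, so
the fraction of one input actually scanned is roughly the fraction of
its keys that do not exceed the other side's maximum key:

#include <stdio.h>
#include <stddef.h>

/*
 * keys[] is sorted ascending; returns the fraction of keys a merge
 * join would read before the other input (largest key = other_max)
 * runs out.
 */
static double
scanned_fraction(const double *keys, size_t nkeys, double other_max)
{
    size_t  lo = 0,
            hi = nkeys;

    /* binary search for the first key greater than other_max */
    while (lo < hi)
    {
        size_t  mid = lo + (hi - lo) / 2;

        if (keys[mid] <= other_max)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (double) lo / (double) nkeys;
}

int
main(void)
{
    double  a[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    /* other side tops out at 4: only ~40% of 'a' gets read ... */
    printf("%.2f\n", scanned_fraction(a, 10, 4.0));
    /* ... but one stray large key on the other side reads it all */
    printf("%.2f\n", scanned_fraction(a, 10, 99.0));
    return 0;
}

The result depends only on other_max, which is why a single
large-valued key on the other side of the join wipes out the supposed
savings (it drives the scanned fraction to 1.0), and why a stale
histogram bound here translates directly into a bad merge-join cost
estimate.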

regards, tom lane
