Re: Cross-column statistics revisited

From: Richard Huxton <dev(at)archonet(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 11:17:49
Message-ID: 48F8745D.1050205@archonet.com
Lists: pgsql-hackers

Gregory Stark wrote:
> They're certainly very much not independent variables. There are lots of ways
> of measuring how much dependence there is between them. I don't know enough
> about the math to know if your maps are equivalent to any of them.

I think "dependency" captures the way I think about it better than
"correlation" does (although I can see there must be a function that could
map that dependency onto how we think of correlations).

> In any case as I described it's not enough information to know that the two
> data sets are heavily dependent. You need to know for which pairs (or ntuples)
> that dependency results in a higher density and for which it results in lower
> density and how much higher or lower. That seems like a lot of information to
> encode (and a lot to find in the sample).

Like Josh Berkus mentioned a few points back, it's the handful of
plan-changing values you're looking for.

So, it seems like we've got:
1. Implied dependencies: zip-code=>city
2. Implied+constraint: start-date < end-date and the difference between
the two is usually less than a week
3. "Top-heavy" foreign-key stats.

#1 and #2 obviously need new infrastructure.
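
To illustrate why #1 needs something beyond per-column stats, here is a
minimal sketch (not planner code) comparing the selectivity you get by
multiplying per-column frequencies with the real selectivity when zip-code
determines city. The sample rows and the function names are made up for the
example.

```python
# Hypothetical sample of (zip_code, city) rows; each zip maps to one city.
rows = [
    ("10001", "New York"), ("10001", "New York"), ("10002", "New York"),
    ("60601", "Chicago"), ("60601", "Chicago"), ("60602", "Chicago"),
    ("94102", "San Francisco"), ("94103", "San Francisco"),
]


def selectivity_independent(rows, zip_code, city):
    """Estimate P(zip AND city) as P(zip) * P(city), assuming independence."""
    n = len(rows)
    p_zip = sum(1 for z, _ in rows if z == zip_code) / n
    p_city = sum(1 for _, c in rows if c == city) / n
    return p_zip * p_city


def selectivity_actual(rows, zip_code, city):
    """Fraction of rows that really match both predicates."""
    n = len(rows)
    return sum(1 for z, c in rows if z == zip_code and c == city) / n


def dependency_degree(rows):
    """Fraction of rows whose zip code maps to exactly one city in the sample."""
    cities_per_zip = {}
    for z, c in rows:
        cities_per_zip.setdefault(z, set()).add(c)
    consistent = sum(1 for z, _ in rows if len(cities_per_zip[z]) == 1)
    return consistent / len(rows)


if __name__ == "__main__":
    print(selectivity_independent(rows, "10001", "New York"))
    print(selectivity_actual(rows, "10001", "New York"))
    print(dependency_degree(rows))
```

With a full dependency the city predicate adds no extra filtering, so the
independence estimate comes out too low.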

From a non-dev point of view it looks like #3 could use the existing
stats on each side of the join. I'm not sure whether you could do
anything meaningful for joins that don't explicitly specify one side of
the join though.
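
A rough sketch of what using the existing stats for #3 might look like,
assuming we have a most-common-values list (value => fraction of rows), a
distinct-value count and a row count for the referencing column. The function
name, the parameters and the example figures are all hypothetical.

```python
def estimate_fk_matches(key, mcv, n_distinct, total_rows):
    """Estimate how many referencing rows match a given key value.

    mcv        -- dict mapping a common value to its fraction of all rows
    n_distinct -- estimated number of distinct values in the column
    total_rows -- estimated number of rows in the referencing table
    """
    if key in mcv:
        # "Top-heavy" case: the key is one of the known common values.
        return mcv[key] * total_rows
    # Otherwise spread the remaining rows evenly over the remaining values.
    remaining_fraction = 1.0 - sum(mcv.values())
    remaining_distinct = n_distinct - len(mcv)
    if remaining_distinct <= 0:
        return 0.0
    return remaining_fraction * total_rows / remaining_distinct


if __name__ == "__main__":
    # Hypothetical stats: one customer owns 30% of a million orders.
    mcv = {42: 0.30, 7: 0.10}
    print(estimate_fk_matches(42, mcv, n_distinct=10_002, total_rows=1_000_000))
    print(estimate_fk_matches(999, mcv, n_distinct=10_002, total_rows=1_000_000))
```

This only helps when the key value is known at plan time; without it there is
nothing to look up in the list.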

> Perhaps just knowing whether there's a dependence between two data sets
> might be somewhat useful if the planner kept a confidence value for all its
> estimates. It would know to have a lower confidence value for estimates coming
> from highly dependent clauses? It wouldn't be very easy for the planner to
> distinguish "safe" plans for low confidence estimates and "risky" plans which
> might blow up if the estimates are wrong though. And of course that's a lot
> less interesting than just getting better estimates :)

If we could abort a plan and restart, then we could just try the
quick-but-risky plan and, if we reach 50 rows rather than the expected 10,
try a different approach. That way we'd not need to gather stats, just
react to the situation in individual queries.
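
As a toy sketch of the abort-and-restart idea (nothing like a real executor):
plans are modelled as Python callables returning row iterators, and rows are
buffered so that nothing reaches the client before we know whether the risky
plan will be abandoned. All names here, and the factor of 5 taken from the
10-versus-50 example, are assumptions for illustration.

```python
class EstimateExceeded(Exception):
    """Raised when a plan produces far more rows than the planner expected."""


def run_with_row_limit(plan, abort_at):
    """Collect rows from plan(), aborting once abort_at rows have been seen."""
    rows = []
    for row in plan():
        rows.append(row)
        if len(rows) >= abort_at:
            raise EstimateExceeded(f"reached {abort_at} rows")
    return rows


def execute_adaptively(risky_plan, safe_plan, expected_rows, factor=5):
    """Try the quick-but-risky plan; restart with the safe plan on a blow-up."""
    try:
        return run_with_row_limit(risky_plan, expected_rows * factor), "risky"
    except EstimateExceeded:
        # The estimate was badly off: discard partial work and start over.
        return list(safe_plan()), "safe"


if __name__ == "__main__":
    few = lambda: iter(range(8))      # close to the expected 10 rows
    many = lambda: iter(range(100))   # estimate was badly wrong
    print(execute_adaptively(few, many, expected_rows=10)[1])
    print(execute_adaptively(many, many, expected_rows=10)[1])
```

The cost of a wrong guess is the discarded work of the risky plan up to the
abort threshold.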

--
Richard Huxton
Archonet Ltd
