From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|

To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |

Subject: | Re: Using multiple extended statistics for estimates |

Date: | 2019-11-06 19:54:40 |

Message-ID: | 20191106195440.wtxsyhiwdidl37hv@development |

Views: | Raw Message | Whole Thread | Download mbox | Resend email |

Thread: | |

Lists: | pgsql-hackers |

On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:

>Hi,

>

>PostgreSQL 10 introduced extended statistics, allowing us to consider

>correlation between columns to improve estimates, and PostgreSQL 12

>added support for MCV statistics. But we still had the limitation that

>we only allowed using a single extended statistics per relation, i.e.

>given a table with two extended stats

>

> CREATE TABLE t (a int, b int, c int, d int);

> CREATE STATISTICS s1 (mcv) ON a, b FROM t;

> CREATE STATISTICS s2 (mcv) ON c, d FROM t;

>

>and a query

>

> SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

>

>we only ever used one of the statistics (and we considered them in a not

>particularly well determined order).

>

>This patch addresses this by using as many extended stats as possible,

>by adding a loop to statext_mcv_clauselist_selectivity(). In each step

>we pick the "best" applicable statistics (in the sense of covering the

>most attributes) and factor it into the oveall estimate.

>

>All this happens where we'd originally consider applying a single MCV

>list, i.e. before even considering the functional dependencies, so

>roughly like this:

>

> while ()

> {

> ... apply another MCV list ...

> }

>

> ... apply functional dependencies ...

>

>

>I've both in the loop, but I think that'd be wrong - the MCV list is

>expected to contain more information about individual values (compared

>to functional deps, which are column-level).

>

Here is a slightly polished v2 of the patch, the main difference being

that computing clause_attnums was moved to a separate function.

This is a fairly simple patch, and it's not entirely new functionality

(applying multiple statistics was part of the very first patch seris,

although of course in a very different form). So unless there are

objections, I'd like to get this committed sometime next week.

There's room for improvement, of course, for example when handling

overlapping statistics. Consider a table with columns (a,b,c) and two

extended statistics on (a,b) and (b,c), and query with one clause per

column

SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1

In this case the patch does not help, because we apply (a,b) and then we

have just a single clause remaining. What we could do is still apply the

(b,c) statistic, using the already-estimated clause on b as a condition.

So essentially we'd compute

P(a=1 && b=1) * P(c=1 | b=1)

But that'll require larger changes, and I see it as an evolution of the

current patch.

regards

--

Tomas Vondra http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

- Using multiple extended statistics for estimates at 2019-10-28 15:20:48 from Tomas Vondra

- Re: Using multiple extended statistics for estimates at 2019-11-06 19:58:49 from Tomas Vondra

From | Date | Subject | |
---|---|---|---|

Next Message | Tomas Vondra | 2019-11-06 19:58:49 | Re: Using multiple extended statistics for estimates |

Previous Message | Tom Lane | 2019-11-06 19:48:29 | Re: define bool in pgtypeslib_extern.h |