From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-10-29 19:23:34
Message-ID: 61e71067-9461-d785-b4a6-6e8a08996d5f@2ndquadrant.com
Lists: pgsql-hackers

Hi,

Attached is v20 of the multivariate statistics patch series, doing

mostly the changes outlined in the preceding e-mail from October 11.

The patch series currently has these parts:

* 0001 : (FIX) teach pull_varno about RestrictInfo

* 0002 : (PATCH) shared infrastructure and ndistinct coefficients

* 0003 : (PATCH) functional dependencies (only the ANALYZE part)

* 0004 : (PATCH) selectivity estimation using functional dependencies

* 0005 : (PATCH) multivariate MCV lists

* 0006 : (PATCH) multivariate histograms

* 0007 : (WIP) selectivity estimation using ndistinct coefficients

* 0008 : (WIP) use multiple statistics for estimation

* 0009 : (WIP) psql tab completion basics

Let me elaborate on the main changes in this version:

1) rework CREATE STATISTICS to what Dean Rasheed proposed in [1]:

-----------------------------------------------------------------------

CREATE STATISTICS name WITH (options) ON (columns) FROM table

This allows adding support for statistics on joins, expressions

referencing multiple tables, and partial statistics (with WHERE

predicates, similar to indexes). Although those things are not
implemented yet (and I don't know if/when that will happen), it's good
that the syntax supports them.

I've been thinking about using "CREATE STATISTIC" instead, but I decided

to stick with "STATISTICS" for two reasons. Firstly it's possible to

create multiple statistics in a single command, for example by using

WITH (mcv,histogram). And secondly, we already have "ALTER TABLE ... SET

STATISTICS n" (although that tweaks the statistics target for a column,

not the statistics on the column).

2) no changes to catalog names

-----------------------------------------------------------------------

Clearly, naming things is one of the hardest things in computer science.

I don't have a good idea what names would be better than the current

ones. In any case, this is fairly trivial to do.

3) special data types for statistics

-----------------------------------------------------------------------

Heikki proposed to invent a new data type, similar to pg_node_tree. I do

agree that storing the stats in plain bytea (i.e. catalog having bytea

columns) was not particularly convenient, but I'm not sure how much of

pg_node_tree Heikki wanted to copy.

In particular, I'm not sure whether Heikki's idea was to store all the

statistics together in a single Datum, serialized into a text string

(similar to pg_node_tree).

I don't think that would be a good idea, as the statistics may be quite

large and complex, and deserializing them from text format would be

quite expensive. For pg_node_tree that's not a major issue because the

values are usually fairly small. Similarly, packing everything into a

single datum would force the planner to parse/unpack everything, even if

it needs just a small piece (e.g. the ndistinct coefficients, but not

histograms).

So I've decided to invent new data types, one for each statistic type:

* pg_ndistinct

* pg_dependencies

* pg_mcv_list

* pg_histogram

Similarly to pg_node_tree those data types only support output, i.e.

both 'recv' and 'in' functions do elog(ERROR). But while pg_node_tree is

stored as text, those new data types are still bytea.

I do believe this is a good solution, and it allows casting the data

types to text easily, as it simply calls the out function.

The statistics however do not store attnums in the bytea, just indexes

into pg_mv_statistic.stakeys. That means the out functions can't print

column names in the output, or values (because without the attnum we

don't know the type, and thus can't lookup the proper out function).

I don't think there's a good solution for that (I was thinking about

storing the attnums/typeoid in the statistics itself, but that seems

fairly ugly). And I'm quite happy with those new data types.

4) replace functional dependencies with ndistinct (in the first patch)

-----------------------------------------------------------------------

As the ndistinct coefficients are simpler than functional dependencies,
I've decided to use them in the first patch in the series, which

implements the shared infrastructure. This does not mean throwing away

functional dependencies entirely, just moving them to a later patch.

5) rework of ndistinct coefficients

-----------------------------------------------------------------------

The ndistinct coefficients were also significantly reworked. Instead of

computing and storing the value for the exact combination of attributes,

the new version computes ndistinct for all combinations of attributes.

So for example with CREATE STATISTICS x ON (a,b,c) the old patch only

computed ndistinct on (a,b,c), while the new patch computes ndistinct on

{(a,b,c), (a,b), (a,c), (b,c)}. This makes it way more powerful.

The first patch (0002) only uses this in estimate_num_groups to improve

GROUP BY estimates. A later patch (0007) shows how it might be used for

selectivity estimation, but it's a very early WIP at this point.
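To make the GROUP BY improvement concrete, here's a minimal sketch of the idea (the function names and numbers are made up for illustration; this is not code from the patch): without a multivariate statistic the planner has to assume independence and multiply per-column ndistinct values, while with one it can use the stored combined ndistinct directly.

```python
# Hypothetical sketch of how a multivariate ndistinct statistic can
# improve a GROUP BY group-count estimate (illustration only).

def estimate_groups_naive(ndistinct_per_col, nrows):
    """Assume independence: multiply per-column ndistinct, clamp to nrows."""
    est = 1.0
    for nd in ndistinct_per_col:
        est *= nd
    return min(est, nrows)

def estimate_groups_mv(ndistinct_combined, nrows):
    """With a multivariate statistic, use the stored combined value."""
    return min(ndistinct_combined, nrows)

# Example: strongly correlated columns a, b with 100 distinct values each,
# but only 100 distinct (a, b) combinations in a 100000-row table.
naive = estimate_groups_naive([100, 100], 100000)  # 10000 - overestimate
mv = estimate_groups_mv(100, 100000)               # 100 - accurate
```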

Also, I'm not sure we should use ndistinct coefficients this way,
because of the "homogeneity" assumption, similarly to functional
dependencies. Functional dependencies are used only for selectivity
estimation, so it's quite easy not to use them if they don't work for
that purpose. But ndistinct coefficients are also used for GROUP BY
estimation, where the homogeneity assumption is not such a big deal. So I
expect people to add ndistinct, get better GROUP BY estimates but
sometimes worse selectivity estimates - not great, I guess.

But the selectivity estimation using ndistinct coefficients is very

simple right now - in particular it does not use the per-clause

selectivities at all, it simply assumes the whole selectivity is

1/ndistinct for the combination of columns.

Functional dependencies use this formula to combine the selectivities:

P(a,b) = P(a) * [f + (1-f)*P(b)]

so maybe there's something similar for ndistinct coefficients? I mean,
let's say we know ndistinct(a), ndistinct(b), ndistinct(a,b) and P(a)
and P(b). How do we compute P(a,b)?
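To make the question concrete, here's a minimal sketch of the functional-dependency formula quoted above, with made-up numbers (not code from the patch); an ndistinct-based approach would need some analogous way of combining the per-clause selectivities:

```python
def combine_with_dependency(p_a, p_b, f):
    """P(a,b) = P(a) * [f + (1 - f) * P(b)]

    f = 1: b is fully determined by a, so P(a,b) = P(a)
    f = 0: a and b are independent, so P(a,b) = P(a) * P(b)
    """
    return p_a * (f + (1.0 - f) * p_b)

# Illustrative numbers: P(a) = 0.01, P(b) = 0.1, degree f = 0.5
sel = combine_with_dependency(0.01, 0.1, 0.5)  # 0.01 * 0.55 = 0.0055
```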

6) rework functional dependencies

-----------------------------------------------------------------------

Based on Dean's feedback, I've reworked functional dependencies to use

continuous "degree" of validity (instead of true/false behavior,

resulting in sudden changes in behavior).

This significantly reduced the amount of code, because the old patch

tried to identify transitive dependencies (to minimize time and storage

requirements). Switching to continuous degree makes this impossible (or

at least far more complicated), so I've simply ripped all of this out.

This means the statistics will be larger and ANALYZE will take more
time, but the differences are fairly small in practice, and the
estimation actually seems to work better.

7) MCV and histogram changes

-----------------------------------------------------------------------

Those statistics types are mostly unchanged, except for a few minor bug
fixes and the removal of the max_mcv_items and max_buckets options.

Those options were meant to allow users to limit the size of the

statistics, but the implementation was ignoring them so far. So I've

ripped them out, and if needed we may reintroduce them later.

8) no more (elaborate) combinations of statistics

-----------------------------------------------------------------------

I've ripped out the patch that combined multiple statistics in a very
elaborate way - it was overly complex, possibly wrong, and most
importantly it distracted people from the preceding patches. Instead,
I've replaced it with a very simple approach that allows using multiple
statistics on different subsets of the clause list. So for example

WHERE (a=1) AND (b=1) AND (c=1) AND (d=1)

may benefit from two statistics, one on (a,b) and a second on (c,d).
It's a very simple approach, but it does the trick for many cases and is
better than the "single statistics" limitation.
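A minimal sketch of the combination step (illustration only, with made-up numbers; the assumption is that the column groups covered by different statistics are treated as independent of each other):

```python
# Each multivariate statistic yields a selectivity for the clauses it
# covers; estimates for disjoint column subsets are simply multiplied,
# assuming independence between the subsets.

def combined_selectivity(sel_ab, sel_cd):
    return sel_ab * sel_cd

# WHERE (a=1) AND (b=1) AND (c=1) AND (d=1),
# with one statistic on (a,b) and a second on (c,d):
sel = combined_selectivity(0.05, 0.02)  # 0.001
```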

The 0008 patch is actually very simple, essentially adding just a loop

into the code blocks, so I think it's quite likely this will get merged

into the preceding patches.

9) reduce table sizes used in regression tests

-----------------------------------------------------------------------

Some of the regression tests used quite large tables (with up to 1M
rows), which had two issues - long runtimes and instability (because the
ANALYZE sample is only 30k rows, there were sometimes small changes due
to picking a different sample). I've limited the table sizes to 30k
rows.

10) open / unsolved questions

-----------------------------------------------------------------------

The main open question is still whether clausesel.c is the best place to

do all the heavy lifting (particularly matching clauses and statistics,

and deciding which statistics to use). I suspect some of that should be

done elsewhere (earlier in the planning), enriching the query tree

somehow. Then clausesel.c would "only" compute the estimates, and it

would also allow showing the info in EXPLAIN.

I'm not particularly happy with how the changes in clauselist_selectivity
look right now - there are three almost identical blocks, so this would
deserve some refactoring. But I'd like to get some feedback first.

regards

[2]

https://www.postgresql.org/message-id/1c7e4e63-769b-f8ce-f245-85ef4f59fcba%40iki.fi

--

Tomas Vondra http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment | Content-Type | Size
---|---|---
multivariate-stats-v20.tgz | application/x-compressed-tar | 140.6 KB
