From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: cost_sort() improvements
Date: 2018-07-12 15:00:27
Message-ID: 72f7e61e-95ad-ee6a-ba62-a500dd089f1a@sigaev.ru


Lists: pgsql-hackers

> One more thought about estimating the worst case - I wonder if simply
> multiplying the per-tuple cost by 1.5 is the right approach. It does not
> seem particularly principled, and it's trivially simple to construct
> counter-examples defeating it (imagine columns with 99% of the rows
> having the same value, and then many values in exactly one row - that
> will defeat the ndistinct-based group size estimates).

Agree. But right now that is the best we have, and the constant 1.5 could
easily be changed to 1 or 10 - it is just an arbitrary value, chosen
intuitively. As I mentioned in the v7 patch description, the new estimation
function gives ~10% bigger estimates, so this constant should not be very
large, because we could easily end up overestimating.
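
For concreteness, here is a minimal standalone sketch of the ndistinct-based
estimate being discussed - not the actual cost_sort() code, and all numbers
below are made up for illustration:

    /*
     * Sketch only: for each sort column i, take the number of groups
     * G(i-1) formed by the preceding columns (estimated from ndistinct),
     * assume tuples are spread evenly across those groups, and pad the
     * result by the arbitrary pessimism constant (1.5 here).
     */
    #include <stdio.h>

    #define PESSIMISM 1.5       /* the arbitrary constant under discussion */

    /* padded average size of a group at column i, given ntuples and G(i-1) */
    static double
    avg_group_size(double ntuples, double prefix_ndistinct)
    {
        return PESSIMISM * (ntuples / prefix_ndistinct);
    }

    int
    main(void)
    {
        double      ntuples = 1000000.0;
        /* hypothetical ndistinct values of the column prefixes G(0)..G(2) */
        double      prefix_ndistinct[] = {1.0, 100.0, 5000.0};

        for (int i = 0; i < 3; i++)
            printf("column %d: padded group size %.1f\n",
                   i + 1, avg_group_size(ntuples, prefix_ndistinct[i]));
        return 0;
    }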

>
> The reason why constructing such counter-examples is so simple is that
> this relies entirely on the ndistinct stats, ignoring the other stats we
> already have. I think this might leverage the per-column MCV lists (and
> eventually multi-column MCV lists, if that ever gets committed).

>
> We're estimating the number of tuples in a group (after fixing values in
> the preceding columns), because that's what determines the number of
> comparisons we need to make at a given stage. The patch does this by
> estimating the average group size, by estimating ndistinct of the
> preceding columns G(i-1), and computing ntuples/G(i-1).

>
> But consider that we also have MCV lists for each column - if there is a
> very common value, it should be there. So let's say M(i) is the frequency
> of the most common value in the i-th column, determined either from the
> MCV list or as 1/ndistinct for that column. Then if m(i) is the minimum
> of M(i) for the first i columns, the largest possible group of values
> when comparing values in the i-th column is
>
> N * m(i-1)
>
> This seems like a better way to estimate the worst case. I don't know if
> this should be used instead of N/G(i), or if those two estimates should
> be combined in some way.

Agree, we definitely need an estimate of the largest group size to use in
cost_sort(). But I don't feel up to doing that myself - could you please
attack the group-size computation issue? Thank you.
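
For concreteness, a standalone sketch of the proposed MCV-based worst case,
under my reading of the formula above - an illustration, not patch code:

    /*
     * Sketch only: M(i) is the frequency of the most common value of
     * column i (taken from the MCV list, or 1/ndistinct as a fallback);
     * m(i) is the minimum of M(1)..M(i); the largest possible group when
     * comparing values in column `col` is then N * m(col-1).
     */
    #include <stdio.h>

    static double
    largest_group(double ntuples, const double *mcv_freq,
                  const double *ndistinct, int col)
    {
        double      m = 1.0;    /* m(0): one big group before any column is fixed */

        for (int i = 0; i < col - 1; i++)   /* minimum over the preceding columns */
        {
            double      M = mcv_freq[i] > 0.0 ? mcv_freq[i] : 1.0 / ndistinct[i];

            if (M < m)
                m = M;
        }
        return ntuples * m;     /* N * m(col-1) */
    }

    int
    main(void)
    {
        double      ntuples = 1000000.0;
        double      mcv_freq[] = {0.99, 0.0, 0.10};     /* 0.0 = no MCV list */
        double      ndistinct[] = {10.0, 1000.0, 50.0};

        for (int col = 1; col <= 3; col++)
            printf("column %d: worst-case group %.0f tuples\n",
                   col, largest_group(ntuples, mcv_freq, ndistinct, col));
        return 0;
    }

With the 99%-common first column from the counter-example above, this bound
stays near N until a more selective column enters the prefix - exactly the
case the average-group estimate misses.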

>
> I think estimating the largest group we need to sort should be helpful
> for the incremental sort patch, so I'm adding Alexander to this thread.

Agree, I think so. The suggested estimation algorithm should be modified to
support leading preordered keys, and in that form it could be used directly
in the GROUP BY and incremental sort patches.
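
A hypothetical sketch of that adjustment, reusing the same m() idea - this
is only my guess at the shape of the formula, not anything from the patches:
if the first k keys are already ordered, incremental sort only sorts one
prefix group at a time, so the largest run it must sort is bounded by
N * m(k).

    /* Sketch only: bound on the largest chunk incremental sort must sort
     * when the first k sort keys arrive already ordered. */
    #include <stdio.h>

    /* m(k): minimum most-common-value frequency over the first k columns */
    static double
    min_common_freq(const double *mcv_freq, const double *ndistinct, int k)
    {
        double      m = 1.0;

        for (int i = 0; i < k; i++)
        {
            double      M = mcv_freq[i] > 0.0 ? mcv_freq[i] : 1.0 / ndistinct[i];

            if (M < m)
                m = M;
        }
        return m;
    }

    int
    main(void)
    {
        double      ntuples = 1000000.0;
        double      mcv_freq[] = {0.99, 0.0};       /* 0.0 = no MCV list */
        double      ndistinct[] = {10.0, 1000.0};

        /* worst-case chunk with k=2 preordered leading keys: N * m(2) */
        printf("worst-case chunk: %.0f tuples\n",
               ntuples * min_common_freq(mcv_freq, ndistinct, 2));
        return 0;
    }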

--
Teodor Sigaev
E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

