From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|

To: | Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |

Subject: | Re: cost_sort() improvements |

Date: | 2018-07-22 20:13:39 |

Message-ID: | b80be535-273f-b026-575b-c6fdc4f33e00@2ndquadrant.com |

Views: | Raw Message | Whole Thread | Download mbox |

Thread: | |

Lists: | pgsql-hackers |

On 07/12/2018 05:00 PM, Teodor Sigaev wrote:

>

>> One more thought about estimating the worst case - I wonder if simply

>> multiplying the per-tuple cost by 1.5 is the right approach. It does not

>> seem particularly principled, and it's trivial simple to construct

>> counter-examples defeating it (imagine columns with 99% of the rows

>> having the same value, and then many values in exactly one row - that

>> will defeat the ndistinct-based group size estimates).

>

> Agree. But right now that is best what we have. and constant 1.5 easily

> could be changed to 1 or 10 - it is just arbitrary value, intuitively

> chosen. As I mentioned in v7 patch description, new estimation function

> provides ~10% bigger estimation and this constant should not be very

> large, because we could easily get overestimation.

>

>>

>> The reason why constructing such counter-examples is so simple is that

>> this relies entirely on the ndistinct stats, ignoring the other stats we

>> already have. I think this might leverage the per-column MCV lists (and

>> eventually multi-column MCV lists, if that ever gets committed).

>>

>> We're estimating the number of tuples in group (after fixing values in

>> the preceding columns), because that's what determines the number of

>> comparisons we need to make at a given stage. The patch does this by

>> estimating the average group size, by estimating ndistinct of the

>> preceding columns G(i-1), and computing ntuples/G(i-1).

>>

>> But consider that we also have MCV lists for each column - if there is a

>> very common value, it should be there. So let's say M(i) is a frequency

>> of the most common value in i-th column, determined either from the MCV

>> list or as 1/ndistinct for that column. Then if m(i) is minimum of M(i)

>> for the fist i columns, then the largest possible group of values when

>> comparing values in i-th column is

>>

>> N * m(i-1)

>>

>> This seems like a better way to estimate the worst case. I don't know if

>> this should be used instead of NG(i), or if those two estimates should

>> be combined in some way.

> Agree, definitely we need an estimation of larger group size to use it

> in cost_sort. But I don't feel power to do that, pls, could you attack a

> computing group size issue? Thank you.

>

Attached is a simple patch illustrating this MCV-based approach. I don't

claim it's finished / RFC, but hopefully it's sufficient to show what I

meant. I'm not sure how to plug it into the v8 of your patch at this

point, so I've simply added an elog to estimate_num_groups to also print

the estimate or largest group for GROUP BY queries.

It's largely a simplified copy of estimate_num_groups() and there's a

couple of comments of how it might be improved further.

>>

>> I think estimating the largest group we need to sort should be helpful

>> for the incremental sort patch, so I'm adding Alexander to this

>> thread.Agree

> I think so. suggested estimation algorithm should be modified to support

> leading preordered keys and this form could be directly used in GROUP BY

> and incremental sort patches.

Right. What I think the function estimating the group size could do in

case of incremental sort is producing two values - maximum size of the

leading group, and maximum group size overall. The first one would be

useful for startup cost, the second one for total cost.

regards

--

Tomas Vondra http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment | Content-Type | Size |
---|---|---|

estimate-largest-group.patch | text/x-patch | 10.1 KB |

- Re: cost_sort() improvements at 2018-07-12 15:00:27 from Teodor Sigaev

From | Date | Subject | |
---|---|---|---|

Next Message | Tomas Vondra | 2018-07-22 20:22:10 | Re: cost_sort() improvements |

Previous Message | Fabien COELHO | 2018-07-22 20:13:20 | Re: pgbench: improve --help and --version parsing |