From: | Tomas Vondra <tv(at)fuzzy(dot)cz> |
---|---|

To: | pgsql-hackers(at)postgresql(dot)org |

Subject: | PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3) |

Date: | 2014-12-07 01:54:14 |

Message-ID: | 5483B346.6000203@fuzzy.cz |

Views: | Raw Message | Whole Thread | Download mbox | Resend email |

Thread: | |

Lists: | pgsql-hackers pgsql-performance |

Hi!

This was initially posted to pgsql-performance in this thread:

http://www.postgresql.org/message-id/5472416C.3080506@fuzzy.cz

but pgsql-hackers seems like a more appropriate place for further

discussion.

Anyways, attached is v3 of the patch implementing the adaptive ndistinct

estimator. Just like the previous version, the original estimate is the

one stored/used, and the alternative one is just printed, to make it

possible to compare the results.

Changes in this version:

1) implementing compute_minimal_stats

- So far only the 'scalar' (more common) case was handled.

- The algorithm requires more detailed input data, the MCV-based

stats insufficient, so the code hashes the values and then

determines the f1, f2, ..., fN coefficients by sorting and

walking the array of hashes.

2) handling wide values properly (now are counted into f1)

3) compensating for NULL values when calling optimize_estimate

- The estimator has no notion of NULL values, so it's necessary to

remove them both from the total number of rows, and sampled rows.

4) some minor fixes and refactorings

I also repeated the tests comparing the results to the current estimator

- full results are at the end of the post.

The one interesting case is the 'step skew' with statistics_target=10,

i.e. estimates based on mere 3000 rows. In that case, the adaptive

estimator significantly overestimates:

values current adaptive

------------------------------

106 99 107

106 8 6449190

1006 38 6449190

10006 327 42441

I don't know why I didn't get these errors in the previous runs, because

when I repeat the tests with the old patches I get similar results with

a 'good' result from time to time. Apparently I had a lucky day back

then :-/

I've been messing with the code for a few hours, and I haven't found any

significant error in the implementation, so it seems that the estimator

does not perform terribly well for very small samples (in this case it's

3000 rows out of 10.000.000 (i.e. ~0.03%).

However, I've been able to come up with a simple way to limit such

errors, because the number of distinct values is naturally bounded by

(totalrows / samplerows) * ndistinct

where ndistinct is the number of distinct values in the sample. This

essentially means that if you slice the table into sets of samplerows

rows, you get different ndistinct values.

BTW, this also fixes the issue reported by Jeff Janes on 21/11.

With this additional sanity check, the results look like this:

values current adaptive

------------------------------

106 99 116

106 8 23331

1006 38 96657

10006 327 12400

Which is much better, but clearly still a bit on the high side.

So either the estimator really is a bit unstable for such small samples

(it tends to overestimate a bit in all the tests), or there's a bug in

the implementation - I'd be grateful if someone could peek at the code

and maybe compare it to the paper describing the estimator. I've spent a

fair amount of time analyzing it, but found nothing.

But maybe the estimator really is unstable for such small samples - in

that case we could probably use the current estimator as a fallback.

After all, this only happens when someone explicitly decreases the

statistics target to 10 - with the default statistics target it's damn

accurate.

kind regards

Tomas

statistics_target = 10

======================

a) smooth skew, 101 values, different skew ('k')

- defaults to the current estimator

b) smooth skew, 10.001 values, different skew ('k')

k current adaptive

-----------------------

1 10231 11259

2 6327 8543

3 4364 7707

4 3436 7052

5 2725 5868

6 2223 5071

7 1979 5011

8 1802 5017

9 1581 4546

c) step skew (different numbers of values)

values current adaptive

------------------------------

106 99 107

106 8 6449190

1006 38 6449190

10006 327 42441

patched:

values current adaptive

------------------------------

106 99 116

106 8 23331

1006 38 96657

10006 327 12400

statistics_target = 100

=======================

a) smooth skew, 101 values, different skew ('k')

- defaults to the current estimator

b) smooth skew, 10.001 values, different skew ('k')

k current adaptive

-----------------------------

1 10011 10655

2 9641 10944

3 8837 10846

4 8315 10992

5 7654 10760

6 7162 10524

7 6650 10375

8 6268 10275

9 5871 9783

c) step skew (different numbers of values)

values current adaptive

------------------------------

106 30 70

1006 271 1181

10006 2804 10312

statistics_target = 1000

========================

a) smooth skew, 101 values, different skew ('k')

- defaults to the current estimator

b) smooth skew, 10.001 values, different skew ('k')

k current adaptive

---------------------------

3 10001 10002

4 10000 10003

5 9996 10008

6 9985 10013

7 9973 10047

8 9954 10082

9 9932 10100

c) step skew (different numbers of values)

values current adaptive

------------------------------

106 105 113

1006 958 1077

10006 9592 10840

Attachment | Content-Type | Size |
---|---|---|

ndistinct-estimator-v3.patch | text/x-diff | 12.1 KB |

- Re: Yet another abort-early plan disaster on 9.3 at 2014-11-23 20:19:56 from Tomas Vondra

- Re: PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3) at 2014-12-23 10:28:42 from Heikki Linnakangas
- Re: PATCH: adaptive ndistinct estimator v4 at 2015-03-31 19:02:29 from Tomas Vondra

From | Date | Subject | |
---|---|---|---|

Next Message | Tomas Vondra | 2014-12-07 03:08:28 | PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching |

Previous Message | Kouhei Kaigai | 2014-12-06 23:21:58 | Re: [v9.5] Custom Plan API |

From | Date | Subject | |
---|---|---|---|

Next Message | Tim Dudgeon | 2014-12-07 22:59:38 | querying with index on jsonb slower than standard column. Why? |

Previous Message | Bruce Momjian | 2014-12-06 13:08:28 | Re: intel s3500 -- hot stuff |