From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-10 17:53:36
Message-ID: 54381D20.9010501@fuzzy.cz


Lists: pgsql-hackers pgsql-performance

On 10.10.2014 14:10, Tomas Vondra wrote:

> On 10 October 2014, 13:16, Greg Stark wrote:
>> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>>> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
>>> can get a MUCH more accurate n_distinct estimate using multiple
>>> algorithms, of which HLL is one. While n_distinct will still have some
>>> variance, it'll be over a much smaller range.
>>
>> I've gone looking for papers on this topic but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50% of
>> the table, and that's effectively the same as reading the entire table
>> -- and it still had pretty poor results. All the research I could find
>> went into how to analyze the whole table while using a reasonable
>> amount of scratch space, and how to do it incrementally.
>
> I think it's really difficult to discuss the estimation without some basic
> agreement on what the goals are. Naturally, we can't get a perfect
> estimator with small samples (especially when the sample size is fixed and
> does not scale with the table). But maybe we can improve the estimates
> without scanning most of the table?
>
> FWIW I've been playing with the adaptive estimator described in [1] and
> the results look really interesting, IMHO. So far I've been testing it on
> synthetic datasets outside the database, but I plan to use it in place of
> our estimator and do some more tests.

Attached is an experimental patch implementing the adaptive estimator.
It was fairly simple to write (although it's a bit messy). It only
computes the estimates for the "scalar" case (i.e. data types that we
can sort). Implementing it for the "minimal" case is possible, but
requires a bit more work.

For now it only computes the new estimate and prints a WARNING showing
both the current and the new value; the old estimate is still what gets
stored.
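For reference: if I read analyze.c correctly, the "current" estimate
that the WARNING compares against comes from the Haas & Stokes Duj1
estimator. A minimal Python sketch of it, just for illustration (the
function name and shape are mine, not code from the patch):

```python
def duj1_estimate(sample_rows, total_rows, sample_distinct, singletons):
    """Haas & Stokes Duj1 estimator (roughly what ANALYZE uses).

    sample_rows     -- n, number of rows in the sample
    total_rows      -- N, number of rows in the table
    sample_distinct -- d, distinct values seen in the sample
    singletons      -- f1, values seen exactly once in the sample
    """
    n, N, d, f1 = sample_rows, total_rows, sample_distinct, singletons
    # Duj1 = n*d / (n - f1 + f1*n/N); with no singletons the sample has
    # probably seen every value, and the estimate collapses to d.
    return n * d / (n - f1 + f1 * n / N)

# No singletons -> the estimate equals the sample distinct count.
print(duj1_estimate(30000, 10_000_000, 101, 0))   # -> 101.0
```

The key weakness is visible in the formula: the estimate is driven almost
entirely by the singleton count f1, which is exactly what skewed
distributions distort in a small sample.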

I also attach a few examples of synthetic datasets with values
distributed in various ways, which I used for testing. In all cases
there's a single table with 10M rows and a single INT column.

There are two kinds of skew:

1) smooth skew

   - N distinct values (100, 10.000 and 100.000 values)
   - the average moves towards 0 as 'k' increases ('k' between 1 and 9)
   - smooth distribution of frequencies

   INSERT INTO test
        SELECT pow(random(), k) * 10000 FROM generate_series(1,10000000);

2) step skew

   - a few very frequent values, many rare values
   - for example this generates 5 very frequent and ~10k rare values

   INSERT INTO test
        SELECT (CASE WHEN (v < 90000) THEN MOD(v, 5) ELSE v END)
          FROM (SELECT (random() * 100000)::int AS v
                  FROM generate_series(1, 10000000)) foo;
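A quick sanity check of the step-skew construction, rendered in Python
and scaled down to 1M rows for speed (minor difference: Postgres'
`::int` rounds while Python's `int()` truncates, which doesn't change
the shape of the distribution):

```python
import random

random.seed(1)  # deterministic toy run

# Mimic: SELECT (random()*100000)::int FROM generate_series(1, 1000000)
rows = [int(random.random() * 100000) for _ in range(1_000_000)]

# Mimic: CASE WHEN v < 90000 THEN MOD(v, 5) ELSE v END
stepped = [v % 5 if v < 90000 else v for v in rows]

distinct = set(stepped)
frequent = sum(1 for v in distinct if v < 5)   # the very frequent values
rare = len(distinct) - frequent                # the long tail

print(frequent, rare)  # 5 frequent values and ~10k rare ones
```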

Results
=======

I tested this with various statistics target settings (10, 100, 1000),
which translate to different sample sizes.

statistics target 100 (default, 30k rows, 0.3% sample)
======================================================

a) smooth skew, 101 values, different skew ('k')

    k    current    adaptive
   --------------------------
    1        101         102
    3        101         102
    5        101         102
    7        101         102
    9        101         102

b) smooth skew, 10.001 values, different skew ('k')

    k    current    adaptive
   --------------------------
    1       9986       10542
    3       8902       10883
    5       7579       10824
    7       6639       10188
    9       5947       10013

c) step skew (different numbers of values)

   values    current    adaptive
   ------------------------------
      106        106         107
      106         35         104
     1006        259        1262
    10006       2823       11047

statistics target 10 (3k rows, 0.03% sample)
============================================

a) smooth skew, 101 values, different skew ('k')

    k    current    adaptive
   --------------------------
    1        101         102
    3        101         102
    5        101         102
    7        101         102
    9        101         102

b) smooth skew, 10.001 values, different skew ('k')

    k    current    adaptive
   --------------------------
    1       9846       10014
    3       4399        7190
    5       2532        5477
    7       1938        4932
    9       1623        1623

c) step skew (different numbers of values)

   values    current    adaptive
   ------------------------------
      106        100         114
      106          5           5
     1006         37         532
    10006        323       20970

statistics target 1000 (300k rows, 3% sample)
=============================================

a) smooth skew, 101 values, different skew ('k')

    k    current    adaptive
   --------------------------
    1        101         102
    3        101         102
    5        101         102
    7        101         102
    9        101         102

b) smooth skew, 10.001 values, different skew ('k')

    k    current    adaptive
   --------------------------
    1      10001       10002
    3      10000       10000
    5       9998       10011
    7       9973       10045
    9       9939       10114

c) step skew (different numbers of values)

   values    current    adaptive
   ------------------------------
      106        106         107
      106        101         107
     1006        957        1096
    10006       9551       10550

Summary
=======

I'm yet to see an example where the adaptive estimator produces worse
results than the current estimator, irrespective of the distribution
and sample size / statistics target.

What really matters is the sample size relative to the table size, so
I'll refer to the 0.03%, 0.3% and 3% samples instead of the statistics
targets.

For the large sample (3%) both estimators produce reasonably accurate
results. This however won't hold as the tables grow, because the number
of rows we sample is static (it does not grow with the table).

As the sample shrinks, the adaptive estimator starts winning. For the
0.3% sample the difference is easily 3x for the high-skew cases. E.g.
for one of the "step skew" distributions the actual ndistinct value is
10006; the current estimator gives 2823, while the adaptive one gives
11047. That's a ratio error of ~3.5x vs. ~1.1x.
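By "ratio error" I mean the usual symmetric measure, i.e. how many
times off the estimate is, regardless of direction. Checking the
numbers above:

```python
def ratio_error(actual, estimate):
    # Symmetric error: max of over- and under-estimation factor.
    return max(actual / estimate, estimate / actual)

print(round(ratio_error(10006, 2823), 2))   # current estimator, ~3.54
print(round(ratio_error(10006, 11047), 2))  # adaptive estimator, ~1.1
```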

For the tiny 0.03% sample the difference gets even more significant,
especially for the step-skew cases, where the improvement is often an
order of magnitude.

Proposal
========

I think the adaptive estimator works very well, and I plan to submit it
to the next commitfest after a bit of polishing. Examples of
distributions that break it are of course welcome.

Also, I think it's clear it'd be useful to be able to scale the sample
proportionally to the table (instead of only allowing the current
statistics target approach). I understand this may result in scanning a
large part of the table, but I don't see a problem with that as long as
it's not the default behavior (and is clearly documented). I don't see
a way around it - small samples simply result in poor estimates.
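A toy illustration of why (my own simulation, not part of the patch):
even with 10.000 equally frequent values - the friendliest possible
distribution - a 3k-row sample contains only a fraction of the values,
so any estimator starts from a badly truncated view of the data.

```python
import random

random.seed(42)  # deterministic toy run

ndistinct = 10_000      # distinct values in the hypothetical table
sample_size = 3_000     # roughly the 0.03% sample (statistics target 10)

# Uniform draw: every value equally likely, i.e. no skew at all.
sample = [random.randrange(ndistinct) for _ in range(sample_size)]
seen = len(set(sample))

# Expected coverage: ndistinct * (1 - (1 - 1/ndistinct)**sample_size),
# i.e. ~2592 - even here the sample hides roughly 3/4 of the values.
print(seen)
```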

I was thinking this could be driven by the statistics target, but it's
already used for other things (e.g. the size of the MCV list and the
histogram) and we don't want to break those by piling yet another
function onto it.

Ideas?

regards

Tomas

Attachment | Content-Type | Size
---|---|---
ndistinct-estimator.patch | text/x-diff | 4.8 KB
ndistinct.sql | application/sql | 3.7 KB

In response to:

- Re: Yet another abort-early plan disaster on 9.3 at 2014-10-10 12:10:13 from Tomas Vondra

Responses:

- Re: Yet another abort-early plan disaster on 9.3 at 2014-11-21 18:38:27 from Jeff Janes
