Quick Links

Re: Yet another abort-early plan disaster on 9.3

From:	"Tomas Vondra" <tv(at)fuzzy(dot)cz>
To:	"Greg Stark" <stark(at)mit(dot)edu>
Cc:	"Josh Berkus" <josh(at)agliodbs(dot)com>, "Peter Geoghegan" <peter(dot)geoghegan86(at)gmail(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Merlin Moncure" <mmoncure(at)gmail(dot)com>, "postgres performance list" <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: Yet another abort-early plan disaster on 9.3
Date:	2014-10-10 12:10:13
Message-ID:	f4c06746b31e468f9550fb9924798da3.squirrel@2.emaily.eu
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers pgsql-performance

Dne 10 Říjen 2014, 13:16, Greg Stark napsal(a):
> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
>> can get a MUCH more accurate n_distinct estimate using multiple
>> algorithms, of which HLL is one. While n_distinct will still have some
>> variance, it'll be over a much smaller range.
>
> I've gone looking for papers on this topic but from what I read this
> isn't so. To get any noticeable improvement you need to read 10-50% of
> the table and that's effectively the same as reading the entire table
> -- and it still had pretty poor results. All the research I could find
> went into how to analyze the whole table while using a reasonable
> amount of scratch space and how to do it incrementally.

I think it's really difficult to discuss the estimation without some basic
agreement on what are the goals. Naturally, we can't get a perfect
estimator with small samples (especially when the sample size is fixed and
not scaling with the table). But maybe we can improve the estimates
without scanning most of the table?

FWIW I've been playing with the adaptive estimator described in [1] and
the results looks really interesting, IMHO. So far I was testing it on
synthetic datasets outside the database, but I plan to use it instead of
our estimator, and do some more tests.

Would be helpful to get a collection of test cases that currently perform
poorly. I have collected a few from the archives, but if those who follow
this thread can provide additional test cases / point to a thread
describing related etc. that'd be great.

It certainly won't be perfect, but if it considerably improves the
estimates then I believe it's step forward. Ultimately, it's impossible to
improve the estimates without increasing the sample size.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

regards
Tomas

In response to

Re: Yet another abort-early plan disaster on 9.3 at 2014-10-10 11:16:08 from Greg Stark

Responses

Re: Yet another abort-early plan disaster on 9.3 at 2014-10-10 14:21:05 from Craig James
Re: Yet another abort-early plan disaster on 9.3 at 2014-10-10 17:53:36 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Heikki Linnakangas	2014-10-10 12:25:06	Re: Column Redaction
Previous Message	Amit Kapila	2014-10-10 11:48:46	Re: Wait free LW_SHARED acquisition - v0.9

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Craig James	2014-10-10 14:21:05	Re: Yet another abort-early plan disaster on 9.3
Previous Message	Greg Stark	2014-10-10 11:16:08	Re: Yet another abort-early plan disaster on 9.3