Re: Yet another abort-early plan disaster on 9.3

From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "Greg Stark" <stark(at)mit(dot)edu>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Peter Geoghegan" <peter(dot)geoghegan86(at)gmail(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Merlin Moncure" <mmoncure(at)gmail(dot)com>, "postgres performance list" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-10 12:10:13
Message-ID: f4c06746b31e468f9550fb9924798da3.squirrel@2.emaily.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Dne 10 Říjen 2014, 13:16, Greg Stark napsal(a):
> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
>> can get a MUCH more accurate n_distinct estimate using multiple
>> algorithms, of which HLL is one. While n_distinct will still have some
>> variance, it'll be over a much smaller range.
>
> I've gone looking for papers on this topic but from what I read this
> isn't so. To get any noticeable improvement you need to read 10-50% of
> the table and that's effectively the same as reading the entire table
> -- and it still had pretty poor results. All the research I could find
> went into how to analyze the whole table while using a reasonable
> amount of scratch space and how to do it incrementally.

I think it's really difficult to discuss the estimation without some basic
agreement on what are the goals. Naturally, we can't get a perfect
estimator with small samples (especially when the sample size is fixed and
not scaling with the table). But maybe we can improve the estimates
without scanning most of the table?

FWIW I've been playing with the adaptive estimator described in [1] and
the results looks really interesting, IMHO. So far I was testing it on
synthetic datasets outside the database, but I plan to use it instead of
our estimator, and do some more tests.

Would be helpful to get a collection of test cases that currently perform
poorly. I have collected a few from the archives, but if those who follow
this thread can provide additional test cases / point to a thread
describing related etc. that'd be great.

It certainly won't be perfect, but if it considerably improves the
estimates then I believe it's step forward. Ultimately, it's impossible to
improve the estimates without increasing the sample size.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

regards
Tomas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2014-10-10 12:25:06 Re: Column Redaction
Previous Message Amit Kapila 2014-10-10 11:48:46 Re: Wait free LW_SHARED acquisition - v0.9

Browse pgsql-performance by date

  From Date Subject
Next Message Craig James 2014-10-10 14:21:05 Re: Yet another abort-early plan disaster on 9.3
Previous Message Greg Stark 2014-10-10 11:16:08 Re: Yet another abort-early plan disaster on 9.3