Re: ANALYZE sampling is too good

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 02:11:45
Message-ID: CAMkU=1weFZ-k=z2Utu=kTHe7R5eqR45ujWdNVGC+UHU7n+RZNw@mail.gmail.com
Lists: pgsql-hackers

On Tuesday, December 10, 2013, Simon Riggs wrote:

> On 11 December 2013 00:28, Greg Stark <stark(at)mit(dot)edu> wrote:
> > On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >> Block sampling, with parameter to specify sample size. +1
> >
> > Simon this is very frustrating. Can you define "block sampling"?
>
> Blocks selected using Vitter's algorithm, using a parameterised
> fraction of the total.
>

OK, thanks for defining that.

We only need Vitter's algorithm when we don't know in advance how many
items we are sampling from (such as for tuples--unless we want to rely on
the previous estimate for the current round of analysis). But for blocks,
we do know how many there are, so there are simpler ways to pick them.
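To illustrate the distinction (a minimal Python sketch, not PostgreSQL's actual C implementation): when the population size is known up front, as it is for a relation's block count, a plain uniform draw works; Vitter-style reservoir sampling is only needed when items arrive as a stream of unknown length, as with tuples. The function names here are illustrative, not from the PostgreSQL source.

```python
import random

def sample_blocks_known_n(n_blocks, k, rng=random):
    # Total block count is known, so a simple uniform
    # sample without replacement suffices -- no reservoir needed.
    return sorted(rng.sample(range(n_blocks), min(k, n_blocks)))

def reservoir_sample(stream, k, rng=random):
    # Vitter's Algorithm R: maintains a uniform sample of size k
    # over a stream whose total length is not known in advance.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i is kept with probability k/(i+1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Either way each selected block index is equally likely; the reservoir variant just pays extra bookkeeping to cope with not knowing the total.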

>
> When we select a block we should read all rows on that block, to help
> identify the extent of clustering within the data.
>

But we have no mechanism to store such information (nor to use it if it were
stored), and no way to prevent the resulting skew in the sample from
seriously distorting the estimates we do have ways of storing and using.

Cheers,

Jeff
