From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 19:03:02
Message-ID: 52A4C266.50000@agliodbs.com
Lists: pgsql-hackers
On 12/08/2013 10:14 AM, Greg Stark wrote:
> With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123
> With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125
> With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
> With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213
>
> I'm not really sure if this is due to the blocky sample combined with
> the skewed pgbench run or not. It doesn't seem to be consistently
> biasing towards or against bid 1 which I believe are the only rows
> that would have been touched by pgbench. Still it's suspicious that
> they seem to be consistently getting less accurate as the blockiness
> increases.
They will certainly do so if you don't apply any statistical adjustments
for selecting more rows from the same pages.
So there's a set of math designed to correct for the skew introduced
by reading *all* of the rows in each block. That's what I meant by
"block-based sampling": you read, say, 400 pages, you compile statistics
on *all* of the rows on those pages, and you apply algorithms that
adjust for how grouped the rows are. And you have a pretty good estimate
of how grouped they are, because you just looked at complete sets of
rows on a bunch of nonadjacent pages.
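To make that concrete, here's a rough sketch in Python (not ANALYZE
code; read_page() and the page picker are made up) of the kind of
cluster-aware estimate I mean: sample whole pages, and let the error bar
come from the spread between pages rather than pretending every row was
an independent draw.

# Sketch only: read_page() is a stand-in, not a real PostgreSQL API.
import random
import statistics

def sample_pages(total_pages, n_pages, read_page):
    """Pick n_pages whole pages at random and return every row on them."""
    page_ids = random.sample(range(total_pages), n_pages)
    return [read_page(pid) for pid in page_ids]   # list of per-page row lists

def estimate_fraction(pages, predicate):
    """Estimate the fraction of rows satisfying predicate.

    The point estimate is simply matches / rows sampled, but the standard
    error is taken from the spread of the per-page proportions (treating
    pages as roughly equal-sized clusters), so rows that clump onto the
    same pages widen the error bar instead of being counted as that many
    independent observations.
    """
    pages = [rows for rows in pages if rows]              # skip empty pages
    per_page = [sum(predicate(r) for r in rows) / len(rows) for rows in pages]
    total_rows = sum(len(rows) for rows in pages)
    p_hat = sum(sum(predicate(r) for r in rows) for rows in pages) / total_rows
    se = statistics.stdev(per_page) / len(per_page) ** 0.5
    return p_hat, se

With a predicate like row.bid == 1, that error bar widens as the
matching rows clump onto fewer pages, which is presumably the effect
showing up in Greg's numbers.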
Obviously, you need to look at more rows than you would with a
pure-random sample. Like I said, the 80%+ accuracy point in the papers
seemed to be at about a 5% sample. However, since those rows come from
the same pages, the cost of looking at more rows is quite small compared
to the cost of reading 64 times as many disk pages.
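To put made-up numbers on that (assuming ~64 rows per page and a
50,000-row sample target, neither of which comes from Greg's test):

# Back-of-the-envelope only; the numbers are assumptions, not measurements.
rows_per_page = 64          # assumed rows per heap page
target_rows = 50_000        # assumed sample size

pages_block_sample = target_rows // rows_per_page   # every row per page: ~780 page reads
pages_one_row_each = target_rows                    # one row per page: 50,000 page reads

print(pages_one_row_each / pages_block_sample)      # ~64x as many page reads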
My ACM subscription has lapsed, though; someone with a current one could
search for them. There are several published papers, with math and
pseudocode.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com