Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date: 2015-06-22 15:24:02
Message-ID: 55882892.9040100@2ndquadrant.com
Lists: pgsql-hackers

On 06/22/2015 07:21 AM, Jeff Janes wrote:
> On Sat, Jun 20, 2015 at 9:55 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Hi,
>
> On 06/20/2015 03:01 AM, Jeff Janes wrote:
>
> I don't think we need to really assume the density to be constant,
> maybe we can verify that while collecting the sample? I mean, we're
> already reading the pages, so we can track the density, and either
> do the correction or not.
>
>
> Maybe. I don't know how that would work.
>
>
> I was thinking about something like
>
> (1) compute average tuples per page
>
> (2) generate random sample of 'samplesize' blocks
>
> (3) for each block, see what is the actual number of tuples on the
> page and perform correction (e.g. if there's only 1/2 the average
> tuples, toss a coin and decide whether to really sample the
> block)
>
> (4) repeat until you have sample of the requested size
>
>
> Are you producing a block sample or a row sample? If a block is
> sampled, then what do you feed it to?

I'm not sure what the question is. The ultimate goal is to get a random
row sample, but we need to choose which blocks to sample first. So you
can just generate K random block numbers (where K is the requested sample
size). If a block number comes up once, sample one row from it; if it
comes up twice, sample two rows, and so on.

Of course, this is based on the assumption of uniform tuple density,
which allows you to choose a block first and then, independently, a row
from that block. Hence the correction idea I outlined in the previous message.
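
Roughly something like this in Python (just a sketch with made-up names -
rows_on_block() stands in for reading all tuples of a block - not the
actual ANALYZE code):

    import random
    from collections import Counter

    def two_stage_sample(nblocks, rows_on_block, sample_size, seed=0):
        rng = random.Random(seed)
        # how many times each block number came up among the K draws
        draws = Counter(rng.randrange(nblocks) for _ in range(sample_size))
        sample = []
        for blockno, k in draws.items():
            rows = rows_on_block(blockno)   # all tuples on this block
            # the uniform-density assumption says every block should have
            # enough rows; clamp just in case it does not
            sample.extend(rng.sample(rows, min(k, len(rows))))
        return sample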

> Are you just picking one row out of each block that survives step 3?
> If so, that would be similar to my idea of counting a given value
> only once per block it was in, except I was thinking of applying that
> only to n_distinct, not to the entire stats collection process.

No, I'm not doing that, and I think it's a bad idea in general. The
problem with the current sampling is that it does not produce a random
sample of rows, but something skewed, which causes serious trouble in
the estimators processing the sample. I think this 'count only once'
would result in similar issues (e.g. for low-cardinality columns).

>
> My answer was to take out the block sampling entirely and read the
> whole table. That is what probably won't work for you. (Come to think
> of it, I was hesitant to deploy custom code to production, so I never
> actually deployed that. Instead I cranked up the stats target and let
> the chips fall where they may)

Yeah, reading the whole table is not going to fly, I guess. But the
estimates would be very accurate! ;-)

>
> I think you want to go back to the table to find new blocks to
> replace ones which were included in the original block sample but
> ended up having no tuples in the end tuple sample, either because
> they were low density blocks, or just through the luck of the draw.
> But I don't see how that fixes the problem, unless you also prevent a
> block from contributing more than 1 tuple. And at that point, I worry
> about the uneven density again. If a block has 2 rows selected by
> pure random chance, I think it would be fine to keep only one (or, go
> find another random block to pull one row from as a replacement). But
> if it has 2 rows selected because it has more rows than average, then
> we would have to pull the replacement from a random block *of similar
> density* to avoid swapping one bias for another one.

Hmmm, maybe limiting the sample to just one tuple per block is a good
idea, but I have to think about that a bit more.

The basic idea behind the density correction is that when you get a block
with density significantly different from the average (say, by more than
10%?), you can treat it either as a partial block (density below average)
or as a collection of multiple blocks (density above average), and then
decide whether the rows drawn from it should really stay in the sample.

Handling the 'partial block' case seems rather simple - if the block has
half the average density, do a random coin toss and either keep or
discard the tuple.
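
In code, the coin toss might be something like this (again just a sketch
with made-up names; the tuple is kept with probability proportional to
the block's density relative to the average):

    import random

    def keep_from_sparse_block(block_ntuples, avg_ntuples,
                               rng=random.Random()):
        # a block with half the average density keeps its tuple only half
        # of the time, compensating for the higher per-row selection
        # probability in sparse blocks
        return rng.random() < min(1.0, block_ntuples / avg_ntuples)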

With high-density blocks it's probably more complicated, and I think it
really means treating the block as multiple 'virtual' blocks and only
keeping samples from the first one.

And of course, this needs to be repeated if some of the rows get evicted
by the correction mechanism, which will lead to slightly larger samples.
But I don't see that as a major problem (and the difference is not that
big, at least not in the case discussed earlier in this thread).
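
Putting it all together, the correction loop might look roughly like this
(only a sketch of the idea, with made-up names like rows_on_block() and
avg_rows, not a worked-out design):

    import math
    import random

    def corrected_sample(nblocks, rows_on_block, avg_rows, sample_size,
                         seed=0):
        rng = random.Random(seed)
        sample = []
        while len(sample) < sample_size:
            rows = rows_on_block(rng.randrange(nblocks))
            if not rows:
                continue                    # empty block: draw another one
            ratio = len(rows) / avg_rows    # density relative to the average
            idx = rng.randrange(len(rows))  # candidate row from this block
            if ratio <= 1.0:
                # sparse block: the coin toss described above
                if rng.random() < ratio:
                    sample.append(rows[idx])
            else:
                # dense block: treat it as ceil(ratio) 'virtual' blocks and
                # keep the candidate only if it falls into the first one
                if idx < len(rows) / math.ceil(ratio):
                    sample.append(rows[idx])
            # rejected candidates simply loop around, which is the 'repeat
            # until the requested sample size is reached' step
        return sample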

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
