Re: Yet another abort-early plan disaster on 9.3

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-03 19:58:46
Message-ID: CAMkU=1z4GBD2re0_5kpYrPCi9Rdj-9p_tuUE2J5nA++sbmTx0A@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> On 10/02/2014 02:30 AM, Peter Geoghegan wrote:
> > On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon(at)2ndquadrant(dot)com>
> > wrote:
> >> Having read papers on it, I believe the problem is intractable. Coding
> >> is not the issue. To anyone: please prove me wrong, in detail, with
> >> references so it can be coded.
> >
> > I think it might be close to intractable if you're determined to use a
> > sampling model. HyperLogLog looks very interesting for n_distinct
> > estimation, though. My abbreviated key patch estimates the cardinality
> > of abbreviated keys (and original strings that are to be sorted) with
> > high precision and fixed overhead. Maybe we can figure out a way to
> > do opportunistic streaming of HLL. Believe it or not, the way I use
> > HLL for estimating cardinality is virtually free. Hashing is really
> > cheap when the CPU is bottlenecked on memory bandwidth.
>
> Yes, it's only intractable if you're wedded to the idea of a tiny,
> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
> can get a MUCH more accurate n_distinct estimate using multiple
> algorithms, of which HLL is one. While n_distinct will still have some
> variance, it'll be over a much smaller range.
>
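
(For reference, here is a minimal HyperLogLog sketch in Python, purely to
illustrate the technique under discussion -- it is not the implementation in
the abbreviated-key patch, and the register count and hash function are
arbitrary choices. The top bits of a 64-bit hash pick a register, each
register keeps the maximum leading-zero rank it has seen, and the cardinality
estimate comes from the harmonic mean of the registers, with the usual
linear-counting correction for small counts.)

import hashlib
import math

class HyperLogLog:
    def __init__(self, b=14):
        self.b = b
        self.m = 1 << b                     # 2^14 registers, ~0.8% relative error
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1.0 + 1.079 / self.m)

    def add(self, value):
        # 64-bit hash of the value's text form
        h = int.from_bytes(
            hashlib.blake2b(str(value).encode(), digest_size=8).digest(), "big")
        idx = h >> (64 - self.b)            # top b bits pick a register
        rest = h & ((1 << (64 - self.b)) - 1)
        rank = (64 - self.b) - rest.bit_length() + 1   # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:   # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog()
for i in range(1_000_000):
    hll.add(i % 250_000)                    # 1M values, 250k distinct
print(round(hll.estimate()))                # typically within ~1% of 250,000

With 2^14 registers the sketch is about 16 kB and the relative error is
typically under 1%, which is why maintaining it alongside a sort can be so
cheap.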

In my hands, the problems with poor n_distinct were due not to the
insufficient size of the sample, but to its insufficient randomness.
Increasing default_statistics_target did help, not because it increases the
number of rows sampled, but because it increases the number of blocks sampled.
Once substantially all of the blocks are part of the block sampling, the bias
is eliminated, even though only a small fraction of the rows (roughly one per
block) is still being sampled.
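
To make the block-level bias concrete, here is a toy simulation in Python. It
is not ANALYZE's actual two-stage algorithm; the perfectly clustered layout,
the 1% sample size, and the take-every-row-in-a-chosen-block scheme are all
assumptions chosen to make the effect easy to see:

import random

ROWS_PER_BLOCK = 100
N_BLOCKS = 10_000                 # 1,000,000 rows in total
SAMPLE_ROWS = 10_000              # sample 1% of the rows either way

# Worst-case clustered layout: every row in a block has the same value.
table = [[blk] * ROWS_PER_BLOCK for blk in range(N_BLOCKS)]

# Scheme 1: pick whole blocks at random and keep every row in them.
chosen = random.sample(range(N_BLOCKS), SAMPLE_ROWS // ROWS_PER_BLOCK)
block_sample = [v for blk in chosen for v in table[blk]]

# Scheme 2: a uniform random sample of individual rows.
all_rows = [v for block in table for v in block]
row_sample = random.sample(all_rows, SAMPLE_ROWS)

print("distinct values seen, block sample:", len(set(block_sample)))  # 100
print("distinct values seen, row sample:  ", len(set(row_sample)))    # ~6,300

Both samples contain the same number of rows, but any estimator fed the block
sample starts from far less information about how many distinct values exist.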

So one idea would be to get rid of the two-stage sampling algorithm (sample
blocks, then sample rows from the chosen blocks) and just read the whole table
and sample rows from it without bias, at least under some conditions. Some
low-level benchmarking on my favorite server showed that reading 1% of a
system's blocks (in block-number order within each file) was no faster, from
an I/O perspective, than reading all of them. But that is a virtualized server
that wasn't really specced out to be an I/O-intensive database server in the
first place. It would be interesting to see what people get on real hardware
that they actually designed for the task.
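
For anyone who wants to repeat that comparison on their own hardware, a rough
sketch of the kind of measurement I mean is below (in Python; the file path is
hypothetical -- point it at a relation segment file -- the 8 kB block size and
1% stride are assumptions, and you would want to drop the OS page cache
between runs, since otherwise the second pass mostly measures memory
bandwidth):

import os, time

PATH = "/path/to/relation/segment"   # hypothetical path
BLOCK = 8192                         # PostgreSQL's default block size

def read_blocks(path, stride):
    """Read every stride-th BLOCK-sized chunk in block-number order."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        nblocks = size // BLOCK
        total = 0
        for blk in range(0, nblocks, stride):
            os.lseek(fd, blk * BLOCK, os.SEEK_SET)
            total += len(os.read(fd, BLOCK))
        return total
    finally:
        os.close(fd)

for label, stride in (("1% of blocks", 100), ("all blocks", 1)):
    t0 = time.time()
    nbytes = read_blocks(PATH, stride)
    print(f"{label}: {nbytes / 1e6:.0f} MB in {time.time() - t0:.1f} s")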

A problem right now is that we only have one knob. I want to compute more
accurate n_distinct and most_common_freqs, but I don't want to store huge
numbers of entries for most_common_vals and histogram_bounds.

Cheers,

Jeff
