Re: Estimating number of distinct values.

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Estimating number of distinct values.
Date: 2018-10-24 14:47:37
Message-ID: CAMkU=1wzo9B+6BXYMH0btTYudv9-cigPdd-64Exr852oYL_2mw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 24, 2018 at 10:07 AM Konstantin Knizhnik <
k(dot)knizhnik(at)postgrespro(dot)ru> wrote:

>
> Real number of distinct value for this dataset is about 10 millions. For
> some reasons, sampling using random blocks and Vitter algorithm produces
> worser results than just examining first 30000 rows of the table:

It is a known problem with our sampling method that once a block is chosen,
it then oversamples rows from that block[1]. So you get too many blocks
with 0 rows sampled or 2 or more rows samples, and too few with exactly one
row sampled. If rows with the same value are clustered together into same
blocks, this will find too many duplicates and really skew the Duj1
estimate, because we feeding it a biased sample.

Tomas was working on a patch to make the sampling truly random[2], but I
think he abandoned it to work on the multivariate statistics instead. It
is hard to tell if the IO implications of no longer reading sampled blocks
in physical order would be acceptable, because everyone has different
hardware, data, and ideas of what is acceptable.

[1]
https://www.postgresql.org/message-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com

[2] https://www.postgresql.org/message-id/5588D644.1000909%402ndquadrant.com

Cheers,

Jeff

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Hironobu SUZUKI 2018-10-24 14:55:57 Re: pgbench - add pseudo-random permutation function
Previous Message David G. Johnston 2018-10-24 14:40:00 Re: JSON validation behavior