From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-04-20 05:13:37
Message-ID: CAMkU=1wxHZEWGh98rg-2wq1HPbWxs6Cf_MTwHTg92gT=QOdZcQ@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Apr 14, 2015 at 11:45 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Tue, Mar 31, 2015 at 12:02 PM, Tomas Vondra <
> tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>> Hi all,
>>
>> attached is v4 of the patch implementing adaptive ndistinct estimator.
>>
>
> Hi Tomas,
>
> I have a case here where the adaptive algorithm underestimates ndistinct
> by a factor of 7 while the default estimator is pretty close.
>
> 5MB file:
>
>
> https://drive.google.com/file/d/0Bzqrh1SO9FcETU1VYnQxU2RZSWM/view?usp=sharing
>
> # create table foo2 (x text);
> # \copy foo2 from program 'bzcat ~/temp/foo1.txt.bz2'
> # analyze verbose foo2;
> INFO: analyzing "public.foo2"
> INFO: "foo2": scanned 6021 of 6021 pages, containing 1113772 live rows
> and 0 dead rows; 30000 rows in sample, 1113772 estimated total rows
> WARNING: ndistinct estimate current=998951.78 adaptive=135819.00
>

I've done a more complete analysis with a real-world database I have access
to.

I've analyzed patched and current with a default_statistics_target of 100 and
of 10000. (I also have some of the same data under 9.2, but that is not
meaningfully different from unpatched head.)

For easier interpretation I hacked up the analyzer so that it just reports
the estimated number, never converting to the negative fraction.
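To make the "negative fraction" remark concrete: my understanding is that
pg_statistic stores stadistinct as a negative value when the distinct count is
expected to scale with the table, meaning -stadistinct is the fraction of total
rows. A minimal sketch of the decoding (the function name is mine, not
PostgreSQL's):

```python
def decode_stadistinct(stadistinct, ntuples):
    """Convert a pg_stats n_distinct value back to an absolute estimate.

    Negative values mean "a fraction of the total row count"; positive
    values are already absolute counts. (This mirrors, to my understanding,
    the convention PostgreSQL uses for stadistinct.)
    """
    if stadistinct < 0:
        return -stadistinct * ntuples  # e.g. -0.5 on 1M rows => 500000
    return stadistinct

print(decode_stadistinct(-0.5, 1_000_000))  # fraction form
print(decode_stadistinct(42.0, 1_000_000))  # absolute form
```

The hack mentioned above just skips the encoding step so the spreadsheet shows
absolute numbers throughout.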

See the spreadsheet here:

https://docs.google.com/spreadsheets/d/1qUcBoQkRFFcSDq7GtkiQkHqlLTbxQYl5hh6S0byff2M/edit?usp=sharing

The 10000-target data was initially collected in an attempt to discern the
truth when the 100-target methods disagreed, but I decided to just collect the
gold-standard truth directly instead.

The truth is given by:

  select count(*)
  from (select distinct column
        from schema.table
        where column is not null) foo;

And number_not_null is given by:

  select count(*)
  from schema.table
  where column is not null;

It looks like the proposed method sometimes overestimates, although never
by a huge amount, while the old one never overestimated. Overall it mostly
seems to be more accurate, but occasionally it does substantially worse
than the current method. I suspect most of the problems are related to the
same issue reported in my last email. There are a lot of places where
both underestimate, but where the new method does so by less than head does.
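For reference, my understanding is that the estimator in unpatched head is the
Haas-Stokes "Duj1" formula from analyze.c; a minimal sketch under that
assumption:

```python
def duj1_estimate(n, N, d, f1):
    """Haas-Stokes Duj1 ndistinct estimate (my reading of head's analyze.c).

    n  = sample size, N = estimated total rows,
    d  = distinct values seen in the sample,
    f1 = values seen exactly once in the sample.
    """
    return n * d / (n - f1 + f1 * n / N)

# When every sampled value is unique (d == f1 == n), the estimate
# collapses to N; with no singletons (f1 == 0) it is just d.
print(duj1_estimate(30000, 1113772, 30000, 30000))
print(duj1_estimate(100, 1000, 50, 0))
```

Note that with f1 == 0 the estimator can never exceed the sample's own
distinct count, which is consistent with head never overestimating in the
spreadsheet.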

If there are any columns anyone wants to examine further, give me the token
for it and I'll try to create a generator that generates data with the same
distribution (and clustering, if that seems relevant).

Cheers,

Jeff
