Re: Perfomance of IN-clause with many elements and possible solutions

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dmitry Lazurkin <dilaz03(at)gmail(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Perfomance of IN-clause with many elements and possible solutions
Date: 2017-07-25 05:09:07
Message-ID: CAKFQuwbR+SyKU_-Qicc_uTQW6gJhwFF+TOco-hw9NPMcKQ9X1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, Jul 24, 2017 at 8:11 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> ​[*docs]
> If the data were perfectly distributed, with the same
> * number of tuples going into each available bucket, then the bucketsize
> * fraction would be 1/nbuckets. But this happy state of affairs will
> occur
> * only if (a) there are at least nbuckets distinct data values, and (b)
> * we have a not-too-skewed data distribution. Otherwise the buckets will
> * be nonuniformly occupied.

​Thanks, I have a better feel now. Using this example (200 inner relation
rows) is pretty poor since at this scale there doesn't seem to be enough
data to make a noticeable difference.

But anyway, the above comment is only being applied when dealing with a
non-unique ​inner relation; however, the fraction used is 1/nbuckets for
any unique relation regardless of its size.

if (IsA(inner_path, UniquePath))
innerbucketsize = 1.0 / virtualbuckets;
else

And to clarify for others only reading this...the 200 on the "VALUES" node
is there because there are 200 literal values in the value_list. The 200
on the resulting Hash (and HashAggregate in the example) node is there
because of DEFAULT_NUM_DISTINCT (changing the query limit to 300 only
changed the former). Further, since it is only the default, the fraction
used charged out is 1/10 instead of 1/200 that would used if the 200 were a
real number instead - or 1/1024 if those 200 rows were known to be
themselves unique.

For me, I'm seeing that the expected number of input rows doesn't factor
into the innerbucketsize computation directly (possibly excepting a scaling
factor adjustment).

I can understand better, now, why this seemingly perfect example of a
semi-join query gets executed with an extra distinct/grouping node.

David J.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Amee Sankhesara - Quipment India 2017-07-25 06:33:12 Major Version Upgradation from 9.4 to 9.6 in Replication Environment
Previous Message Tom Lane 2017-07-25 03:11:09 Re: Perfomance of IN-clause with many elements and possible solutions