Re: Reducing planning time of large IN queries on primary key / unique columns

From: Souvik Bhattacherjee <pgsdbhacker(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reducing planning time of large IN queries on primary key / unique columns
Date: 2022-08-11 22:41:27
Message-ID: CAJCGbZMegzV3hMToqvsatMU5PAtFj414nPkrL2g1FDaHzPOxHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

(Sorry about the re-post. Another attempt at fixing the formatting)

At ServiceNow, we frequently encounter queries with very large IN lists
where the number of elements in the IN list range from a few hundred to
several thousand. For a significant fraction of the queries, the IN clauses
are constructed on primary key columns. While planning these queries,
Postgres query planner loops over every element in the IN clause, computing
the selectivity of each element and then uses that as an input to compute
the total selectivity of the IN clause. For IN clauses on primary key or
unique columns, it is easy to see that the selectivity of the IN predicate
is given by (number of elements in the IN clause / table cardinality) and
is independent of the selectivity of the individual elements. We use this
observation to avoid computing the selectivities of the individual
elements. This results in an improvement in the planning time especially
when the number of elements in the IN clause is relatively large.

The table below demonstrates the improvement in planning time in
milliseconds (averaged over 3 runs) for IN queries of the form SELECT
COUNT(*) FROM table_a WHERE sys_id IN ('000356e61b568510eabcca2b234bcb08', '
00035846db2f24101ad7f256b9961925', ...). Here sys_id is the primary key
column of type VARCHAR(32) and the table cardinality of table_a is around
10M.

# IN elements | Plan time w/o opt | Plan time w/ opt | Speedup

-------------------
|-------------------------|-----------------------|--------------

500 | 0.371 | 0.236 |
1.57

5000 | 2.019 | 0.874 |
2.31

50000 | 19.886 | 8.273 | 2.40

Similar to IN clauses, the selectivity of NOT IN clauses on a primary key
or unique column can be computed by not computing the selectivities of
individual elements. The query used is of the form SELECT COUNT(*) FROM
table_a WHERE sys_id NOT IN ('000356e61b568510eabcca2b234bcb08', '
00035846db2f24101ad7f256b9961925', ...).

# NOT IN elements | Plan time w/o opt | Plan time w/ opt | Speedup

---------------------------|-------------------------|----------------------|--------------

500 | 0.380 | 0.248
| 1.53

5000 | 2.534 | 0.854
| 2.97

50000 | 21.316 | 9.009
| 2.36

We also obtain planning time of queries on a primary key column of type
INTEGER with 10M elements for both IN and NOT in queries.

# IN elements | Plan time w/o opt | Plan time w/ opt | Speedup

--------------------|-------------------------|----------------------|--------------

500 | 0.370 | 0.208 |
1.78

5000 | 1.998 | 0.816 |
2.45

50000 | 18.073 | 6.750 | 2.67

# NOT IN elements | Plan time w/o opt | Plan time w/ opt | Speedup

--------------------------
|------------------------|------------------------|--------------

500 | 0.342 | 0.203
| 1.68

5000 | 2.073 | 0.822
| 3.29

50000 |19.551 | 6.738
| 2.90

We see that the planning time of queries on unique columns are identical to
that we observed for primary key columns. The resulting patch file for the
changes above is small and we are happy to polish it up and share.

Best,

Souvik Bhattacherjee

(ServiceNow)

On Thu, Aug 11, 2022 at 3:04 PM Souvik Bhattacherjee <pgsdbhacker(at)gmail(dot)com>
wrote:

> (Re-posting with better formatting)
>
> Hi hackers,
>
> At ServiceNow, we frequently encounter queries with very large IN lists
> where the number of elements in the IN list range from a
>
> few hundred to several thousand. For a significant fraction of the
> queries, the IN clauses are constructed on primary key columns.
>
> While planning these queries, Postgres query planner loops over every
> element in the IN clause, computing the selectivity of each
>
> element and then uses that as an input to compute the total selectivity of
> the IN clause. For IN clauses on primary key or unique
>
> columns, it is easy to see that the selectivity of the IN predicate is
> given by (number of elements in the IN clause / table cardinality)
>
> and is independent of the selectivity of the individual elements. We use
> this observation to avoid computing the selectivities of the
>
> individual elements. This results in an improvement in the planning time
> especially when the number of elements in the IN clause
>
> is relatively large.
>
>
>
> The table below demonstrates the improvement in planning time (averaged
> over 3 runs) for IN queries of the form
>
> SELECT COUNT(*) FROM table_a WHERE sys_id IN
> ('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925',
> ...).
>
> Here sys_id is the primary key column of type VARCHAR(32) and the table
> cardinality of table_a is around 10M.
>
>
> Number of IN list elements | Planning time w/o optimization (in ms) | Planning
> time w/ optimization (in ms) | Speedup
>
> ------------------------------------|---------------------------------------------------
> | --------------------------------------------------|--------------
>
> 500 | 0.371
> | 0.236
> | 1.57
>
> 5000 | 2.019
> | 0.874
> | 2.31
>
> 50000 | 19.886
> | 8.273
> | 2.40
>
>
>
> Similar to IN clauses, the selectivity of NOT IN clauses on a primary key
> or unique column can be computed by not computing the
>
> selectivities of individual elements. The query used is of the form SELECT
> COUNT(*) FROM table_a WHERE sys_id NOT IN
>
> ('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925',
> ...).
>
>
> Number of NOT IN list elements | Planning time w/o optimization (in ms) | Planning
> time w/ optimization (in ms) | Speedup
>
> -------------------------------------------|---------------------------------------------------
> | --------------------------------------------------|--------------
>
> 500 | 0.380
> | 0.248
> | 1.53
>
> 5000 | 2.534
> | 0.854
> | 2.97
>
> 50000 | 21.316
> | 9.009
> | 2.36
>
>
>
> We also obtain planning time of queries on a primary key column of type
> INTEGER with 10M elements for both IN and NOT in queries.
>
>
> Number of IN list elements | Planning time w/o optimization (in ms) | Planning
> time w/ optimization (in ms) | Speedup
>
> ------------------------------------|---------------------------------------------------
> | --------------------------------------------------|--------------
>
> 500 | 0.370
> | 0.208
> | 1.78
>
> 5000 | 1.998
> | 0.816
> | 2.45
>
> 50000 | 18.073
> | 6.750
> | 2.67
>
>
> Number of NOT IN list elements | Planning time w/o optimization (in ms) | Planning
> time w/ optimization (in ms) | Speedup
>
> -------------------------------------------|---------------------------------------------------
> | --------------------------------------------------|--------------
>
> 500 | 0.342
> | 0.203
> | 1.68
>
> 5000 | 2.073
> | 0.822
> | 3.29
>
> 50000 |19.551
> | 6.738
> | 2.90
>
>
>
> We see that the planning time of queries on unique columns are identical
> to that we observed for primary key columns.
>
> The resulting patch file for the changes above is small and we are happy
> to polish it up and share.
>
>
> Best,
>
> Souvik Bhattacherjee
>
> (ServiceNow)
>
> On Thu, Aug 11, 2022 at 2:42 PM Souvik Bhattacherjee <
> pgsdbhacker(at)gmail(dot)com> wrote:
>
>> Hi hackers,
>>
>> At ServiceNow, we frequently encounter queries with very large IN lists
>> where the number of elements in the IN list range from a few hundred to
>> several thousand. For a significant fraction of the queries, the IN clauses
>> are constructed on primary key columns. While planning these queries,
>> Postgres query planner loops over every element in the IN clause, computing
>> the selectivity of each element and then uses that as an input to compute
>> the total selectivity of the IN clause. For IN clauses on primary key or
>> unique columns, it is easy to see that the selectivity of the IN predicate
>> is given by (number of elements in the IN clause / table cardinality) and
>> is independent of the selectivity of the individual elements. We use this
>> observation to avoid computing the selectivities of the individual
>> elements. This results in an improvement in the planning time especially
>> when the number of elements in the IN clause is relatively large.
>>
>>
>>
>> The table below demonstrates the improvement in planning time (averaged
>> over 3 runs) for IN queries of the form SELECT COUNT(*) FROM table_a WHERE
>> sys_id IN ('000356e61b568510eabcca2b234bcb08',
>> '00035846db2f24101ad7f256b9961925', ...). Here sys_id is the primary key
>> column of type VARCHAR(32) and the table cardinality of table_a is around
>> 10M.
>>
>>
>>
>> Number of IN list elements
>>
>> Planning time w/o optimization (in ms)
>>
>> Planning time w/ optimization (in ms)
>>
>> Speedup
>>
>> 500
>>
>> 0.371
>>
>> 0.236
>>
>> 1.57
>>
>> 5000
>>
>> 2.019
>>
>> 0.874
>>
>> 2.31
>>
>> 50000
>>
>> 19.886
>>
>> 8.273
>>
>> 2.40
>>
>>
>>
>> Similar to IN clauses, the selectivity of NOT IN clauses on a primary key
>> or unique column can be computed by not computing the selectivities of
>> individual elements. The query used is of the form SELECT COUNT(*) FROM
>> table_a WHERE sys_id NOT IN ('000356e61b568510eabcca2b234bcb08',
>> '00035846db2f24101ad7f256b9961925', ...).
>>
>>
>>
>> Number of NOT IN list elements
>>
>> Planning time w/o optimization (in ms)
>>
>> Planning time w/ optimization (in ms)
>>
>> Speedup
>>
>> 500
>>
>> 0.380
>>
>> 0.248
>>
>> 1.53
>>
>> 5000
>>
>> 2.534
>>
>> 0.854
>>
>> 2.97
>>
>> 50000
>>
>> 21.316
>>
>> 9.009
>>
>> 2.36
>>
>>
>>
>> We also obtain planning time of queries on a primary key column of type
>> INTEGER with 10M elements for both IN and NOT in queries.
>>
>>
>> Number of IN list elements
>>
>> Planning time w/o optimization (in ms)
>>
>> Planning time w/ optimization (in ms)
>>
>> Speedup
>>
>> 500
>>
>> 0.370
>>
>> 0.208
>>
>> 1.78
>>
>> 5000
>>
>> 1.998
>>
>> 0.816
>>
>> 2.45
>>
>> 50000
>>
>> 18.073
>>
>> 6.750
>>
>> 2.67
>>
>>
>>
>>
>> Number of NOT IN list elements
>>
>> Planning time w/o optimization (in ms)
>>
>> Planning time w/ optimization (in ms)
>>
>> Speedup
>>
>> 500
>>
>> 0.342
>>
>> 0.203
>>
>> 1.68
>>
>> 5000
>>
>> 2.073
>>
>> 0.822
>>
>> 3.29
>>
>> 50000
>>
>> 19.551
>>
>> 6.738
>>
>> 2.90
>>
>>
>>
>> We see that the planning time of queries on unique columns are identical
>> to that we observed for primary key columns. The resulting patch file for
>> the changes above is small and we are happy to polish it up and share.
>>
>>
>> Best,
>>
>> Souvik Bhattacherjee
>>
>> (ServiceNow)
>>
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2022-08-11 22:47:00 Re: Use SetInstallXLogFileSegmentActive() for setting XLogCtl->InstallXLogFileSegmentActive
Previous Message Jacob Champion 2022-08-11 22:28:31 Re: [PATCH] Expose port->authn_id to extensions and triggers