Re: speeding up planning with partitions

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Amit Langote <amitlangote09(at)gmail(dot)com>
Cc: Imai Yoshikazu <yoshikazu_i443(at)live(dot)jp>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, "Imai, Yoshikazu" <imai(dot)yoshikazu(at)jp(dot)fujitsu(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speeding up planning with partitions
Date: 2019-04-01 04:55:50
Message-ID: 993dedcc-6079-ad40-b087-00d139c00203@lab.ntt.co.jp
Lists: pgsql-hackers

(I've closed the CF entry: https://commitfest.postgresql.org/22/1778/)

On 2019/04/01 2:04, Tom Lane wrote:
> Amit Langote <amitlangote09(at)gmail(dot)com> writes:
>> On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu <yoshikazu_i443(at)live(dot)jp> wrote:
>>> Certainly, using bitmapset contributes to the performance when scanning
>>> one partition(few partitions) from large partitions.
>
>> Thanks Imai-san for testing.
>
> I tried to replicate these numbers with the code as-committed, and
> could not.

Thanks for that.

> What I get, using the same table-creation code as you
> posted and a pgbench script file like
>
> \set param random(1, :N)
> select * from rt where a = :param;
>
> is scaling like this:
>
>     N    tps, range      tps, hash
>
>     2    10520.519932    10415.230400
>     8    10443.361457    10480.987665
>    32    10341.196768    10462.551167
>   128    10370.953849    10383.885128
>   512    10207.578413    10214.049394
>  1024    10042.794340    10121.683993
>  4096     8937.561825     9214.993778
>  8192     8247.614040     8486.728918
>
> If I use "-M prepared" the numbers go up a bit for lower N, but
> drop at high N:
>
>     N    tps, range      tps, hash
>
>     2    11449.920527    11462.253871
>     8    11530.513146    11470.812476
>    32    11372.412999    11450.213753
>   128    11289.351596    11322.698856
>   512    11095.428451    11200.683771
>  1024    10757.646108    10805.052480
>  4096     8689.165875     8930.690887
>  8192     7301.609147     7502.806455
>
> Digging into that, it seems like the degradation with -M prepared is
> mostly in LockReleaseAll's hash_seq_search over the locallock hash table.
> What I think must be happening is that with -M prepared, at some point the
> plancache decides to try a generic plan, which causes opening/locking all
> the partitions, resulting in permanent bloat in the locallock hash table.
> We immediately go back to using custom plans, but hash_seq_search has
> more buckets to look through for the remainder of the process' lifetime.

Ah, we did find this to be a problem upthread [1] and Tsunakawa-san then
even posted a patch which is being discussed at:

https://commitfest.postgresql.org/22/1993/

> I do see some cycles getting spent in apply_scanjoin_target_to_paths
> that look to be due to scanning over the long part_rels array,
> which your proposal would ameliorate. But (a) that's pretty small
> compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths
> is a remarkable display of brute force inefficiency to begin with.
> I think we should see if we can't nuke that function altogether in
> favor of generating the paths with the right target the first time.

That's an option if we can make it work.

Shouldn't we look at *all* of the places that have code that now looks like
this:

    for (i = 0; i < rel->nparts; i++)
    {
        RelOptInfo *partrel = rel->part_rels[i];

        if (partrel == NULL)
            continue;
        ...
    }

Besides apply_scanjoin_target_to_paths(), there are:

create_partitionwise_grouping_paths()
make_partitionedrel_pruneinfo()

> BTW, the real elephant in the room is the O(N^2) cost of creating
> these tables in the first place. The runtime for the table-creation
> scripts looks like
>
>     N    range        hash
>
>     2    0m0.011s     0m0.011s
>     8    0m0.015s     0m0.014s
>    32    0m0.032s     0m0.030s
>   128    0m0.132s     0m0.099s
>   512    0m0.969s     0m0.524s
>  1024    0m3.306s     0m1.442s
>  4096    0m46.058s    0m15.522s
>  8192    3m11.995s    0m58.720s
>
> This seems to be down to the expense of doing RelationBuildPartitionDesc
> to rebuild the parent's relcache entry for each child CREATE TABLE.
> Not sure we can avoid that, but maybe we should consider adopting a
> cheaper-to-read representation of partition descriptors. The fact that
> range-style entries seem to be 3X more expensive to load than hash-style
> entries is strange.

I've noticed this many times too, but never prioritized doing something
about it. I'll try sometime.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f-dn1hDZqObwdMrYdV7-cELJwWCPRWet6EQX_WaV8JLgw%40mail.gmail.com
