Re: speeding up planning with partitions

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Langote <amitlangote09(at)gmail(dot)com>
Cc: Imai Yoshikazu <yoshikazu_i443(at)live(dot)jp>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, "Imai, Yoshikazu" <imai(dot)yoshikazu(at)jp(dot)fujitsu(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speeding up planning with partitions
Date: 2019-03-31 17:04:27
Lists: pgsql-hackers

Amit Langote <amitlangote09(at)gmail(dot)com> writes:
> On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu <yoshikazu_i443(at)live(dot)jp> wrote:
>> Certainly, using a bitmapset contributes to the performance when
>> scanning one partition (or a few partitions) out of many partitions.

> Thanks Imai-san for testing.

I tried to replicate these numbers with the code as-committed, and
could not. What I get, using the same table-creation code as you
posted and a pgbench script file like

\set param random(1, :N)
select * from rt where a = :param;

is scaling like this:

   N      tps, range      tps, hash

   2      10520.519932    10415.230400
   8      10443.361457    10480.987665
  32      10341.196768    10462.551167
 128      10370.953849    10383.885128
 512      10207.578413    10214.049394
1024      10042.794340    10121.683993
4096       8937.561825     9214.993778
8192       8247.614040     8486.728918
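(The table-creation code referred to above is not reproduced in this message. A minimal sketch of what the range-partitioned setup might look like — the table name rt and column a come from the pgbench script; the one-value-per-partition bounds are an assumption:)

```sql
-- Hypothetical reconstruction of the range-partitioned benchmark table.
-- Names rt and a match the pgbench script; partition bounds are assumed.
CREATE TABLE rt (a int) PARTITION BY RANGE (a);
CREATE TABLE rt_1 PARTITION OF rt FOR VALUES FROM (1) TO (2);
CREATE TABLE rt_2 PARTITION OF rt FOR VALUES FROM (2) TO (3);
-- ... continuing in the same pattern up to rt_N
```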

If I use "-M prepared" the numbers go up a bit for lower N, but
drop at high N:

   N      tps, range      tps, hash

   2      11449.920527    11462.253871
   8      11530.513146    11470.812476
  32      11372.412999    11450.213753
 128      11289.351596    11322.698856
 512      11095.428451    11200.683771
1024      10757.646108    10805.052480
4096       8689.165875     8930.690887
8192       7301.609147     7502.806455

Digging into that, it seems like the degradation with -M prepared is
mostly in LockReleaseAll's hash_seq_search over the locallock hash table.
What I think must be happening is that with -M prepared, at some point the
plancache decides to try a generic plan, which causes opening/locking all
the partitions, resulting in permanent bloat in the locallock hash table.
We immediately go back to using custom plans, but hash_seq_search has
more buckets to look through for the remainder of the process' lifetime.
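(The plancache behavior described here can be reproduced by hand. A prepared statement gets custom plans for its first five executions; on the sixth, the plancache builds a generic plan to compare costs, which opens and locks every partition. A sketch, assuming a partitioned table rt as in the pgbench script above:)

```sql
PREPARE q AS SELECT * FROM rt WHERE a = $1;
EXECUTE q(1);   -- executions 1-5 use custom plans; only the one
EXECUTE q(1);   -- matching partition is opened and locked
EXECUTE q(1);
EXECUTE q(1);
EXECUTE q(1);
EXECUTE q(1);   -- 6th execution: the plancache tries a generic plan,
                -- locking all N partitions and permanently enlarging
                -- the session's locallock hash table
```

(On v12-era servers, SET plan_cache_mode = force_custom_plan should suppress the generic-plan attempt entirely, which would be one way to confirm that this is where the degradation comes from.)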

I do see some cycles getting spent in apply_scanjoin_target_to_paths
that look to be due to scanning over the long part_rels array,
which your proposal would ameliorate. But (a) that's pretty small
compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths
is a remarkable display of brute force inefficiency to begin with.
I think we should see if we can't nuke that function altogether in
favor of generating the paths with the right target the first time.

BTW, the real elephant in the room is the O(N^2) cost of creating
these tables in the first place. The runtime for the table-creation
scripts looks like

   N      range        hash

   2      0m0.011s     0m0.011s
   8      0m0.015s     0m0.014s
  32      0m0.032s     0m0.030s
 128      0m0.132s     0m0.099s
 512      0m0.969s     0m0.524s
1024      0m3.306s     0m1.442s
4096      0m46.058s    0m15.522s
8192      3m11.995s    0m58.720s
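(The creation scripts are not shown in this message either; a sketch of how such a script might generate the N range partitions — names and bounds are assumptions. The quadratic behavior falls out directly: each CREATE TABLE invalidates the parent's relcache entry, so the next iteration's RelationBuildPartitionDesc must rescan all i-1 partitions created so far.)

```sql
-- Hypothetical generator for the range case, assuming the parent
-- exists as: CREATE TABLE rt (a int) PARTITION BY RANGE (a);
DO $$
BEGIN
  FOR i IN 1..8192 LOOP
    -- Each iteration rebuilds the parent's partition descriptor over
    -- all previously created children, giving O(N^2) total work.
    EXECUTE format(
      'CREATE TABLE rt_%s PARTITION OF rt FOR VALUES FROM (%s) TO (%s)',
      i, i, i + 1);
  END LOOP;
END $$;
```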

This seems to be down to the expense of doing RelationBuildPartitionDesc
to rebuild the parent's relcache entry for each child CREATE TABLE.
Not sure we can avoid that, but maybe we should consider adopting a
cheaper-to-read representation of partition descriptors. The fact that
range-style entries seem to be 3X more expensive to load than hash-style
entries is strange.

regards, tom lane
