RE: Parallel INSERT (INTO ... SELECT ...)

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
Cc: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Amit Langote <amitlangote09(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>
Subject: RE: Parallel INSERT (INTO ... SELECT ...)
Date: 2021-02-26 06:26:00
Message-ID: OS0PR01MB5716F34C91F38318B7B26F56949D9@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> I add some code to track the time spent in index operation.
> From the results[1], we can see more workers will bring more cost in
> _bt_search_insert() in each worker.
> After debugged, the most cost part is the following:
> -----
> /* drop the read lock on the page, then acquire one on its child
> */
> *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
> -----
>
> It seems the order of parallel bitmap scan's result will result in more lock time in
> parallel insert.
> [1]---------------parallel bitmap scan------------------ worker 0:
> psql:test.sql:10: INFO: insert index _bt_search_insert time:834735
> psql:test.sql:10: INFO: insert index total time:1895330
> psql:test.sql:10: INFO: insert tuple time:628064
>
> worker 2:
> psql:test.sql:10: INFO: insert index _bt_search_insert time:1552242
> psql:test.sql:10: INFO: insert index total time:2374741
> psql:test.sql:10: INFO: insert tuple time:314571
>
> worker 4:
> psql:test.sql:10: INFO: insert index _bt_search_insert time:2496424
> psql:test.sql:10: INFO: insert index total time:3016150
> psql:test.sql:10: INFO: insert tuple time:211741
> ----------------------------
>
> Based on above, I tried to change the order of results that bitmapscan return.
> In the original test, we prepare data in order (like: generate_series(1,10000,1)), If
> we change the order we insert the data in the source table, the performance
> degradation will not always happen[2].
> And table size difference will be small.
>
> -------------------out of order source table-----------------------------
> insert into x(a,b,c) select i,i+1,i+2 from generate_series(1,600000000) as t(i)
> order by random();
> ----------------------------------------------------------------------------
>
> Test results when source table out of order(using bitmap heap scan):
> [2]--------------------------------------------------------
> Worker 0:
> Execution Time: 37028.006 ms
> Worker 2:
> Execution Time: 11355.153 ms
> Worker 4:
> Execution Time: 9273.398 ms
> --------------------------------------------------------
>
> So, this performance degradation issue seems related on the order of the data
> in the source table.
> It does not always happen. Do we need to do some specific fix for it ?

After doing some more tests on it (performance degradation will not happen when source table is out of order).
I think we can say the performance degradation is related to the order of the data in source table.
Also , In master branch, I found the order of data in source table will not affect the planner when generating plan(for select part).

As we can see from [1][2], If source table's data is in order, when set max_parallel_workers_per_gather to 4,
the planner will choose parallel seqscan here but it is actually slower than serial bitmapscan.
If data in source table is out of order, the performance degradation will not happen again[3][4].

So, the order of data 's influence seems a normal phenomenon, I think it seems we do not need to do anything about it (currently).
It seems better to mark it as todo which we can improve this in the future.

(Since the performance degradation in parallel bitmap is because of the lock in _bt_search, It will not always happen
when the target table already have data, less race condition will happened when parallel insert into a evenly distributed btree).

Test result with sql: "postgres=# explain analyze verbose select a from x where a<80000 or (a%2=0 and a>199000000);"

[1]-----------Source table data in order and max_parallel_workers_per_gather = 0-----------
Bitmap Heap Scan on public.x (cost=22999.40..4991850.30 rows=91002 width=4) (actual time=45.445..201.322 rows=579999 loops=1)
Output: a
Recheck Cond: ((x.a < 80000) OR (x.a > 199000000))
Filter: ((x.a < 80000) OR (((x.a % 2) = 0) AND (x.a > 199000000)))
Rows Removed by Filter: 500000
Heap Blocks: exact=5840
-> BitmapOr (cost=22999.40..22999.40 rows=1242768 width=0) (actual time=44.622..44.637 rows=0 loops=1)
-> Bitmap Index Scan on x_a_idx (cost=0.00..1575.70 rows=85217 width=0) (actual time=3.622..3.634 rows=79999 loops=1)
Index Cond: (x.a < 80000)
-> Bitmap Index Scan on x_a_idx (cost=0.00..21378.20 rows=1157551 width=0) (actual time=40.998..40.998 rows=1000000 loops=1)
Index Cond: (x.a > 199000000)
Planning Time: 0.199 ms
Execution Time: 232.327 ms

[2]-----------Source table data in order and max_parallel_workers_per_gather = 4-----------
Gather (cost=1000.00..2091183.08 rows=91002 width=4) (actual time=0.594..4216.197 rows=579999 loops=1)
Output: a
Workers Planned: 4
Workers Launched: 4
-> Parallel Seq Scan on public.x (cost=0.00..2081082.88 rows=22750 width=4) (actual time=0.087..4197.570 rows=116000 loops=5)
Output: a
Filter: ((x.a < 80000) OR (((x.a % 2) = 0) AND (x.a > 199000000)))
Rows Removed by Filter: 39884000
Worker 0: actual time=0.083..4219.087 rows=96574 loops=1
Worker 1: actual time=0.076..4201.339 rows=195354 loops=1
Worker 2: actual time=0.096..4218.637 rows=96474 loops=1
Worker 3: actual time=0.118..4205.825 rows=96847 loops=1
Planning Time: 0.175 ms
Execution Time: 4243.089 ms
(14 rows)

[3]-----------Source table data out of order and max_parallel_workers_per_gather = 0-----------
Bitmap Heap Scan on public.x2 (cost=19815.15..4953678.20 rows=81178 width=4) (actual time=263.640..15653.251 rows=579999 loops=1)
Output: a
Recheck Cond: ((x2.a < 80000) OR (x2.a > 199000000))
Rows Removed by Index Recheck: 115394602
Filter: ((x2.a < 80000) OR (((x2.a % 2) = 0) AND (x2.a > 199000000)))
Rows Removed by Filter: 500000
Heap Blocks: exact=55343 lossy=629270
-> BitmapOr (cost=19815.15..19815.15 rows=1070588 width=0) (actual time=248.701..248.715 rows=0 loops=1)
-> Bitmap Index Scan on x2_a_idx (cost=0.00..1408.13 rows=76208 width=0) (actual time=18.116..18.117 rows=79999 loops=1)
Index Cond: (x2.a < 80000)
-> Bitmap Index Scan on x2_a_idx (cost=0.00..18366.43 rows=994381 width=0) (actual time=230.581..230.582 rows=1000000 loops=1)
Index Cond: (x2.a > 199000000)
Planning Time: 0.530 ms
Execution Time: 15700.488 ms

[4]-----------Source table data out of order and max_parallel_workers_per_gather = 4-----------
Gather (cost=1000.00..2090199.80 rows=81178 width=4) (actual time=0.674..5154.622 rows=579999 loops=1)
Output: a
Workers Planned: 4
Workers Launched: 4
-> Parallel Seq Scan on public.x2 (cost=0.00..2081082.00 rows=20294 width=4) (actual time=0.124..5112.635 rows=116000 loops=5)
Output: a
Filter: ((x2.a < 80000) OR (((x2.a % 2) = 0) AND (x2.a > 199000000)))
Rows Removed by Filter: 39884000
Worker 0: actual time=0.220..5136.219 rows=107646 loops=1
Worker 1: actual time=0.170..5133.046 rows=114824 loops=1
Worker 2: actual time=0.127..5128.256 rows=115010 loops=1
Worker 3: actual time=0.066..5133.022 rows=120061 loops=1
Planning Time: 0.194 ms
Execution Time: 5187.682 ms

Best regards,
houzj

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message tsunakawa.takay@fujitsu.com 2021-02-26 06:34:09 RE: Global snapshots
Previous Message tsunakawa.takay@fujitsu.com 2021-02-26 06:21:15 RE: libpq debug log