RE: Parallel Inserts in CREATE TABLE AS

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Zhihong Yu <zyu(at)yugabyte(dot)com>, Luc Vlaming <luc(at)swarm64(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: RE: Parallel Inserts in CREATE TABLE AS
Date: 2021-05-27 01:42:07
Message-ID: OS0PR01MB571699DD426B43CD2F9A106C94239@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Sent: Wednesday, May 26, 2021 7:22 PM
> Thanks for trying that out.
>
> Please see the code around the use_fsm flag in RelationGetBufferForTuple for
> more understanding of the points below.
>
> What happens if the FSM is skipped, i.e., myState->ti_options =
> TABLE_INSERT_SKIP_FSM;?
> 1) The flag use_fsm will be false in heap_insert->RelationGetBufferForTuple.
> 2) Each worker initially gets a block and keeps inserting into it until it
> is full. When the block is full, the worker doesn't look in the FSM
> (GetPageWithFreeSpace) as use_fsm is false. It directly goes for relation
> extension and tries to acquire the relation extension lock with
> LockRelationForExtension. Note that the bulk extension of blocks with
> RelationAddExtraBlocks is not reached, as use_fsm is false.
> 3) After acquiring the relation extension lock, it adds one new block with
> ReadBufferBI(relation, P_NEW, ...); see the comment "In addition to whatever
> extension we performed above, we always add at least one block to satisfy
> our own request." The tuple is inserted into this new block.
>
> Basically, the workers can't find empty pages among the pages added by
> other workers; they keep doing the above steps in silos (see the sketch
> below).
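>
> To make that concrete, here is a rough sketch of the use_fsm == false path
> in RelationGetBufferForTuple (hio.c). This is paraphrased for illustration,
> not the exact source:
>
>     if (use_fsm)
>         targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
>     else
>         targetBlock = InvalidBlockNumber;   /* FSM skipped entirely */
>
>     /* no reusable page is known, so extend the relation */
>     /* needLock = !RELATION_IS_LOCAL(relation) */
>     if (needLock)
>     {
>         if (!use_fsm)
>             /* plain wait; the bulk-extension path is never reached */
>             LockRelationForExtension(relation, ExclusiveLock);
>         ...
>     }
>
>     /*
>      * In addition to whatever extension we performed above, we always add
>      * at least one block to satisfy our own request.
>      */
>     buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);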
>
> What happens if the FSM is not skipped, i.e., myState->ti_options = 0;?
> 1) The flag use_fsm will be true in heap_insert->RelationGetBufferForTuple.
> 2) Each worker initially gets a block and keeps inserting into it until it
> is full. When the block is full, the worker looks for a page with free
> space in the FSM (GetPageWithFreeSpace) as use_fsm is true.
> 3) If it can't find any page with the required amount of free space, it
> goes for bulk relation extension (RelationAddExtraBlocks) after acquiring
> the relation extension lock with ConditionalLockRelationForExtension. The
> worker then adds extraBlocks = Min(512, lockWaiters * 20); new blocks in
> RelationAddExtraBlocks and immediately updates the bottom level of the FSM
> for each block (see the comment around RecordPageWithFreeSpace for why only
> the bottom level, not the entire FSM tree). After all the blocks are added,
> it updates the entire FSM tree with FreeSpaceMapVacuumRange.
> 4) After the bulk extension, the worker adds one more block (see the
> comment "In addition to whatever extension we performed above, we always
> add at least one block to satisfy our own request.") and inserts the tuple
> into this new block. This path is sketched below.
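>
> Here is a similarly rough sketch of the use_fsm == true path in
> RelationGetBufferForTuple (hio.c), again paraphrased rather than quoted:
>
>     /* first, look for an existing page with enough free space */
>     targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
>
>     if (targetBlock == InvalidBlockNumber && needLock)
>     {
>         if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
>         {
>             /* Couldn't get the lock immediately; wait for it. */
>             LockRelationForExtension(relation, ExclusiveLock);
>
>             /* Did another backend extend the relation while we waited? */
>             targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
>             if (targetBlock != InvalidBlockNumber)
>             {
>                 /* reuse a page added by someone else */
>                 UnlockRelationForExtension(relation, ExclusiveLock);
>                 goto loop;
>             }
>
>             /* Time to bulk-extend. */
>             RelationAddExtraBlocks(relation, bistate);
>         }
>     }
>
>     /* ... and, as above, always add one more block for our own request */
>     buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);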
>
> Basically, the workers can benefit from the bulk extension of the relation,
> and they can always look for empty pages among the pages added by other
> workers. There is a high chance that blocks will be available after a bulk
> extension. Having said that, if the added extra blocks are consumed by the
> workers very fast, i.e., if the tuple sizes are big (very few tuples per
> page), then even the bulk extension can't help much and there will be more
> contention on the relation extension lock. Well, one might think to add
> more blocks at a time, say Min(1024, lockWaiters * 128/256/512), instead of
> the current extraBlocks = Min(512, lockWaiters * 20);. This does work
> (i.e., we then don't see any regression with the parallel-inserts-in-CTAS
> patches), but it can't be a practical solution, because the relation ends
> up with many more pages, many of them with lots of free space. Furthermore,
> future sequential scans on that relation might take a lot of time.
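>
> For reference, the bulk extension itself looks roughly like this (abridged
> from RelationAddExtraBlocks in hio.c; paraphrased, error handling omitted):
>
>     lockWaiters = RelationExtensionLockWaiterCount(relation);
>     if (lockWaiters <= 0)
>         return;
>
>     /* 20 pages per lock waiter, capped at 512 */
>     extraBlocks = Min(512, lockWaiters * 20);
>
>     do
>     {
>         buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
>         page = BufferGetPage(buffer);
>         PageInit(page, BufferGetPageSize(buffer), 0);
>         MarkBufferDirty(buffer);
>         blockNum = BufferGetBlockNumber(buffer);
>         freespace = PageGetHeapFreeSpace(page);
>         UnlockReleaseBuffer(buffer);
>
>         /* cheap per-page update of the bottom FSM level only */
>         RecordPageWithFreeSpace(relation, blockNum, freespace);
>     } while (--extraBlocks > 0);
>
>     /* one pass to fix up the upper FSM levels for the whole new range */
>     FreeSpaceMapVacuumRange(relation, firstBlock, blockNum + 1);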
>
> If myState->ti_options = TABLE_INSERT_SKIP_FSM; is set in only that one
> place (within if (myState->is_parallel)), then it is effective only for the
> leader, i.e., the leader will not look in the FSM, but all the workers
> will, because within if (myState->is_parallel_worker) in intorel_startup,
> myState->ti_options = 0; for the workers, roughly as sketched below.
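>
> In code, the placement being discussed is roughly the following
> (paraphrasing the patch under discussion, not an exact hunk):
>
>     /* in intorel_startup() */
>     if (myState->is_parallel)
>     {
>         /* parallel leader: skip the FSM for its own inserts */
>         myState->ti_options = TABLE_INSERT_SKIP_FSM;
>     }
>     else if (myState->is_parallel_worker)
>     {
>         /* workers: ti_options = 0, so they do consult the FSM */
>         myState->ti_options = 0;
>     }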
>
> I ran tests with the configuration shown at [1] for case 4 (2 bigint (8
> bytes each) columns, 16 name (64 bytes each) columns, tuple size 1064
> bytes, 10 million tuples) with leader participation, where I'm seeing a
> regression:
>
> 1) when myState->ti_options = TABLE_INSERT_SKIP_FSM; for both leader and
> workers, my results (in milliseconds) are as follows:
> 0 workers - 116934.137, 2 workers - 209802.060, 4 workers - 248580.275
> 2) when myState->ti_options = 0; for both leader and workers, my results
> are as follows:
> 0 workers - 116184.718, 2 workers - 139798.055, 4 workers - 143022.409
> I hope the above explanation and the test results clarify that skipping
> the FSM doesn't solve the problem. Let me know if anything is not clear or
> if I'm missing something.

Thanks for the explanation.
I followed your test steps above with the configuration below, but my test results are a little different from yours.
I am not sure of the exact reason; maybe it is because of the hardware.

Test: INSERT 10000000 rows (2 bigint (8 bytes each) and 16 name (64 bytes each) columns):
SERIAL: 22023.631 ms
PARALLEL 2 WORKERS [NOT SKIP FSM]: 21824.934 ms [SKIP FSM]: 19381.474 ms
PARALLEL 4 WORKERS [NOT SKIP FSM]: 20481.117 ms [SKIP FSM]: 18381.305 ms

I am afraid that using the FSM does not seem to give a stable performance gain (at least on my machine).
I will take a deeper look into this to figure out the difference. A naive idea is that the benefit the bulk
extension brings is not much greater than the cost of maintaining the FSM.
Do you have any ideas on it?

My test machine:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2901.005
CPU max MHz: 3200.0000
CPU min MHz: 1000.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K

Best regards,
houzj

> [1] postgresql.conf parameters used:
> shared_buffers = 40GB
> max_worker_processes = 32
> max_parallel_maintenance_workers = 24
> max_parallel_workers = 32
> synchronous_commit = off
> checkpoint_timeout = 1d
> max_wal_size = 24GB
> min_wal_size = 15GB
> autovacuum = off
> port = 5440
>
> System Configuration:
> RAM: 528GB
> Disk Type: SSD
> Disk Size: 1.5TB
> lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 128
> On-line CPU(s) list: 0-127
> Thread(s) per core: 2
> Core(s) per socket: 8
> Socket(s): 8
> NUMA node(s): 8
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 47
> Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
> Stepping: 2
> CPU MHz: 1064.000
> CPU max MHz: 2129.0000
> CPU min MHz: 1064.0000
> BogoMIPS: 4266.62
> Virtualization: VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache: 256K
> L3 cache: 24576K
