Re: Parallel Inserts in CREATE TABLE AS

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Tang, Haiying" <tanghy(dot)fnst(at)cn(dot)fujitsu(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Zhihong Yu <zyu(at)yugabyte(dot)com>, Luc Vlaming <luc(at)swarm64(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>
Subject: Re: Parallel Inserts in CREATE TABLE AS
Date: 2021-05-21 10:49:29
Message-ID: CALj2ACUyzFO_VPbrAjJeAf09pdAOmc1+ED33MQ3o4QfyfNzADw@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 21, 2021 at 3:46 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Mar 19, 2021 at 11:02 AM Bharath Rupireddy
> <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> >
> > On Wed, Jan 27, 2021 at 1:47 PM Bharath Rupireddy
> > <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> > >
> >
> > I analyzed the performance of parallel inserts in CTAS for different
> > cases with tuple sizes of 32 bytes, 59 bytes, 241 bytes and 1064
> > bytes. We gain when the tuple sizes are smaller, but with a larger
> > tuple size, i.e. 1064 bytes, there's a regression with parallel
> > inserts. Upon further analysis, it turned out that the parallel
> > workers frequently require extra blocks while concurrently extending
> > the relation (in RelationAddExtraBlocks), and the majority of the
> > time is spent flushing those new empty pages/blocks to disk.
> >
>
> How have you ensured that the cost is due to the flushing of pages?
> AFAICS, we don't flush the pages, we just write them and then register
> them to be flushed by the checkpointer. Now, it is possible that the
> checkpointer's sync queue gets full and the backend has to do the sync
> by itself, but have we checked that? I think we can check via wait
> events: if it is due to flushing, then we should see a lot of file sync
> (WAIT_EVENT_DATA_FILE_SYNC) wait events. The other possibility could be
> that the free pages added to the FSM by one worker are not being used
> by another worker for some reason. Can we debug and check whether the
> pages added by one worker are being used by another worker?

Thanks! I will work on the above points sometime later.
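
Roughly, the checks could look like the below (just a sketch:
'DataFileSync' is the pg_stat_activity name for
WAIT_EVENT_DATA_FILE_SYNC, and the relation name is only a placeholder):

-- From another session while the parallel CTAS is running, sample the
-- wait events of the leader and its parallel workers. A lot of
-- DataFileSync (wait_event_type 'IO') would mean the backends are
-- doing the file syncs themselves.
SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type IN ('client backend', 'parallel worker')
  AND wait_event IS NOT NULL;

-- After the run, check whether the FSM still holds free space that the
-- other workers never reused (needs the pg_freespacemap extension).
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
SELECT count(*) AS blocks_with_free_space
FROM pg_freespace('ctas_target') WHERE avail > 0;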

BTW, one point I forgot to mention earlier: we see a benefit even
without parallelism if multi inserts are used for CTAS instead of
single inserts. See [2] for the test results. I used the "New Table
Access Methods for Multi and Single Inserts" patches from [1] for this
testing. I think it's a good idea to revisit that work.

[1] - https://www.postgresql.org/message-id/CALj2ACXdrOmB6Na9amHWZHKvRT3Z0nwTRsCwoMT-npOBtmXLXg%40mail.gmail.com
[2]
case 1 - 2 integer (4 bytes each) columns, tuple size 32 bytes, 100
million tuples
on master - 130 sec
on master with multi inserts - 105 sec, gain 1.23X
on parallel CTAS patch without multi inserts - (2 workers: 82 sec,
1.58X), (4 workers: 83 sec, 1.56X)
on parallel CTAS patch with multi inserts - (2 workers: 45 sec, 2.33X
over master with multi inserts, 2.88X over master), (4 workers: 33 sec,
3.18X over master with multi inserts, 3.9X over master)

case 2 - 2 integer (4 bytes each) columns, 3 varchar(8) columns, tuple
size 59 bytes, 100 million tuples
on master - 185 sec
on master with multi inserts - 121 sec, gain 1.52X
on parallel CTAS patch without multi inserts - (2 workers: 120 sec,
1.54X), (4 workers: 123 sec, 1.5X)
on parallel CTAS patch with multi inserts - (2 workers: 68 sec, 1.77X
over master with multi inserts, 2.72X over master), (4 workers: 61 sec,
1.98X over master with multi inserts, 3.03X over master)

The above two cases, with small tuple sizes (32 and 59 bytes), are the
best cases: parallel CTAS + multi inserts gives up to 3.9X and 3.03X
benefits respectively.
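
For reference, the case 1 setup corresponds to something like the below
(a sketch; the exact generator query and the table/column names are not
in this mail, so they are only illustrative):

-- Case 1 style schema: 2 integer columns, ~32-byte tuples, 100 million
-- rows. Table and column names are illustrative.
CREATE TABLE source_tab (c1 integer, c2 integer);
INSERT INTO source_tab SELECT i, i FROM generate_series(1, 100000000) i;

-- The timed statement: CTAS reading the 100 million row source table.
CREATE TABLE ctas_tab AS SELECT * FROM source_tab;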

case 3 - 2 bigint (8 bytes each) columns, 3 name (64 bytes each)
columns, 1 varchar(8) column, tuple size 241 bytes, 100 million tuples
on master - 367 sec
on master with multi inserts - 291 sec, gain 1.26X
on parallel CTAS patch without multi inserts - (2 workers: 334 sec,
1.09X), (4 workers: 336 sec, 1.09X)
on parallel CTAS patch with multi inserts - (2 workers: 284 sec, 1.02X
over master with multi inserts, 1.29X over master), (4 workers: 278 sec,
1.04X over master with multi inserts, 1.32X over master)

In the above case, with a tuple size of 241 bytes, we don't gain much.

case 4 - 2 bigint (8 bytes each) columns, 16 name (64 bytes each)
columns, tuple size 1064 bytes, 10 million tuples
on master - 120 sec
on master with multi inserts - 115 sec, gain 1.04X
on parallel CTAS patch without multi inserts - (2 workers: 140 sec,
0.85X), (4 workers: 142 sec, 0.84X)
on parallel CTAS patch with multi inserts - (2 workers: 133 sec, 0.86X
over master with multi inserts, 0.9X over master), (4 workers: 134 sec,
0.85X over master with multi inserts, 0.89X over master)

In the above case, with a tuple size of 1064 bytes, we gain very little
with multi inserts, and parallel inserts cause a regression.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
