Re: Parallel copy

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-10-23 13:28:04
Message-ID: CAE9k0PnXBAvJ8wAEuUs7DBZ5-5ebZQMVzmAjKn1bw4Kyfqv2Fw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 23, 2020 at 5:42 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Hi Vignesh,
>
> Thanks for the updated patches. Here are some more comments that I can
> find after reviewing your latest patches:
>
> +/*
> + * This structure helps in storing the common data from CopyStateData that are
> + * required by the workers. This information will then be allocated and stored
> + * into the DSM for the worker to retrieve and copy it to CopyStateData.
> + */
> +typedef struct SerializedParallelCopyState
> +{
> + /* low-level state data */
> + CopyDest copy_dest; /* type of copy source/destination */
> + int file_encoding; /* file or remote side's character encoding */
> + bool need_transcoding; /* file encoding diff from server? */
> + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */
> +
> ...
> ...
> +
> + /* Working state for COPY FROM */
> + AttrNumber num_defaults;
> + Oid relid;
> +} SerializedParallelCopyState;
>
> Can the above structure not be part of the CopyStateData structure? I
> am just asking this question because all the fields present in the
> above structure are also present in the CopyStateData structure. So,
> including it in the CopyStateData structure will reduce the code
> duplication and will also make CopyStateData a bit shorter.
>
> --
>
> + pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
> + relid);
>
> Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
> function when we are already passing cstate structure, using which
> both of these information can be retrieved ?
>
> --
>
> +/* DSM keys for parallel copy. */
> +#define PARALLEL_COPY_KEY_SHARED_INFO 1
> +#define PARALLEL_COPY_KEY_CSTATE 2
> +#define PARALLEL_COPY_WAL_USAGE 3
> +#define PARALLEL_COPY_BUFFER_USAGE 4
>
> DSM key names do not appear to be consistent. For shared info and
> cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
> but for WalUsage and BufferUsage structures, it is prefixed with
> "PARALLEL_COPY". I think it would be better to make them consistent.
>
> --
>
> if (resultRelInfo->ri_TrigDesc != NULL &&
> (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
> resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
> {
> /*
> * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
> * triggers on the table. Such triggers might query the table we're
> * inserting into and act differently if the tuples that have already
> * been processed and prepared for insertion are not there.
> */
> insertMethod = CIM_SINGLE;
> }
> else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
> resultRelInfo->ri_TrigDesc->trig_insert_new_table)
> {
> /*
> * For partitioned tables we can't support multi-inserts when there
> * are any statement level insert triggers. It might be possible to
> * allow partitioned tables with such triggers in the future, but for
> * now, CopyMultiInsertInfoFlush expects that any before row insert
> * and statement level insert triggers are on the same relation.
> */
> insertMethod = CIM_SINGLE;
> }
> else if (resultRelInfo->ri_FdwRoutine != NULL ||
> cstate->volatile_defexprs)
> {
> ...
> ...
>
> I think, if possible, all these if-else checks in CopyFrom() can be
> moved to a single function which can probably be named as
> IdentifyCopyInsertMethod() and this function can be called in
> IsParallelCopyAllowed(). This will ensure that in case of Parallel
> Copy when the leader has performed all these checks, the worker won't
> do it again. I also feel that it will make the code look a bit
> cleaner.
>

Just rewriting above comment to make it a bit more clear:

I think, if possible, all these if-else checks in CopyFrom() should be
moved to a separate function which can probably be named as
IdentifyCopyInsertMethod() and this function called from
IsParallelCopyAllowed() and CopyFrom() functions. It will only be
called from CopyFrom() when IsParallelCopy() returns false. This will
ensure that in case of Parallel Copy if the leader has performed all
these checks, the worker won't do it again. I also feel that having a
separate function containing all these checks will make the code look
a bit cleaner.

> --
>
> +void
> +ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
> +{
> ...
> ...
> + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> + &walusage[ParallelWorkerNumber]);
> +
> + MemoryContextSwitchTo(oldcontext);
> + pfree(cstate);
> + return;
> +}
>
> It seems like you also need to delete the memory context
> (cstate->copycontext) here.
>
> --
>
> +void
> +ExecBeforeStmtTrigger(CopyState cstate)
> +{
> + EState *estate = CreateExecutorState();
> + ResultRelInfo *resultRelInfo;
>
> This function has a lot of comments which have been copied as it is
> from the CopyFrom function, I think it would be good to remove those
> comments from here and mention that this code changes done in this
> function has been taken from the CopyFrom function. If any queries
> people may refer to the CopyFrom function. This will again avoid the
> unnecessary code in the patch.
>
> --
>
> As Heikki rightly pointed out in his previous email, we need some high
> level description of how Parallel Copy works somewhere in
> copyparallel.c file. For reference, please see how a brief description
> about parallel vacuum has been added in the vacuumlazy.c file.
>
> * Lazy vacuum supports parallel execution with parallel worker processes. In
> * a parallel vacuum, we perform both index vacuum and index cleanup with
> * parallel worker processes. Individual indexes are processed by one vacuum
> ...
> ...
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com
>
>
> On Wed, Oct 21, 2020 at 12:08 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Mon, Oct 19, 2020 at 2:40 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com> wrote:
> > > >
> > > > Hi Vignesh,
> > > >
> > > > After having a look over the patch,
> > > > I have some suggestions for
> > > > 0003-Allow-copy-from-command-to-process-data-from-file.patch.
> > > >
> > > > 1.
> > > >
> > > > +static uint32
> > > > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
> > > > + char **whereClauseStr, char **rangeTableStr,
> > > > + char **attnameListStr, char **notnullListStr,
> > > > + char **nullListStr, char **convertListStr)
> > > > +{
> > > > + uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
> > > > +
> > > > + strsize += EstimateStringSize(cstate->null_print);
> > > > + strsize += EstimateStringSize(cstate->delim);
> > > > + strsize += EstimateStringSize(cstate->quote);
> > > > + strsize += EstimateStringSize(cstate->escape);
> > > >
> > > >
> > > > It use function EstimateStringSize to get the strlen of null_print, delim, quote and escape.
> > > > But the length of null_print seems has been stored in null_print_len.
> > > > And delim/quote/escape must be 1 byte, so I think call strlen again seems unnecessary.
> > > >
> > > > How about " strsize += sizeof(uint32) + cstate->null_print_len + 1"
> > > >
> > >
> > > +1. This seems like a good suggestion but add comments for
> > > delim/quote/escape to indicate that we are considering one-byte for
> > > each. I think this will obviate the need of function
> > > EstimateStringSize. Another thing in this regard is that we normally
> > > use add_size function to compute the size but I don't see that being
> > > used in this and nearby computation. That helps us to detect overflow
> > > of addition if any.
> > >
> > > EstimateCstateSize()
> > > {
> > > ..
> > > +
> > > + strsize++;
> > > ..
> > > }
> > >
> > > Why do we need this additional one-byte increment? Does it make sense
> > > to add a small comment for the same?
> > >
> >
> > Changed it to handle null_print, delim, quote & escape accordingly in
> > the attached patch, the one byte increment is not required, I have
> > removed it.
> >
> > Regards,
> > Vignesh
> > EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-10-23 13:33:10 Re: git clone failed in windows
Previous Message Ashutosh Bapat 2020-10-23 12:55:57 Re: Enumize logical replication message actions