Re: Parallel INSERT SELECT take 2

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel INSERT SELECT take 2
Date: 2021-04-27 00:50:43
Message-ID: CALj2ACWSu7uubENxeNPE537fpt3u7x+eTBFHZ3S5udxboaXTwg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 26, 2021 at 4:56 PM houzj(dot)fnst(at)fujitsu(dot)com
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > > The parallel attributes in table means the parallel safety when user does some
> > data-modification operations on it.
> > > So, It only limit the use of parallel plan when using INSERT/UPDATE/DELETE.
> >
> > In that case, isn't it better to use the terminology "PARALLEL DML
> > SAFE/UNSAFE/RESTRICTED" in the code and docs? This way, it will be clear that
> > these tags don't affect parallel selects.
>
> Makes sense, I recalled I have heart similar suggestion before.
> If there are no other objections, I plan to change command to
> PARALLEL DML in my next version patches.

+1 from me. Let's hear from others.

> > Will the planner parse and check parallel safety of index((where
> > clause) expressions in case of SELECTs? I'm not sure of this. But if it does, maybe
> > we could do the same thing for parallel DML as well for normal tables?
>
> The planner does not check the index expression, because the expression will not be used in SELECT.
> I think the expression is only used when a tuple inserted into the index.

Oh.

> > the overhead of parsing index expressions? If the cost is heavy for checking
> > index expressions parallel safety in case of normal tables, then the current
> > design i.e. attributing parallel safety tag to all the tables makes sense.
>
> Currently, index expression and predicate are stored in text format.
> We need to use stringToNode(expression/predicate) to parse it.
> Some committers think doing this twice does not look good,
> unless we found some ways to pass parsed info to the executor to avoid the second parse.

How much is the extra cost that's added if we do stringToNode twiice?
Maybe, we can check for a few sample test cases by just having
stringToNode(expression/predicate) in the planner and see if it adds
much to the total execution time of the query.

> > I was actually thinking that we will have the declarative approach only for
> > partitioned tables as it is the main problem we are trying to solve with this
> > design. Something like: users will run pg_get_parallel_safety to see the parallel
> > unsafe objects associated with a partitioned table by looking at all of its
> > partitions and be able to set a parallel dml safety tag to only partitioned tables.
>
> We originally want to make the declarative approach for both normal and partitioned table.
> In this way, it will not bring any overhead to planner and looks consistency.
> But, do you think we should put some really cheap safety check to the planner ?

I still feel that why we shouldn't limit the declarative approach to
only partitioned tables? And for normal tables, possibly with a
minimal cost(??), the server can do the safety checking. I know this
feels a little inconsistent. In the planner we will have different
paths like: if (partitioned_table) { check the parallel safety tag
associated with the table } else { perform the parallel safety of the
associated objects }.

Others may have better thoughts on this.

> > >0001: provide a utility function pg_get_parallel_safety('table_name').
> > >
> > >The function returns records of (objid, classid, parallel_safety) that
> > >represent the parallel safety of objects that determine the parallel safety of
> > the specified table.
> > >Note: The function only outputs objects that are not parallel safe.
> >
> > If it returns only parallel "unsafe" objects and not "safe" or "restricted" objects,
> > how about naming it to pg_get_table_parallel_unsafe_objects("table_name")?
>
> Currently, the function returns both unsafe and restricted objects(I thought restricted is also not safe),
> because we thought users only care about the objects that affect the use of parallel plan.

Hm.

> > This way we could get rid of parallel_safety in the output record? If at all users
> > want to see parallel restricted or safe objects, we can also have the
> > counterparts pg_get_table_parallel_safe_objects and
> > pg_get_table_parallel_restricted_objects. Of course, we can caution the user
> > that execution of these functions might take longer and "might consume
> > excessive memory while accumulating the output".
> >
> > Otherwise, we can have a single function pg_get_parallel_safety("table_name"
> > IN, "parallel_safety" IN, "objid"
> > OUT, "classid" OUT)? If required, we could name it
> > pg_get_parallel_safety_of_table_objects.
> >
> > Thoughts?
>
> I am sure will user want to get safe objects, do you have some usecases ?
> If users do not care the safe objects, I think they can use
> "SELECT * FROM pg_get_parallel_safety() where parallel_safety = 'specified safety' " to get the specified objects.

I don't know any practical scenarios, but If I'm a user, at least I
will be tempted to see the parallel safe objects associated with a
particular table along with unsafe and restricted ones. Others may
have better thoughts on this.

> > Although, I have not looked at the patches, few questions on
> > pg_get_parallel_safety function:
> > 1) Will it parse all the expressions for the objects that are listed under "The
> > objects that relate to the parallel safety of a DML target table are as follows:" in
> > the upthread?
>
> Yes.
> But some parsed expression(such as domain type's expression) can be found in typecache,
> we just check the safety for these already parsed expression.

Oh.

> > 2) How will it behave if a partitioned table is passed to it? Will it recurse for all
> > the partitions?
>
> Yes, because both parent table and child table will be inserted and the parallel
> related objects on them will be executed. If users want to make sure the parallel insert succeed,
> they need to check all the objects.

Then, running the pg_get_parallel_safety will have some overhead if
there are many partitions associated with a table. And, this is the
overhead planner would have had to incur without the declarative
approach which we are trying to avoid with this design.

> > 3) How will it behave if a foreign table is passed to it? Will it error out?
>
> It currently does not error out.
> It will also check the objects on it and return not safe objects.
> Note: I consider foreign table itself as a parallel restricted object, because it does not support parallel insert fdw api for now.
>
> > 2) How will users have to decide on parallel safety of a foreign table or a
> > partitioned table with foreign partitions? Or is it that we set these tables
> > parallel unsafe and don't do parallel inserts?
>
> Foreign table itself is considered as parallel restricted,
> because we do not support parallel insert fdw api for now.

Maybe, the ALTER TABLE ... SET PARALLEL on a foreign table should
default to parallel restricted always and emit a warning saying the
reason?

> > In general:
> > 1) Is ALTER SET PARALLEL SAFETY on a partitioned table allowed?
>
> Yes.
>
> > If yes, will it be
> > set based on all the partitions parallel safety?
>
> Yes,
> But the ALTER PARALLEL command itself does not check all the partition's safety flag.
> The function pg_get_parallel_safety can return all the partition's not safe objects,
> user should set the parallel safety based the function's result.

I'm thinking that when users say ALTER TABLE partioned_table SET
PARALLEL TO 'safe';, we check all the partitions' and their associated
objects' parallel safety? If all are parallel safe, then only we set
partitioned_table as parallel safe. What should happen if the parallel
safety of any of the associated objects/partitions changes after
setting the partitioned_table safety?

My understanding was that: the command ALTER TABLE ... SET PARALLEL TO
'safe' work will check the parallel safety of all the objects
associated with the table. If the objects are all parallel safe, then
the table will be set to safe. If at least one object is parallel
unsafe or restricted, then the command will fail. I was also thinking
that how will the design cope with situations such as the parallel
safety of any of the associated objects changing after setting the
table to parallel safe. The planner would have relied on the outdated
parallel safety of the table and chosen parallel inserts and the
executor will catch such situations. Looks like my understanding was
wrong.

So, the ALTER TABLE ... SET PARALLEL TO command just sets the target
table safety, doesn't bother what the associated objects' safety is.
It just believes the user. If at all there are any parallel unsafe
objects it will be caught by the executor. Just like, setting parallel
safety of the functions/aggregates, the docs caution users about
accidentally/intentionally tagging parallel unsafe
functions/aggregates as parallel safe.

Note: I meant the table objects are the ones that are listed under
"The objects that relate to the parallel safety of a DML target table
are as follows:" in the upthread.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2021-04-27 00:50:57 Re: Enhanced error message to include hint messages for redundant options error
Previous Message David Fetter 2021-04-27 00:24:34 Bug fix for tab completion of ALTER TABLE ... VALIDATE CONSTRAINT ...