RE: Parallel INSERT SELECT take 2

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>, Greg Nancarrow <gregn4422(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: RE: Parallel INSERT SELECT take 2
Date: 2026-05-11 07:21:14
Message-ID: TY4PR01MB17718A4DE63020A9EA5E9CB6594382@TY4PR01MB17718.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Saturday, May 9, 2026 5:29 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> On 5/9/26 08:31, Zhijie Hou (Fujitsu) wrote:
> > On Saturday, May 9, 2026 7:59 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >>
> >> declarative approach
> >> --------------------
> >>
> >> The new declarative approach is inspired by Tom's post [3] - there's a new
> >> pg_class attribute, determining if a given target relation (including all its
> >> child partitions) is assumed to be "safe" (i.e. at least restricted).
> >>
> >> I like that this is "caching" the safety information in pg_class (which means we
> >> don't have to do the expensive stuff even for the first insert in a connection).
> >> That seems like a good approach.
> >>
> >> But I very much dislike that we're shifting all the responsibility to the
> >> user/developer. I have very little faith people will understand what is required
> >> for safety, and I believe we'd get plenty of bogus bug reports due to this.
> >>
> >> I 100% agree with Robert [4] that labeling functions as parallel-safe is not the
> >> same as labeling target relations. We can't determine safety for functions
> >> (halting theorem and all that). We can determine safety for target relations,
> >> we just choose not to because of overhead.
> >>
> >> But why can't we keep the pg_class updated automatically? I mean, every
> >> time a parallel safety of a relation changes (e.g. because a trigger is
> >> added/removed, a default expression changes, ...), we could recalculate the
> >> pg_class flag, and then update all the parent relations too.
> >>
> >> If we did that, then we'd not need the manual declarations at all, I think. Sure,
> >> determining the safety may not be cheap, but so what? It should be only a
> >> very rare operation. If you do ALTER TABLE often enough for this to matter,
> >> you have other problems, I guess.
> >>
> >> Yes, there may be other problems, e.g. there may be deadlock hazards when
> >> updating the parent relations, as pointed out by Robert [5]. But that ought to
> >> be solvable, e.g. by figuring out which relations to lock and then locking them
> >> from the top (and retrying if needed).
> >>
> >> There are also suggestions there might be race conditions, e.g. [3]. I haven't
> >> thought about this very much, but I don't quite see why this would be any
> >> different from other ALTER TABLE commands. Maybe I'm missing something
> >> obvious ...
> >>
> >> I'm not sure how difficult it is to figure out all possibly affected relations,
> >> when e.g. a marking on a function changes. But these claims [6] seem rather
> >> hand-wavy, and I'd like to see some clear explanation why this is not
> >> practical. Maybe it'd require improving some of the dependency tracking, or
> >> whatever. But so what? New features often require that kind of stuff.
> >
> > From my personal understanding, the question isn't practicality, but rather how
> > to justify the added complexity and the locking strategy change.
> >
> > For partitioned tables, if we want to automatically adjust
> > pg_class.parallel_safety, we would need to do the following. For example, when
> > executing ALTER FUNCTION ... PARALLEL SAFE or CREATE OR REPLACE FUNCTION, we
> > would first need to locate all tables that have any triggers, index expressions,
> > or other elements that could affect parallel safety and that use the altered
> > function. Then, we would need to find all parent tables of those tables and
> > update pg_class.parallel_safety accordingly. To make this concurrency-safe, we
> > would likely need to take locks on the affected tables.
> >
> > I don't think this is impossible to implement, but I'm not entirely sure whether
> > it would be acceptable to perform these operations during a FUNCTION DDL, as it
> > currently does not take locks on user tables. While I understand that
> > dependencies exist between tables and functions, it might not be immediately
> > intuitive to users that altering a function could lock user tables.
> > Additionally, this approach would require developers to carefully identify all
> > tables that use the changed function, which could be error-prone. I haven't
> > analyzed this deeply, so I cannot say for certain whether it would be easy or
> > not.
> >
> > Similarly, for ALTER TABLE that changes certain expressions, we currently do not
> > take locks on the parent tables of the one being altered. However, we would need
> > to do so if we want to maintain parallel safety. Please forgive me if I've missed
> > something.
> >
>
> I don't see why this would be a problem, honestly. I mean, yes, it may
> not be cheap, both in terms of new code and locks, but so what? The
> question is if the feature is worth this extra cost.
>
> Also, we already do a lot of this (or similar) stuff in other contexts.
> For example, DROP FUNCTION needs to look for relations referencing the
> function in some way, etc. In fact, the helper functions in 0003 do most
> of the searching for affected objects, no?
>
> Yes, it may require more locks on parent relations. But again, that
> seems acceptable to me. Changing the labels should be an extremely rare
> operation in practice, and if documented properly it's no different from
> any other ALTER TABLE. Sure, it'd be great if we could do without
> it, but if we can't, it's a cost of having the feature.
>
> I suspect we can do various optimizations. For example, we don't need to
> lock the parent tables if the parallel safety does not change, or if
> the pg_class flag already matches the result.
>
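
To make the discussed update procedure concrete, here is a toy Python model of
the propagation (all names here are hypothetical; the real implementation would
be C code using the server's lock manager and catalogs). It locks the affected
relations in sorted-OID order so that concurrent DDL cannot deadlock, treats a
parent as safe only when it and all of its children are safe, and stops
updating as soon as a cached flag already matches:

```python
import threading
from collections import defaultdict

class Catalog:
    """Toy in-memory stand-in for pg_class plus partition parent links."""
    def __init__(self):
        self.parent = {}      # child oid -> parent oid
        self.own_safe = {}    # safety of the relation's own triggers/exprs
        self.flag = {}        # cached "pg_class.parallel_safety" flag
        self.locks = defaultdict(threading.Lock)

    def add_rel(self, oid, safe=True, parent=None):
        self.own_safe[oid] = safe
        self.flag[oid] = safe
        if parent is not None:
            self.parent[oid] = parent

    def ancestors(self, oid):
        out = []
        while oid in self.parent:
            oid = self.parent[oid]
            out.append(oid)
        return out

    def children(self, oid):
        return [c for c, p in self.parent.items() if p == oid]

    def update_flag(self, oid, value):
        changed = self.flag[oid] != value
        self.flag[oid] = value
        return changed

def recompute_and_propagate(cat, rel_oid):
    """Recompute rel_oid's cached safety flag and push changes upward."""
    affected = [rel_oid] + cat.ancestors(rel_oid)
    # Deterministic (sorted-OID) lock order rules out deadlocks between
    # concurrent DDL touching overlapping parts of the hierarchy.
    for oid in sorted(affected):
        cat.locks[oid].acquire()
    try:
        changed = cat.update_flag(rel_oid, cat.own_safe[rel_oid])
        for parent in cat.ancestors(rel_oid):
            if not changed:
                break  # cached flag already matched; ancestors unaffected
            # A parent is safe only if it and every child are safe.
            value = cat.own_safe[parent] and all(
                cat.flag[c] for c in cat.children(parent))
            changed = cat.update_flag(parent, value)
    finally:
        for oid in sorted(affected, reverse=True):
            cat.locks[oid].release()
```

For example, marking a leaf partition unsafe (say, because an unsafe trigger
was added) flips the cached flag all the way up to the root, while a change
whose flag already matches stops immediately; a real implementation could go
further and avoid locking the ancestors entirely in that case.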

After thinking back and reviewing the old discussion, I recalled another
locking-related issue with automatic pg_class.parallel_safety adjustment: the
visibility of tables associated with a function. (You may have noticed this when
reading that thread as well - I'm just sharing it here for reference.)

For example, consider creating a new table that uses a function in an index
expression (or altering a table to add such an expression). If a user concurrently
alters the function used in that expression, the ALTER FUNCTION command cannot
see the newly created table because the table hasn't been committed yet. As a
result, we cannot ensure that pg_class.parallel_safety is updated in a
concurrency-safe manner. The root cause is that we don't hold a lock on the
function while creating or altering a table.

I think this might be solvable if we took a new heavyweight lock on the function
during every DDL that could affect parallel DML safety. We would likely need to
acquire that lock in a number of DDL commands - CREATE TABLE, ALTER TABLE,
CREATE INDEX, ATTACH PARTITION, and in general any command that adds expressions
that could be evaluated during DML.
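
As a toy illustration of why the per-function lock would help, here is a small
Python model (entirely hypothetical; a threading.Lock stands in for the
heavyweight lock, and a set of "committed" dependencies stands in for the
catalog). Because both the table DDL and ALTER FUNCTION take the same lock, any
concurrent interleaving behaves like one of the serial orders, so the cached
flags can never miss a not-yet-visible table:

```python
import threading

class FunctionLockModel:
    """Toy model: serialize table DDL against ALTER FUNCTION via a
    per-function heavyweight lock (a gross simplification of the idea)."""
    def __init__(self):
        self.func_lock = threading.Lock()  # stands in for the function lock
        self.func_safe = True              # function's parallel-safety marking
        self.deps = set()                  # tables whose exprs use the function
        self.flags = {}                    # cached per-table safety flags

    def create_table_using_func(self, table):
        # The table DDL holds the function lock until its dependency is
        # registered, so a concurrent ALTER FUNCTION cannot scan the
        # catalog while the new dependency is still invisible.
        with self.func_lock:
            self.deps.add(table)
            self.flags[table] = self.func_safe

    def alter_function(self, safe):
        # ALTER FUNCTION takes the same lock: it runs either before the
        # table exists or after the dependency is visible, never between.
        with self.func_lock:
            self.func_safe = safe
            for table in self.deps:
                self.flags[table] = safe
```

In the real system the cost, of course, is that altering a function now blocks
on (and blocks) table DDL, which is exactly the intuitiveness concern above.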

While it is technically possible, the complexity gave me the impression that it
would be hard to get everything correct, so the design wasn't very attractive to
us at the time. Having said that, I am not opposed to this approach; I just want
to share all the tricky parts so that we can evaluate whether the extra logic is
worth the effort.

(While considering this, I wondered what would happen if we drop a function
concurrently with the creation of a table that references it in a column
default. Surprisingly, the table creation succeeds, but later reading the column
default fails with: ERROR: cache lookup failed for function 16388. This seems
like an existing concurrency safety issue for functions referenced in table
definitions.)

>
> >> alternative idea
> >> ----------------
> >>
> >> What if we took a very different approach, and just made sure the INSERT
> >> part never runs concurrently with the SELECT? Say we fully materialize the
> >> SELECT result (e.g. write it into a tuplestore), finish the parallel mode, and
> >> only then do the INSERT?
> >>
> >> That's pretty much what existing workarounds do - they do the SELECT, stash
> >> the result somewhere (either outside the database or in a CTAS), and then do
> >> a separate INSERT into the table.
> >>
> >> AFAIK that wouldn't have any of the issues mentioned here. Yes, it may
> >> require writing the data into a temporary file, but that seems like something
> >> we could cost.
> >>
> >> I haven't even tried to implement this, but it seems workable. Or am I
> >> missing something?
> >
> > I haven't tried this approach before, so I haven't analyzed its feasibility, but
> > it does seem feasible.
> >
> > That said, one concern is that the natural next step beyond parallel SELECT for
> > INSERT is to eventually support true parallel INSERT. If we hope to reach that
> > point in the future, we will still need to address the parallel safety issue
> > regardless. Therefore, if solving the safety issue is necessary anyway, it might
> > be better to tackle it first. As I understand it, saving changes to a temporary
> > file is less efficient than inserting tuples concurrently with parallel SELECT.
> >
>
> I disagree. To me, this seems like two entirely distinct features:
>
> (a) INSERT + PARALLEL SELECT
> (b) PARALLEL INSERT + SELECT
>
> and forcing (a) to implement stuff needed only by (b) feels wrong.
>
> I'd be more concerned about solving the parallel safety for (b), if the
> approach for (a) was making it harder. But I don't think that happens
> here at all. Spilling the SELECT result into a file merely makes it look
> as if it was executed without parallelism. It does not affect the INSERT
> phase at all - it will have to check the safety exactly the same.

I was wondering whether the new code that saves tuples to a file and feeds them
to the leader would still be useful after (b) is supported. At first, I thought
that once (b) is supported, we might be able to support (a) more easily by
reusing its safety check logic, and the tuple save/read code might become dead
code. However, after thinking more about it, the tuple spooling might still be
needed for cases where the target table is parallel DML unsafe - even though
such use cases may be relatively rare. So, I don't oppose the alternative idea;
I think I was simply wary of adding a bunch of code that might eventually be
removed.
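
For reference, here is how I picture the spool-then-insert flow, as a toy
Python sketch (purely illustrative: a CSV temp file stands in for the
tuplestore, a list append stands in for the INSERT, and the function names are
made up). The point is that all writes to the target table happen only after
the "parallel" phase has fully materialized its result:

```python
import csv
import tempfile

def parallel_select():
    """Stand-in for the parallel SELECT; yields result rows."""
    for i in range(5):
        yield (i, i * i)

def spool_then_insert(select_rows, insert_row):
    """Materialize the SELECT result into a spool file, then run the
    INSERT side serially over the spooled tuples."""
    inserted = 0
    with tempfile.TemporaryFile(mode="w+", newline="") as spool:
        writer = csv.writer(spool)
        for row in select_rows:   # "parallel" phase: reads only, no inserts
            writer.writerow(row)
        # Parallel mode would end here; the INSERT below runs in the
        # leader only, so parallel-unsafe triggers/defaults are no problem.
        spool.seek(0)
        for row in csv.reader(spool):
            insert_row(tuple(int(v) for v in row))
            inserted += 1
    return inserted

table = []  # stand-in for the target relation
n = spool_then_insert(parallel_select(), table.append)
```

The extra write/read pass over the spool file is the cost Tomas mentions that
the planner could account for.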

Best Regards,
Hou zj
