| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com> |
| Cc: | PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Subject: | Re: Parallel INSERT SELECT take 2 |
| Date: | 2026-05-11 11:44:13 |
| Message-ID: | 1022baf6-b2f9-4323-a84e-7aacc6fb6c2d@vondra.me |
| Lists: | pgsql-hackers |
Hi,
On 5/11/26 09:21, Zhijie Hou (Fujitsu) wrote:
> On Saturday, May 9, 2026 5:29 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> ...
>> I suspect we can do various optimizations. For example, we don't need to
>> lock the parent tables if the parallel safety does not change, or if
>> pg_class already matches the result.
>>
>
> After searching my memory and reviewing the old discussion, I recall another
> locking-related issue for automatic pg_class.parallel_safety adjustment: the
> visibility of tables associated with a function. (You may have noticed this when
> reading that thread as well - I'm just sharing it here for reference.)
>
> For example, consider creating a new table that uses a function in an index
> expression (or altering a table to add a new expression or function). If a user tries
> to alter the function used in that expression concurrently, the ALTER FUNCTION
> command cannot see the newly created table because it hasn't been committed yet.
> As a result, we cannot ensure that pg_class.parallel_safety is updated in a
> concurrency-safe manner. The cause is that we don't hold a lock on the
> function when creating or altering a table.
>
> I think this might be solvable if we take a new heavyweight lock on the function
> during all DDLs that could affect parallel DML safety. We would likely
> need to acquire that lock for a few DDL commands, including: CREATE TABLE,
> ALTER TABLE, CREATE INDEX, and ATTACH PARTITION - any command where
> expressions are added that could be evaluated during DML.
>
> While it is technically possible, the complexity gave me the impression that it
> would be hard to get everything correct, and as a result, the design
> wasn't very attractive to us at that time. Having said that, I am not opposed to
> this approach, I just want to share all the tricky parts so that we can
> evaluate whether the extra logic is worth the effort.
>
Yes, problems like this may be tricky.
But I don't see a problem with requiring an exclusive lock on a function
when changing its parallel safety (and updating the pg_class attribute
for all relations that use it). Yes, it's not great,
it'd be nice to do it with weaker locks, but if we can't ... sorry.
I really don't want to be inventing overly-complicated solutions for a
problem that almost never happens in production. I mean, how often do we
expect people to adjust this? When deploying new schema, which already
gets a bunch of strong locks anyway?
Similarly for creation of new tables. That's not something that happens
in every transaction. Maybe there are use cases where tables are created
all the time? One of my previous employers did exactly that, using
tables to "cache" intermediate results, but even then it wouldn't matter,
because those tables did not have defaults, were not partitioned, etc.
From a different angle, maybe we could also prohibit changing the flag
after creation? That means you'd set it in CREATE FUNCTION, and then
it'd stay the same. This way we wouldn't have issues with invisible tables,
I think. Of course, this cure may be worse than the disease, because
then "correcting" an incorrect parallel safety requires creating a new
function, doing ALTER TABLE on all affected relations, etc. Which will
take exclusive locks too.
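To illustrate, that "cure" would look something like this (a sketch with hypothetical object names, just spelling out the steps described above):

```sql
-- Hypothetical sketch: with the flag immutable after CREATE FUNCTION,
-- correcting a wrongly-marked function means replacing it entirely.
CREATE FUNCTION f_fixed(int) RETURNS int
    LANGUAGE sql PARALLEL SAFE
    AS 'SELECT $1 + 1';

-- Repoint every affected relation; SET DEFAULT takes a strong lock on t.
ALTER TABLE t ALTER COLUMN c SET DEFAULT f_fixed(0);

DROP FUNCTION f_broken(int);
```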
> (While considering this, I wondered what would happen if we drop a function
> concurrently with the creation of a table that references it in a column
> default. Surprisingly, the table creation succeeds, but later reading the column
> default fails with: ERROR: cache lookup failed for function 16388. This seems
> like an existing concurrency safety issue for functions referenced in table
> definitions.)
>
This walks and quacks like a bug. I wouldn't be surprised it was a
symptom of an insufficient locking somewhere.
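For reference, a two-session sketch of the race described above (hypothetical names; session markers are comments):

```sql
-- Session 1: create a table whose column default references f()
BEGIN;
CREATE TABLE t (a int DEFAULT f());   -- no lock taken on f()

-- Session 2: concurrently drop the function; nothing blocks it,
-- since session 1's pg_depend entry is not yet visible
DROP FUNCTION f();

-- Session 1:
COMMIT;                               -- succeeds

-- Later, reading the default fails, e.g.:
-- INSERT INTO t DEFAULT VALUES;
-- ERROR:  cache lookup failed for function 16388
```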
>>
>>>> alternative idea
>>>> ----------------
>>>>
>>>> What if we took a very different approach, and just made sure the INSERT
>>>> part never runs concurrently with the SELECT? Say we fully materialize the
>>>> SELECT result (e.g. write it into a tuplestore), finish the parallel mode, and
>>>> only then do the INSERT?
>>>>
>>>> That's pretty much what existing workarounds do - they do the SELECT,
>>>> stash the result somewhere (either outside the database or in a CTAS),
>>>> and then do a separate INSERT into the table.
>>>>
>>>> AFAIK that wouldn't have any of the issues mentioned here. Yes, it may
>>>> require writing the data into a temporary file, but that seems like
>>>> something we could cost.
>>>>
>>>> I haven't even tried to implement this, but it seems workable. Or am I
>>>> missing something?
>>>
>>> I haven't tried this approach before, so I haven't analyzed its feasibility, but
>>> it does seem feasible.
>>>
>>> That said, one concern is that the natural next step beyond parallel
>>> SELECT for INSERT is to eventually support true parallel INSERT. If we
>>> hope to reach that point in the future, we will still need to address
>>> the parallel safety issue regardless. Therefore, if solving the safety
>>> issue is necessary anyway, it might be better to tackle it first. As I
>>> understand it, saving changes to a temporary file is less efficient
>>> than inserting tuples concurrently with parallel SELECT.
>>>
>>
>> I disagree. To me, this seems like two entirely distinct features:
>>
>> (a) INSERT + PARALLEL SELECT
>> (b) PARALLEL INSERT + SELECT
>>
>> and forcing (a) to implement stuff needed only by (b) feels wrong.
>>
>> I'd be more concerned about solving the parallel safety for (b), if the
>> approach for (a) was making it harder. But I don't think that happens
>> here at all. Spilling the SELECT result into a file merely makes it look
>> as if it was executed without parallelism. It does not affect the INSERT
>> phase at all - it will have to check the safety exactly the same.
>
> I was wondering whether the new code that saves tuples to a file and feeds them
> to the leader would still be useful after (b) is supported. At first, I
> thought that after supporting (b), we might be able to support (a) more easily
> by reusing the safety check logic, and the tuple save/read code might become
> dead. However, after thinking more about it, the tuple spooling might still be
> needed for cases where the target table is parallel DML unsafe - even though
> such use cases may be relatively small. So, I don't oppose the alternative idea.
> I think I was simply sensitive about adding a bunch of code that might
> eventually be removed in the end.
>
Not sure. I think those are good questions; I don't have clear answers
to them. To me it seems like two separate features - parallelizing the
SELECT vs. parallelizing the INSERT itself, and I don't immediately see
how solving (b) would make (a) solved too.
For example, I think it's pretty clear there will be cases
where we can't do (b) because the default expressions are not parallel
safe, or whatever. For instance, nextval() is marked as unsafe, which
likely means any table with a sequence won't allow a parallel insert.
Allowing at least the SELECT to be parallel seems helpful ...
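The nextval() case is easy to check - its parallel-safety marking is right
there in pg_proc:

```sql
-- nextval() is marked parallel unsafe in the catalog, so e.g. a serial
-- column (whose default calls nextval) rules out a parallel INSERT,
-- while the SELECT side could still be parallelized.
SELECT proname, proparallel
FROM pg_proc
WHERE proname = 'nextval';
-- proparallel = 'u' (unsafe)
```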
Of course, I don't know if spilling the results into a tuplestore will
solve the problems. That's just a wild idea from last week. But I did
have a bit of time over the weekend, so I cobbled together a quick PoC:
https://github.com/tvondra/postgres/commits/insert-select-parallel-batch/
It allows the SELECT to be parallel, and then injects a "materialize"
node on top of the subquery if needed. It kinda works, and is good enough
for experiments / discussion. I won't have much time to work on this over
the next couple of weeks, so a couple of comments / questions I have:
a) The new "FullMaterialize" node may be unnecessary - I imagine we could
adjust "Materialize" to work for this use case too. That'd reduce the
amount of new code quite a bit. But it's good enough for a PoC.
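For illustration, the plan shape I'd expect from the PoC is something like
this (a sketch, not actual EXPLAIN output):

```
Insert on target
  ->  FullMaterialize
        ->  Gather
              Workers Planned: 2
              ->  Parallel Seq Scan on source
```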
b) It still requires assigning the XID before starting the insert. I
thought the "parallel mode" would end once the SELECT fully materializes
the results, but it does not work like that. I suppose it works this way
for a reason, and my mental model was off. Or could it work like that?
c) It's not clear to me how to determine if a path is using parallelism,
and whether grouping_planner needs to inject the new materialization node. I
assumed we could just use the parallelModeOK/parallelModeNeeded stuff, but
it does not seem to work like that. In the end I hacked together a simple walker
that goes through the paths and counts Gather/GatherMerge nodes. But I
have a feeling this may not be quite right (and that we surely have a
way to walk path trees, and I just can't find that).
d) I'm not entirely certain how this will behave e.g. with INSERT+SELECT
nested in a CTE. Maybe it won't do the right thing?
regards
--
Tomas Vondra