Re: [PATCH] plpython function causes server panic

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hao Zhang <zhrt1446384557(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] plpython function causes server panic
Date: 2023-12-02 01:04:15
Message-ID: 301509.1701479055@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hao Zhang <zhrt1446384557(at)gmail(dot)com> writes:
> I found a problem when executing the plpython function:
> After the plpython function returns an error, in the same session, if we
> continue to execute
> plpython function, the server panic will be caused.

Thanks for the report! I see the problem is that we're not expecting
BeginInternalSubTransaction to fail. However, I'm not sure I like
this solution, mainly because it's only covering a fraction of the
problem. There are similarly unsafe usages in plperl, pltcl, and
very possibly a lot of third-party PLs. I wonder if there's a way
to deal with this issue without changing these API assumptions.

The only readily-reachable error case in BeginInternalSubTransaction
is this specific one about IsInParallelMode, which was added later
than the original design and evidently with not a lot of thought or
testing. The comment for it speculates about whether we could get
rid of it, so I wonder if our thoughts about this ought to go in that
direction.

In any case, if we do proceed along the lines of catching errors
from BeginInternalSubTransaction, I think your patch is a bit shy
of a load because it doesn't do all the same things that other callers
of PLy_spi_exception_set do. Looking at that made me wonder why
the PLy_spi_exceptions lookup business was being duplicated by every
caller rather than being done once in PLy_spi_exception_set. So
0001 attached is a quick refactoring patch to remove that code
duplication, and then 0002 is your patch adapted to that.

I also attempted to include a test case in 0002, but I'm not very
satisfied with that. Your original test case seemed pretty expensive
for the amount of code coverage it adds, so I tried to make it happen
with debug_parallel_query instead. That does exercise the new code,
but it does not exhibit the crash if run against unpatched code.
That's because with this test case the error is only thrown in worker
processes not the leader, so we don't end up with corrupted Python
state in the leader. That result also points up that the original
test case isn't very reliable for this either: you have to have
parallel_leader_participation on, and you have to have the leader
process at least one row, which makes it pretty timing-sensitive.
On top of all that, the test would become useless if we do eventually
get rid of the !IsInParallelMode restriction. So I'm kind of inclined
to not bother with a test case if this gets to be committed in this
form.

Thoughts anyone?

regards, tom lane

Attachment Content-Type Size
v2-0001-simplify-PLy_spi_exception_set-API.patch text/x-diff 3.6 KB
v2-0002-fix-plpython-subtrans-start-failure.patch text/x-diff 6.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2023-12-02 01:18:59 Re: processes stuck in shutdown following OOM/recovery
Previous Message Tomas Vondra 2023-12-02 00:23:08 Re: logical decoding and replication of sequences, take 2