| From: | Amit Langote <amitlangote09(at)gmail(dot)com> |
|---|---|
| To: | Evan Montgomery-Recht <montge(at)mianetworks(dot)net> |
| Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Eliminating SPI / SQL from some RI triggers - take 3 |
| Date: | 2026-04-08 01:23:59 |
| Message-ID: | CA+HiwqFK8rXTNknxV0MMe++7W08g3kY6eeLK21A1xCrK2Wuk8Q@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Evan,
On Tue, Apr 7, 2026 at 10:00 PM Evan Montgomery-Recht
<montge(at)mianetworks(dot)net> wrote:
>
> Hi Amit,
>
> First time contributing to this project, let me know if I missed
> something or need to adjust what I put together.
>
> I found a crash in the RI fast-path FK check code introduced by
> 2da86c1ef9b and extended by b7b27eb41a5. C-language extensions that
> use SPI to INSERT into tables with multiple FK constraints hit an
> assertion failure (or, without assertions, a server crash) when the
> batch callback fires. I discovered this via PostGIS's topology CI --
> toTopoGeom() uses SPI to insert into edge_data, which has 4 immediate
> FK constraints referencing node and face. PG 18 passes the same test;
> the master crashes.
>
> This appears to be a separate issue from Chao Li's deferred-trigger
> batching bug; the patches touch different files and don't conflict. I
> did do a regression test on the merge referenced both PostgreSQL and
> PostGIS (to validate that this works.)
>
> The problem: ri_FastPathGetEntry() opens pk_rel/idx_rel and creates
> TupleTableSlots, registering them with the current resource owner --
> the SPI portal's. The batch callback ri_FastPathTeardown() only fires
> when query_depth == 0 (via FireAfterTriggerBatchCallbacks), but by
> that time the inner portal has finished and its resource owner has
> released the relation references and TupleDesc pins. The teardown then
> tries to close relations whose refcounts are already zero:
>
> TRAP: failed Assert("rel->rd_refcnt > 0"), File: "relcache.c"
>
> RelationDecrementReferenceCount
> -> RelationClose -> index_close -> ri_FastPathTeardown
> -> ri_FastPathEndBatch -> FireAfterTriggerBatchCallbacks
> -> AfterTriggerEndQuery -> standard_ExecutorFinish
>
> and TupleDesc pins that are no longer tracked by any resource owner
> cause "tupdesc reference not owned by resource owner" errors.
>
> Note that simple PL/pgSQL functions don't trigger this because
> PL/pgSQL's SPI connection spans the entire function call, so the
> portal's resource owner outlives the batch callback. The crash
> requires nested SPI from a C extension, which creates a shorter-lived
> portal.
>
> The attached patch (against master, for application) fixes this by
> transferring both relation references and TupleDesc pins from the
> current resource owner to CurTransactionResourceOwner immediately
> after creating them in ri_FastPathGetEntry(). The transfer uses
> RelationIncrementReferenceCount / PinTupleDesc under the target owner
> followed by RelationDecrementReferenceCount / ReleaseTupleDesc under
> the original. I chose this over switching CurrentResourceOwner around
> the table_open/index_open calls because the latter also affects
> transient buffer pins acquired during catalog lookups inside those
> functions.
>
> ri_FastPathTeardown is updated to clear any buffered tuples (whose
> buffer pins belong to the current resource owner) before switching to
> CurTransactionResourceOwner for the close/drop operations.
>
> The patch also adds a test module (test_spi_func) with a C function
> that executes SQL via SPI_connect/SPI_execute/SPI_finish, since this
> crash cannot be triggered from PL/pgSQL. The test exercises the
> C-level SPI INSERT with multiple FK constraints, FK violations, and
> nested PL/pgSQL-calls-C-SPI (matching the PostGIS call pattern).
>
> This is purely a correctness fix with no performance or backward
> compatibility impact. No documentation changes are needed since this
> is an internal bug fix.
>
> The patch compiles cleanly and passes pgindent, clang-tidy, and
> cppcheck. All 247 regression subtests pass, along with the full meson
> test suite (370 ok, 0 fail, 21 skipped) (skipped due to hardware
> availability on my side this week). I also verified the PostGIS
> topology test (toTopoGeom) passes clean with no warnings, and tested
> abort paths (FK violation, transaction rollback, subtransaction abort
> via EXCEPTION blocks) (not in scope of PostgreSQL but more for my own
> verification that things work). Code coverage on the new lines is
> 100%. Tested on macOS (aarch64) and Linux (aarch64, via Docker).
Thanks for the report and the patch.
I'll need to study this one a bit more closely. Added an open item
for the time being:
https://wiki.postgresql.org/wiki/PostgreSQL_19_Open_Items
> Unrelated to my patch, SonarCloud flagged a potential issue in
> recheck_matched_pk_tuple() (line 3370): the function loops over
> ii_NumIndexKeyAttrs elements of the skeys array, but the caller in
> ri_FastPathFlushArray passes recheck_skey[1] -- an array of exactly
> one element. This is safe because ri_FastPathFlushArray is the
>
> single-column FK path, so ii_NumIndexKeyAttrs is always 1 there.
> However, the function signature doesn't communicate this constraint,
> which flags as CWE-125 (out-of-bounds read) / CERT C ARR30-C. Adding
> an nkeys parameter (like ri_FastPathProbeOne already has) would make
> the contract explicit.
Makes sense. Will push the attached patch for this.
--
Thanks, Amit Langote
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0001-Add-nkeys-parameter-to-recheck_matched_pk_tuple.patch | application/octet-stream | 3.0 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Lukas Fittl | 2026-04-08 01:26:04 | Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? |
| Previous Message | Haibo Yan | 2026-04-08 01:21:45 | Re: Extract numeric filed in JSONB more effectively |