| From: | Junwang Zhao <zhjwpku(at)gmail(dot)com> |
|---|---|
| To: | Amit Langote <amitlangote09(at)gmail(dot)com> |
| Cc: | Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Eliminating SPI / SQL from some RI triggers - take 3 |
| Date: | 2026-03-10 12:28:47 |
| Message-ID: | CAEG8a3+nUFQo4sdPQF9xy0J73J8RFJ5U9A5+_kMosGDaZ+1sXA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
On Mon, Mar 2, 2026 at 11:30 PM Junwang Zhao <zhjwpku(at)gmail(dot)com> wrote:
>
> On Sat, Feb 28, 2026 at 3:08 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> >
> > Hi Junwang,
> >
> > On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <zhjwpku(at)gmail(dot)com> wrote:
> > > On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> > > > I re-ran the benchmarks (same test as yours, different machine):
> > > >
> > > > create table pk (a numeric primary key);
> > > > create table fk (a bigint references pk);
> > > > insert into pk select generate_series(1, 2000000);
> > > > insert into fk select generate_series(1, 2000000, 2);
> > > >
> > > > master: 2444 ms (median of 3 runs)
> > > > 0001: 1382 ms (43% faster)
> > > > 0001+0002: 1202 ms (51% faster, 13% over 0001 alone)
> > >
> > > I can get similar improvement on my old mac intel chip:
> > >
> > > master: 12963.993 ms
> > > 0001: 6641.692 ms, 48.8% faster
> > > 0001+0002: 5771.703 ms, 55.5% faster
> > > >
> > > > Also, with int PK / int FK (1M rows):
> > > >
> > > > create table pk (a int primary key);
> > > > create table fk (a int references pk);
> > > > insert into pk select generate_series(1, 1000000);
> > > > insert into fk select generate_series(1, 1000000);
> > > >
> > > > master: 1000 ms
> > > > 0001: 520 ms (48% faster)
> > > > 0001+0002: 432 ms (57% faster, 17% over 0001 alone)
> > >
> > > master: 11134.583 ms
> > > 0001: 5240.298 ms, 52.9% faster
> > > 0001+0002: 4554.215 ms, 59.1% faster
> >
> > Thanks for testing, good to see similar numbers. I had forgotten to
> > note that these results are when these PK index probes don't do any
> > I/O, though you might be aware of that. Below, I report some numbers
> > that Tomas Vondra shared with me off-list where the probes do have to
> > perform I/O, and there the benefits from this patch set alone are only
> > marginal.
> >
> > > I don't have any additional comments on the patch except one minor nit,
> > > maybe merge the following two if conditions into one, not a strong opinion
> > > though.
> > >
> > > if (use_cache)
> > > {
> > > /*
> > > * The snapshot was registered once when the cache entry was created.
> > > * We just patch curcid to reflect the new command counter.
> > > * SnapshotSetCommandId() only patches process-global statics, not
> > > * registered copies, so we do it directly.
> > > *
> > > * The xmin/xmax/xip fields don't need refreshing: within a single
> > > * statement batch, only curcid changes between rows.
> > > */
> > > Assert(fpentry && fpentry->snapshot != NULL);
> > > snapshot = fpentry->snapshot;
> > > snapshot->curcid = GetCurrentCommandId(false);
> > > }
> > > else
> > > snapshot = RegisterSnapshot(GetLatestSnapshot());
> > >
> > > if (use_cache)
> > > {
> > > pk_rel = fpentry->pk_rel;
> > > idx_rel = fpentry->idx_rel;
> > > scandesc = fpentry->scandesc;
> > > slot = fpentry->slot;
> > > }
> > > else
> > > {
> > > pk_rel = table_open(riinfo->pk_relid, RowShareLock);
> > > idx_rel = index_open(riinfo->conindid, AccessShareLock);
> > > scandesc = index_beginscan(pk_rel, idx_rel,
> > > snapshot, NULL,
> > > riinfo->nkeys, 0);
> > > slot = table_slot_create(pk_rel, NULL);
> > > }
> >
> > Good idea, done.
> >
> > While polishing 0002, I revisited the snapshot caching semantics. The
> > previous commit message hand-waved about only curcid changing between
> > rows, but GetLatestSnapshot() also reflects other backends' commits,
> > so reusing the snapshot is a deliberate semantic change from the SPI
> > path. I think it's safe because curcid is all we need for
> > intra-statement visibility: concurrent commits either already happened
> > before our snapshot (and are visible) or are racing with our statement
> > and wouldn't be seen reliably even with per-row snapshots, since the
> > order in which FK rows are checked is nondeterministic; and
> > LockTupleKeyShare prevents the PK row from disappearing regardless. In
> > essence, we're treating all the FK checks within a trigger-firing
> > cycle as a single plan execution that happens to scan N rows, rather
> > than N independent SPI queries each taking a fresh snapshot. That's
> > the natural model -- a normal SELECT ... FOR KEY SHARE plan doesn't
> > re-take GetLatestSnapshot() between rows either.
> >
> > Similarly, the permission check (schema USAGE + table SELECT) is now
> > done once at cache entry creation in ri_FastPathGetEntry() rather than
> > on every flush.
>
> nice improvement.
>
> > The RI check runs as the PK table owner, so we're
> > verifying that the owner can access their own table -- a condition
> > that won't change unless someone explicitly revokes from the owner,
> > which would also break the SPI path.
> >
> > > > David Rowley mentioned off-list that it might be worth batching
> > > > multiple FK values into a single index probe, leveraging the
> > > > ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
> > > > to buffer FK values across trigger invocations in the per-constraint
> > > > cache (0002 already has the right structure for this), build a
> > > > SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
> > > > pages in one sorted traversal instead of one tree descent per row. The
> > > > locking and recheck would still be per-tuple, but the index traversal
> > > > cost drops significantly. Single-column FKs are the obvious starting
> > > > point. That seems worth exploring but can be done as a separate patch
> > > > on top of this.
> > >
> > > I will take a look at this in the following weeks.
> >
> > I ended up going ahead with the batching and SAOP idea that David
> > mentioned -- I had a proof-of-concept working shortly after posting v3
> > and kept iterating on it. So attached set is now:
> >
> > 0001 - Core fast path (your 0001+0002 reworked, as before)
> >
> > 0002 - Per-batch resource caching (PK relation, index, scandesc, snapshot)
> >
> > 0003 - FK row buffering: materialize FK tuples into a per-constraint
> > batch buffer (64 rows), flush when full or at batch end
> >
> > 0004 - SK_SEARCHARRAY for single-column FKs: build an array from the
> > buffered FK values and do one index scan instead of 64 separate tree
> > descents. Multi-column FKs fall back to a per-row loop.
> >
> > 0003 is pure infrastructure -- it doesn't improve performance on its
> > own because the per-row index descent still dominates. The payoff
> > comes in 0004.
> >
> > Numbers (same machine as before, median of 3 runs):
> >
> > numeric PK / bigint FK, 1M rows:
> > master: 2487 ms
> > 0001..0004: 1168 ms (2.1x)
> >
> > int PK / int FK, 500K rows:
> > master: 1043 ms
> > 0001..0004: 335 ms (3.1x)
> >
> > The int/int case benefits most because the per-row cost is lower, so
> > the SAOP traversal savings are a larger fraction of the total. The
> > numeric/bigint case still sees a solid improvement despite the
> > cross-type cast overhead.
> >
> > Tomas Vondra also tested with an I/O-intensive workload (dataset
> > larger than shared_buffers, combined with his and Peter Geoghegan's
> > I/O prefetching patches) and confirmed that the batching + SAOP
> > approach helps there too, not just in the CPU-bound / memory-resident
> > case. In fact he showed that the patches here don't make a big dent
> > when the main bottleneck is I/O as shown in numbers that he shared in
> > an off-list email:
> >
> > master: 161617 ms
> > ri-check (0001..0004): 149446 ms (1.08x)
> > ri-check + i/o prefetching: 50885 ms (3.2x)
> >
> > So the RI patches alone only give ~8% here since most time is waiting
> > on reads. But the batching gives the prefetch machinery a window of
> > upcoming probes to issue readahead against, so the two together yield
> > 3.2x.
>
> impressive!
>
> >
> > Tomas also caught a memory context bug in the batch flush path: the
> > cached scandesc lives in TopTransactionContext, but the btree AM
> > defers _bt_preprocess_keys allocation to the first getnext call, which
> > pallocs into CurrentMemoryContext. If that's a short-lived
> > per-trigger-row context, the scandesc has dangling pointers on the
> > next rescan. Fixed by switching to TopTransactionContext before the
> > probe loop.
> >
> > Finally, I've fixed a number of other small and not-so-small bugs
> > found while polishing the old patches and made other stylistic
> > improvements. One notable change is that I introduced a FastPathMeta
>
> Yeah, this is much better than the fpmeta_valid field.
>
> > struct to store the fast path metadata instead of dumping those arrays
> > in the RI_ConstraintInfo. It's allocated lazily on first use and holds
> > the per-key compare entries, operator procedures, and index strategy
> > info needed by the scan key construction, so RI_ConstraintInfo doesn't
> > pay for them when the fast path isn't used.
> >
> >
> > On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <zhjwpku(at)gmail(dot)com> wrote:
> > >
> > > Hi Amit,
> > >
> > > On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> > > >
> > > > Hi Junwang,
> > > >
> > > > On Mon, Dec 1, 2025 at 3:09 PM Junwang Zhao <zhjwpku(at)gmail(dot)com> wrote:
> > > > > As Amit has already stated, we are approaching a hybrid "fast-path + fallback"
> > > > > design.
> > > > >
> > > > > 0001 adds a fast path optimization for foreign key constraint checks
> > > > > that bypasses the SPI executor, the fast path applies when the referenced
> > > > > table is not partitioned, and the constraint does not involve temporal
> > > > > semantics.
> > > > >
> > > > > With the following test:
> > > > >
> > > > > create table pk (a numeric primary key);
> > > > > create table fk (a bigint references pk);
> > > > > insert into pk select generate_series(1, 2000000);
> > > > >
> > > > > head:
> > > > >
> > > > > [local] zhjwpku(at)postgres:5432-90419=# insert into fk select
> > > > > generate_series(1, 2000000, 2);
> > > > > INSERT 0 1000000
> > > > > Time: 13516.177 ms (00:13.516)
> > > > >
> > > > > [local] zhjwpku(at)postgres:5432-90419=# update fk set a = a + 1;
> > > > > UPDATE 1000000
> > > > > Time: 15057.638 ms (00:15.058)
> > > > >
> > > > > patched:
> > > > >
> > > > > [local] zhjwpku(at)postgres:5432-98673=# insert into fk select
> > > > > generate_series(1, 2000000, 2);
> > > > > INSERT 0 1000000
> > > > > Time: 8248.777 ms (00:08.249)
> > > > >
> > > > > [local] zhjwpku(at)postgres:5432-98673=# update fk set a = a + 1;
> > > > > UPDATE 1000000
> > > > > Time: 10117.002 ms (00:10.117)
> > > > >
> > > > > 0002 cache fast-path metadata used by the index probe, at the current
> > > > > time only comparison operator hash entries, operator function OIDs
> > > > > and strategy numbers and subtypes for index scans. But this cache
> > > > > doesn't buy any performance improvement.
> > > > >
> > > > > Caching additional metadata should improve performance for foreign key checks.
> > > > >
> > > > > Amit suggested introducing a mechanism for ri_triggers.c to register a
> > > > > cleanup callback in the EState, which AfterTriggerEndQuery() could then
> > > > > invoke to release per-statement cached metadata (such as the IndexScanDesc).
> > > > > However, I haven't been able to implement this mechanism yet.
> > > >
> > > > Thanks for working on this. I've taken your patches as a starting
> > > > point and reworked the series into two patches (attached): 1st is your
> > > > 0001+0002 as the core patch that adds a gated fast-path alternative to
> > > > SPI and 2nd where I added per-statement resource caching. Doing the
> > > > latter turned out to be not so hard thanks to the structure you chose
> > > > to build the core fast path. Good call on adding the RLS and ACL test
> > > > cases, btw.
> > > >
> > > > So, 0001 is a functionally complete fast path: concurrency handling,
> > > > REPEATABLE READ crosscheck, cross-type operators, security context,
> > > > and metadata caching. 0002 implements the per-statement resource
> > > > caching we discussed, though instead of sharing the EState between
> > > > trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback
> > > > mechanism that fires at the end of each trigger-firing cycle
> > > > (per-statement for immediate constraints, or until COMMIT for deferred
> > > > ones). It layers resource caching on top so that the PK relation,
> > > > index, scan descriptor, and snapshot stay open across all FK trigger
> > > > invocations within a single trigger-firing cycle rather than being
> > > > opened and closed per row.
> > > >
> > > > Note that the previous 0002 (metadata caching) is folded into 0001,
> > > > and most of the new fast-path logic added in 0001 now lives in
> > > > ri_FastPathCheck() rather than inline in RI_FKey_check(), so the
> > > > RI_FKey_check diff is just the gating call and SPI fallback.
> > > >
> > > > I re-ran the benchmarks (same test as yours, different machine):
> > > >
> > > > create table pk (a numeric primary key);
> > > > create table fk (a bigint references pk);
> > > > insert into pk select generate_series(1, 2000000);
> > > > insert into fk select generate_series(1, 2000000, 2);
> > > >
> > > > master: 2444 ms (median of 3 runs)
> > > > 0001: 1382 ms (43% faster)
> > > > 0001+0002: 1202 ms (51% faster, 13% over 0001 alone)
> > >
> > > I can get similar improvement on my old mac intel chip:
> > >
> > > master: 12963.993 ms
> > > 0001: 6641.692 ms, 48.8% faster
> > > 0001+0002: 5771.703 ms, 55.5% faster
> > >
> > > >
> > > > Also, with int PK / int FK (1M rows):
> > > >
> > > > create table pk (a int primary key);
> > > > create table fk (a int references pk);
> > > > insert into pk select generate_series(1, 1000000);
> > > > insert into fk select generate_series(1, 1000000);
> > > >
> > > > master: 1000 ms
> > > > 0001: 520 ms (48% faster)
> > > > 0001+0002: 432 ms (57% faster, 17% over 0001 alone)
> > >
> > > master: 11134.583 ms
> > > 0001: 5240.298 ms, 52.9% faster
> > > 0001+0002: 4554.215 ms, 59.1% faster
> > >
> > > >
> > > > The incremental gain from 0002 comes from eliminating per-row relation
> > > > open/close, scan begin/end, slot alloc/free, and replacing per-row
> > > > GetSnapshotData() with only curcid adjustment on the registered
> > > > snapshot copy in the cache.
> > > >
> > > > The two current limitations are partitioned referenced tables and
> > > > temporal foreign keys. Partitioned PKs are relatively uncommon in
> > > > practice, so the non-partitioned case should cover most FK workloads,
> > > > so I'm not sure it's worth the added complexity to support them.
> > > > Temporal FKs are inherently multi-row, so they're a poor fit for a
> > > > single-probe fast path.
> > > >
> > > > David Rowley mentioned off-list that it might be worth batching
> > > > multiple FK values into a single index probe, leveraging the
> > > > ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
> > > > to buffer FK values across trigger invocations in the per-constraint
> > > > cache (0002 already has the right structure for this), build a
> > > > SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
> > > > pages in one sorted traversal instead of one tree descent per row. The
> > > > locking and recheck would still be per-tuple, but the index traversal
> > > > cost drops significantly. Single-column FKs are the obvious starting
> > > > point. That seems worth exploring but can be done as a separate patch
> > > > on top of this.
> > >
> > > I will take a look at this in the following weeks.
> > >
> > > >
> > > > I think the series is in reasonable shape but would appreciate extra
> > > > eyeballs, especially on the concurrency handling in ri_LockPKTuple()
> > > > in 0001 and the snapshot lifecycle in 0002. Or anything else that
> > > > catches one's eye.
> > > >
> > > > --
> > > > Thanks, Amit Langote
> > >
> > > I don't have any additional comments on the patch except one minor nit,
> > > maybe merge the following two if conditions into one, not a strong opinion
> > > though.
> > >
> > > if (use_cache)
> > > {
> > > /*
> > > * The snapshot was registered once when the cache entry was created.
> > > * We just patch curcid to reflect the new command counter.
> > > * SnapshotSetCommandId() only patches process-global statics, not
> > > * registered copies, so we do it directly.
> > > *
> > > * The xmin/xmax/xip fields don't need refreshing: within a single
> > > * statement batch, only curcid changes between rows.
> > > */
> > > Assert(fpentry && fpentry->snapshot != NULL);
> > > snapshot = fpentry->snapshot;
> > > snapshot->curcid = GetCurrentCommandId(false);
> > > }
> > > else
> > > snapshot = RegisterSnapshot(GetLatestSnapshot());
> > >
> > > if (use_cache)
> > > {
> > > pk_rel = fpentry->pk_rel;
> > > idx_rel = fpentry->idx_rel;
> > > scandesc = fpentry->scandesc;
> > > slot = fpentry->slot;
> > > }
> > > else
> > > {
> > > pk_rel = table_open(riinfo->pk_relid, RowShareLock);
> > > idx_rel = index_open(riinfo->conindid, AccessShareLock);
> > > scandesc = index_beginscan(pk_rel, idx_rel,
> > > snapshot, NULL,
> > > riinfo->nkeys, 0);
> > > slot = table_slot_create(pk_rel, NULL);
> > > }
> > >
> > > --
> > > Regards
> > > Junwang Zhao
> >
> >
> >
> > --
> > Thanks, Amit Langote
>
>
>
> --
> Regards
> Junwang Zhao
I had an offline discussion with Amit today. There were a few small things
that could still be improved, so I'm posting a new version of the patch set.
1.
+ if (ri_fastpath_is_applicable(riinfo))
+ {
+ bool found = ri_FastPathCheck(riinfo, fk_rel, newslot);
+
+ if (found)
+ return PointerGetDatum(NULL);
+
+ /*
+ * ri_FastPathCheck opens pk_rel internally; we need it for
+ * ri_ReportViolation. Re-open briefly.
+ */
+ pk_rel = table_open(riinfo->pk_relid, RowShareLock);
+ ri_ReportViolation(riinfo, pk_rel, fk_rel,
+ newslot, NULL,
+ RI_PLAN_CHECK_LOOKUPPK, false, false);
+ }
Moved ri_ReportViolation into ri_FastPathCheck, so the table_open here is
no longer needed and ri_FastPathCheck now returns void. Since Amit
agreed this is the right approach, I included it directly in v5-0001.
2.
After adding the batch fast path, the original ri_FastPathCheck is used
only by the ALTER TABLE validation path. That path cannot use the cache
because the registered AfterTriggerBatch callback never runs for it, so
the use_cache branch can be removed.
I made this change in v5-0004 and also updated some related comments.
Once we agree the changes are correct, it can be merged into v5-0003.
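For reference, with the use_cache branch gone, the probe setup in the
validation path reduces to roughly the following (a sketch assembled from
the else branches of the snippet quoted upthread, not the exact v5 code):

```c
/* Sketch only -- names taken from the snippet quoted upthread. */
snapshot = RegisterSnapshot(GetLatestSnapshot());

pk_rel = table_open(riinfo->pk_relid, RowShareLock);
idx_rel = index_open(riinfo->conindid, AccessShareLock);
scandesc = index_beginscan(pk_rel, idx_rel,
                           snapshot, NULL,
                           riinfo->nkeys, 0);
slot = table_slot_create(pk_rel, NULL);
```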
3.
+ fk_slot = MakeSingleTupleTableSlot(RelationGetDescr(fk_rel),
+ &TTSOpsHeapTuple);
ri_FastPathBatchFlush creates a new fk_slot but does not cache it in
RI_FastPathEntry. I tried caching it in v5-0006 and ran some benchmarks,
but it didn't show much improvement. That is probably because the slot
creation function is called once per batch rather than once per row, so the
overall impact is minimal. I'm posting it here for Amit to take a look and
decide whether we should adopt or drop it, since I mentioned the idea to
him earlier.
4.
ri_FastPathFlushArray currently uses SK_SEARCHARRAY only for
single-column checks. I asked whether this could be extended to support
multi-column cases, and Amit encouraged me to look into it.
After a brief investigation, it seems that ScanKeyEntryInitialize allows
passing only a single subtype/collation/procedure per scan key, which makes
it hard to handle multiple key columns with different types. Based on that,
my current understanding is that SK_SEARCHARRAY may not work for
multi-column checks.
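To illustrate the constraint: in the single-column case the whole batch fits
in one scan key whose argument is an array, and that key carries exactly one
subtype, one collation, and one comparison procedure. A rough sketch (the
fpmeta field names are illustrative, not the actual v5 code):

```c
/* Sketch only: single-column SAOP key; fpmeta field names are made up. */
ArrayType  *arr = construct_array(values, nvalues,
                                  fpmeta->eq_subtype,   /* element type */
                                  typlen, typbyval, typalign);
ScanKeyData skey;

ScanKeyEntryInitialize(&skey,
                       SK_SEARCHARRAY,
                       1,                       /* PK index column */
                       BTEqualStrategyNumber,
                       fpmeta->eq_subtype,      /* one subtype ...      */
                       fpmeta->collation,       /* ... one collation ...*/
                       fpmeta->eq_proc,         /* ... one compare proc */
                       PointerGetDatum(arr));
```

With multiple key columns, each column would need its own key with its own
subtype/collation/procedure, so the batch no longer collapses into a single
array argument this way.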
--
Regards
Junwang Zhao
| Attachment | Content-Type | Size |
|---|---|---|
| v5-0005-Use-SK_SEARCHARRAY-for-batched-fast-path-FK-probe.patch | application/octet-stream | 15.0 KB |
| v5-0006-Reuse-FK-tuple-slot-across-fast-path-batches.patch | application/octet-stream | 3.7 KB |
| v5-0002-Cache-per-batch-resources-for-fast-path-foreign-k.patch | application/octet-stream | 30.5 KB |
| v5-0004-Refine-fast-path-FK-validation-path.patch | application/octet-stream | 5.5 KB |
| v5-0003-Buffer-FK-rows-for-batched-fast-path-probing.patch | application/octet-stream | 12.4 KB |
| v5-0001-Add-fast-path-for-foreign-key-constraint-checks.patch | application/octet-stream | 30.2 KB |