Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array
Date: 2025-10-23 08:17:44
Message-ID: CABPTF7X3w9zQbkakb9BmhVU1j0RNs4BoOr1sw9h0d=PU+WLm1Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Sawada-san, Michael,

Thanks for your comments on this patch.

On Thu, Oct 23, 2025 at 8:28 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, Oct 20, 2025 at 11:13 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> >
> > Hi,
> >
> > Thanks for looking into this.
> >
> > On Tue, Oct 21, 2025 at 1:05 PM Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
> > >
> > > On Tue, 21 Oct 2025 at 04:31, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > > >
> > > > On Sat, Oct 18, 2025 at 01:59:40PM +0500, Kirill Reshke wrote:
> > > > > Indeed, these changes look correct.
> > > > > I wonder why b89e151054a0 did this place this way, hope we do not miss
> > > > > anything here.
> > > >
> > > > Perhaps a lack of time back in 2014? It feels like an item where we
> > > > would need to research a bit some of the past threads, and see if this
> > > > has been discussed, or if there were other potential alternatives
> > > > discussed. This is not saying that what you are doing in this
> > > > proposal is actually bad, but it's a bit hard to say what an
> > > > "algorithm" should look like in this specific code path with XID
> > > > manipulations. Perhaps since 2014, we may have other places in the
> > > > tree that share similar characteristics as what's done here.
> > > >
> > > > So it feels like this needs a bit more historical investigation first,
> > > > rather than saying that your proposal is the best choice on the table.

Following these suggestions, I carefully searched the mailing list
archives and found no reports of performance issues directly related
to this code path. I also examined other parts of the codebase for
similar patterns. Components like integerset might share some
characteristics with SnapBuildPurgeOlderTxn, but they have constraints
that make them not directly applicable here. I am not very familiar
with the whole tree, so the investigation might not be exhaustive.

> > >
> > > Commit b89e151054a0 comes from [0]
> > > Comment of SnapBuildPurgeCommittedTxn tracks to [1] (it was in form
> > > "XXX: Neater algorithm?")
> > >
> > > Between these two messages, it was not disucccesseed...
> > >
> > > I will also study other related threads like [2], but i don't think
> > > they will give more insight here.
> > >
> > > [0] https://www.postgresql.org/message-id/20140303162652.GB16654%40awork2.anarazel.de
> > > [1] https://www.postgresql.org/message-id/20140115002223.GA17204%40awork2.anarazel.de
> > > [2] https://www.postgresql.org/message-id/flat/20130914204913.GA4071%40awork2.anarazel.de#:~:text=20130914204913.GA4071%40awork2.anarazel.de
> >
> > Introducing logical decoding was a major feature, and I assume there
> > were more critical design decisions and trade-offs to consider at that
> > time.
> >
> > I proposed this refactor not because it is the most significant
> > optimization, but because it seems to be a low-hanging fruit with
> > clear benefits. By using an in-place two-pointer compaction, we can
> > eliminate the extra memory allocation and copy-back step without
> > introducing risks to this well-tested code path.
>
> I agree the proposed in-pace update is better than the current
> copy-and-iterating approach. While the benefit might not be visible as
> it's not a known hot-path, I find that the proposed patch makes sense
> to improve the current codes. It could help some workloads where there
> are many DDLs (e.g., creating temporary tables in many transactions).
>

I also agree this approach offers better efficiency within the scope of
current SnapBuildPurgeOlderTxn.

> >
> > A comparable optimization exists in KnownAssignedXidsCompress() which
> > uses the same algorithm to remove stale XIDs without workspace
> > allocation. That implementation also adds a lazy compaction heuristic
> > that delays compaction until a threshold of removed entries is
> > reached, amortizing the O(N) cost across multiple operations.
> >
> > The comment above the data structure mentions the trade-off of keeping
> > the committed.xip array sorted versus unsorted. If the array were
> > sorted, we could use a binary search combined with memmove to compact
> > it efficiently, achieving O(log n + n) complexity for purging.
> > However, that design would increase the complexity of
> > SnapBuildAddCommittedTxn from O(1) to O(n) and "more complicated wrt
> > wraparound".
> >
> > /*
> > * Array of committed transactions that have modified the catalog.
> > *
> > * As this array is frequently modified we do *not* keep it in
> > * xidComparator order. Instead we sort the array when building &
> > * distributing a snapshot.
> > *
> > * TODO: It's unclear whether that reasoning has much merit. Every
> > * time we add something here after becoming consistent will also
> > * require distributing a snapshot. Storing them sorted would
> > * potentially also make it easier to purge (but more complicated wrt
> > * wraparound?). Should be improved if sorting while building the
> > * snapshot shows up in profiles.
> > */
> > TransactionId *xip;
> > } committed;
> >
> > /*
> > * Keep track of a new catalog changing transaction that has committed.
> > */
> > static void
> > SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
> > {
> > Assert(TransactionIdIsValid(xid));
> >
> > if (builder->committed.xcnt == builder->committed.xcnt_space)
> > {
> > builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
> >
> > elog(DEBUG1, "increasing space for committed transactions to %u",
> > (uint32) builder->committed.xcnt_space);
> >
> > builder->committed.xip = repalloc(builder->committed.xip,
> > builder->committed.xcnt_space * sizeof(TransactionId));
> > }
> >
> > /*
> > * TODO: It might make sense to keep the array sorted here instead of
> > * doing it every time we build a new snapshot. On the other hand this
> > * gets called repeatedly when a transaction with subtransactions commits.
> > */
> > builder->committed.xip[builder->committed.xcnt++] = xid;
> > }
> >
> > It might be worth profiling this function to evaluate whether
> > maintaining a sorted array could bring potential benefits, although
> > accurately measuring its end-to-end impact could be difficult if it
> > isn’t a known hotspot. I also did a brief search on the mailing list
> > and found no reports of performance concerns or related proposals to
> > optimize this part of the code.
>
> It might also be worth researching what kind of workloads would need a
> better algorithm in terms of storing/updating xip and subxip arrays
> since it would be the primary motivation. Also, otherwise we cannot
> measure the real-world impact of a new algorithm. Having said that,
> find that it would be discussed and developed separately from the
> proposed patch on this thread.

+1 for researching the workloads that might need a sorted array and
more efficient algorithm. This exploration isn’t limited to the scope
of SnapBuildPurgeOlderTxn but relates more broadly to the overall
snapbuild process, which might be worth discussing in a separate
thread as suggested.

* TODO: It's unclear whether that reasoning has much merit. Every
* time we add something here after becoming consistent will also
* require distributing a snapshot. Storing them sorted would
* potentially also make it easier to purge (but more complicated wrt
* wraparound?). Should be improved if sorting while building the
* snapshot shows up in profiles.

I also constructed an artificial workload to try to surface the qsort
call in SnapBuildBuildSnapshot, though such a scenario seems very
unlikely to occur in production.

for ((c=1; c<=DDL_CLIENTS; c++)); do
(
local seq=1
while (( $(date +%s) < tB_end )); do
local tbl="hp_ddl_${c}_$seq"
"$psql" -h 127.0.0.1 -p "$PORT" -d postgres -c "
BEGIN;
CREATE TABLE ${tbl} (id int, data text);
CREATE INDEX idx_${tbl} ON ${tbl} (id);
INSERT INTO ${tbl} VALUES ($seq, 'd');
DROP TABLE ${tbl};
COMMIT;" >/dev/null 2>&1 || true
seq=$((seq+1))
done
)

97.02% 0.00% postgres postgres [.] postmaster_child_launch
|
---postmaster_child_launch
|
|--94.93%--BackendMain
| PostgresMain
| exec_replication_command
| StartLogicalReplication
| |
| --94.92%--WalSndLoop
| |
| |--92.24%--XLogSendLogical
| | |
| |
--91.63%--LogicalDecodingProcessRecord
| | |
| |
|--89.55%--xact_decode
| | |
|
| | |
--89.23%--DecodeCommit
| | |
|
| | |
|--64.64%--SnapBuildCommitTxn
| | |
| |
| | |
| |--62.60%--SnapBuildBuildSnapshot
| | |
| | |
| | |
| | --62.33%--pg_qsort
| | |
| | |
| | |
| | |--56.84%--pg_qsort

Best,
Xuneng

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message XYenon 2025-10-23 08:17:56 Proposal: Allow excluding specific file patterns in pg_checksums
Previous Message Chao Li 2025-10-23 08:16:27 Re: Optimize LISTEN/NOTIFY