Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array
Date: 2025-10-29 12:17:47
Message-ID: CABPTF7Uf2_71fL-rLirhptW3EN4sW9Y-FnJv6UnBWs4YDd7kiw@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Sat, Oct 25, 2025 at 2:36 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Thu, Oct 23, 2025 at 1:17 AM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> >
> > Hi Sawada-san, Michael,
> >
> > Thanks for your comments on this patch.
> >
> > On Thu, Oct 23, 2025 at 8:28 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Oct 20, 2025 at 11:13 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thanks for looking into this.
> > > >
> > > > On Tue, Oct 21, 2025 at 1:05 PM Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Tue, 21 Oct 2025 at 04:31, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > > > > >
> > > > > > On Sat, Oct 18, 2025 at 01:59:40PM +0500, Kirill Reshke wrote:
> > > > > > > Indeed, these changes look correct.
> > > > > > > I wonder why b89e151054a0 implemented this place this way; I hope
> > > > > > > we are not missing anything here.
> > > > > >
> > > > > > Perhaps a lack of time back in 2014? It feels like an item where we
> > > > > > would need to research a bit some of the past threads, and see if this
> > > > > > has been discussed, or if there were other potential alternatives
> > > > > > discussed. This is not saying that what you are doing in this
> > > > > > proposal is actually bad, but it's a bit hard to say what an
> > > > > > "algorithm" should look like in this specific code path with XID
> > > > > > manipulations. Perhaps since 2014, we may have other places in the
> > > > > > tree that share similar characteristics as what's done here.
> > > > > >
> > > > > > So it feels like this needs a bit more historical investigation first,
> > > > > > rather than saying that your proposal is the best choice on the table.
> >
> > Following these suggestions, I carefully searched the mailing list
> > archives and found no reports of performance issues directly related
> > to this code path. I also examined other parts of the codebase for
> > similar patterns. Components like integerset might share some
> > characteristics with SnapBuildPurgeOlderTxn, but they have constraints
> > that make them not directly applicable here. I am not very familiar
> > with the whole tree, so the investigation might not be exhaustive.
>
> Thank you for looking through the archives. In logical replication,
> performance problems typically show up as replication delays. Since
> logical replication involves many different components and processes,
> it's quite rare for investigations to trace problems back to this
> specific piece of code. However, I still believe it's important to
> optimize the performance of logical decoding itself.
>
> >
> > > >
> > > > A comparable optimization exists in KnownAssignedXidsCompress() which
> > > > uses the same algorithm to remove stale XIDs without workspace
> > > > allocation. That implementation also adds a lazy compaction heuristic
> > > > that delays compaction until a threshold of removed entries is
> > > > reached, amortizing the O(N) cost across multiple operations.
> > > >
> > > > The comment above the data structure mentions the trade-off of keeping
> > > > the committed.xip array sorted versus unsorted. If the array were
> > > > sorted, we could use a binary search combined with memmove to compact
> > > > it efficiently, achieving O(log n + n) complexity for purging.
> > > > However, that design would increase the complexity of
> > > > SnapBuildAddCommittedTxn from O(1) to O(n) and "more complicated wrt
> > > > wraparound".
> > > >
> > > > /*
> > > > * Array of committed transactions that have modified the catalog.
> > > > *
> > > > * As this array is frequently modified we do *not* keep it in
> > > > * xidComparator order. Instead we sort the array when building &
> > > > * distributing a snapshot.
> > > > *
> > > > * TODO: It's unclear whether that reasoning has much merit. Every
> > > > * time we add something here after becoming consistent will also
> > > > * require distributing a snapshot. Storing them sorted would
> > > > * potentially also make it easier to purge (but more complicated wrt
> > > > * wraparound?). Should be improved if sorting while building the
> > > > * snapshot shows up in profiles.
> > > > */
> > > > TransactionId *xip;
> > > > } committed;
> > > >
> > > > /*
> > > > * Keep track of a new catalog changing transaction that has committed.
> > > > */
> > > > static void
> > > > SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
> > > > {
> > > > Assert(TransactionIdIsValid(xid));
> > > >
> > > > if (builder->committed.xcnt == builder->committed.xcnt_space)
> > > > {
> > > > builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
> > > >
> > > > elog(DEBUG1, "increasing space for committed transactions to %u",
> > > > (uint32) builder->committed.xcnt_space);
> > > >
> > > > builder->committed.xip = repalloc(builder->committed.xip,
> > > > builder->committed.xcnt_space * sizeof(TransactionId));
> > > > }
> > > >
> > > > /*
> > > > * TODO: It might make sense to keep the array sorted here instead of
> > > > * doing it every time we build a new snapshot. On the other hand this
> > > > * gets called repeatedly when a transaction with subtransactions commits.
> > > > */
> > > > builder->committed.xip[builder->committed.xcnt++] = xid;
> > > > }
> > > >
> > > > It might be worth profiling this function to evaluate whether
> > > > maintaining a sorted array could bring potential benefits, although
> > > > accurately measuring its end-to-end impact could be difficult if it
> > > > isn’t a known hotspot. I also did a brief search on the mailing list
> > > > and found no reports of performance concerns or related proposals to
> > > > optimize this part of the code.
> > >
> > > It might also be worth researching what kind of workloads would need a
> > > better algorithm in terms of storing/updating xip and subxip arrays
> > > since it would be the primary motivation. Also, otherwise we cannot
> > > measure the real-world impact of a new algorithm. Having said that, I
> > > think it should be discussed and developed separately from the
> > > proposed patch on this thread.
> >
> > +1 for researching the workloads that might need a sorted array and
> > more efficient algorithm. This exploration isn’t limited to the scope
> > of SnapBuildPurgeOlderTxn but relates more broadly to the overall
> > snapbuild process, which might be worth discussing in a separate
> > thread as suggested.
> >
> > * TODO: It's unclear whether that reasoning has much merit. Every
> > * time we add something here after becoming consistent will also
> > * require distributing a snapshot. Storing them sorted would
> > * potentially also make it easier to purge (but more complicated wrt
> > * wraparound?). Should be improved if sorting while building the
> > * snapshot shows up in profiles.
> >
> > I also constructed an artificial workload to try to surface the qsort
> > call in SnapBuildBuildSnapshot, though such a scenario seems very
> > unlikely to occur in production.
> >
> > for ((c=1; c<=DDL_CLIENTS; c++)); do
> > (
> > local seq=1
> > while (( $(date +%s) < tB_end )); do
> > local tbl="hp_ddl_${c}_$seq"
> > "$psql" -h 127.0.0.1 -p "$PORT" -d postgres -c "
> > BEGIN;
> > CREATE TABLE ${tbl} (id int, data text);
> > CREATE INDEX idx_${tbl} ON ${tbl} (id);
> > INSERT INTO ${tbl} VALUES ($seq, 'd');
> > DROP TABLE ${tbl};
> > COMMIT;" >/dev/null 2>&1 || true
> > seq=$((seq+1))
> > done
> > )
>
> Interesting. To be honest, I think this scenario might actually occur
> in practice, especially in cases where users frequently use CREATE
> TEMP TABLE ... ON COMMIT DROP.

The attached file shows the flamegraph for this workload.

for ((c=1; c<=CLIENTS_DDL; c++)); do
    (
        local seq=1
        while (( $(date +%s) < tB_end )); do
            local tbl="hp_temp_${c}_$seq"
            $psql_cmd -d postgres -c "
                BEGIN;
                CREATE TEMP TABLE ${tbl} (id int, data text) ON COMMIT DROP;
                CREATE INDEX idx_${tbl} ON ${tbl} (id);
                INSERT INTO ${tbl} VALUES ($seq, 'd');
                COMMIT;" >/dev/null 2>&1 || true
            seq=$((seq+1))
        done
    ) &
    pidsB+=($!)
done

I’ve been thinking about possible ways to maintain a sorted array to
optimize this case and the purge function. Below are some preliminary
ideas. If any of them seems reasonable, I’d like to start a new thread
and prepare a patch in this direction.

1) Batch insertion with globally sorted array (raw uint32 order)

Goal: Maintain committed.xip sorted and deduplicated to eliminate
repeated qsorts.

Mechanism:
- In SnapBuildCommitTxn, collect all XIDs/subxids to be tracked into a
  local vector (length m)
- Sort and deduplicate the batch: O(m log m)
- Reverse-merge the batch into the global array, deduplicating on the
  fly: O(N + m)

Invariant: committed.xip[0..xcnt) is always raw-sorted (xidComparator
order) and deduplicated

Complexity improvement:
Before: O((N+m) log (N+m)) per build
After: O(m log m + N) per commit; snapshot build can skip qsort entirely

Purge:
Remains O(N) linear scan. Raw sorting doesn't enable binary search
because the purge predicate uses wraparound-aware semantics
(NormalTransactionIdPrecedes), and committed.xip can span epochs, so
numeric order ≠ logical XID order.
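
To make the merge step concrete, here is a minimal standalone sketch
(hypothetical helper name, plain uint32 values standing in for
TransactionIds; not the actual snapbuild code). The sorted, deduplicated
batch is reverse-merged into the sorted global array in place, dropping
duplicates on the fly:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Reverse-merge a sorted, deduplicated batch of length m into the
 * sorted, deduplicated array xip of length n, dropping duplicates on
 * the fly.  The caller must have grown xip to hold n + m entries.
 * Returns the new length (n + m minus the number of duplicates).
 */
static size_t
merge_sorted_batch(uint32_t *xip, size_t n,
				   const uint32_t *batch, size_t m)
{
	size_t		i = n;			/* read position in xip */
	size_t		j = m;			/* read position in batch */
	size_t		w = n + m;		/* write position, moving down */

	while (j > 0)
	{
		if (i > 0 && xip[i - 1] > batch[j - 1])
			xip[--w] = xip[--i];
		else if (i > 0 && xip[i - 1] == batch[j - 1])
		{
			/* duplicate: keep one copy, consume both inputs */
			xip[--w] = xip[--i];
			j--;
		}
		else
			xip[--w] = batch[--j];
	}

	/*
	 * xip[0..i) is already in place.  Duplicates leave a gap of (w - i)
	 * slots between the two regions; one memmove closes it.
	 */
	if (w > i)
		memmove(xip + i, xip + w, (n + m - w) * sizeof(uint32_t));
	return (n + m) - (w - i);
}
```

For example, merging batch {5, 20, 35} into {10, 20, 30} yields
{5, 10, 20, 30, 35}; the closing memmove only fires when the batch
actually overlapped the array.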

2) Adopt FullTransactionId for sublinear purge (theoretically possible)

Rationale:
To make purge O(log N), the array needs to be sorted under the same
ordering relation as the purge predicate. Wraparound-aware comparison
of 32-bit XIDs is not a total order when the set spans epochs.
FullTransactionId (epoch<<32 | xid) could provide a true total order.
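
To illustrate why the 32-bit comparison cannot back a sorted array:
TransactionIdPrecedes is, after the special-XID checks, a signed modular
comparison, and once the tracked values span more than half the XID
space it can produce cycles. A minimal standalone demo (a simplified
stand-in, not the actual PostgreSQL macro):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-in for TransactionIdPrecedes: wraparound-aware
 * modular comparison of 32-bit XIDs.
 */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) < 0;
}
```

With three XIDs roughly a third of the space apart, each one "precedes"
the next and the last "precedes" the first again, so no single sort
order can satisfy all three pairs at once.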

Mechanism:
Store FullTransactionId *fxip instead of TransactionId *xip
struct
{
    size_t xcnt;
    size_t xcnt_space;
    bool includes_all_transactions;
    FullTransactionId *fxip;    /* changed from TransactionId *xip */
} committed;

Insertion: Map each 32-bit XID to FullTransactionId using a snapshot
of nextFullXid (same logic as hot standby feedback); sort/dedup batch;
reverse-merge O(N + m)

Purge: Compute xmin_full using the same snapshot; binary search for
lower bound O(log N) + memmove suffix

/*
 * xidLogicalComparator
 *     qsort comparison function for XIDs
 *
 * This is used to compare only XIDs from the same epoch (e.g. for backends
 * running at the same time). So there must be only normal XIDs, so there's
 * no issue with triangle inequality.
 */
int
xidLogicalComparator(const void *arg1, const void *arg2)

Trade-offs:
Downcasting a logically sorted FullTransactionId array can break raw
uint32 ordering if the set spans an epoch boundary, so in the worst
case we would still need a qsort on snapshot->xip after copying XIDs
from committed.xip.

Memory/I/O: 2× bytes for committed array (4→8 bytes per entry)

Purge improves from O(N) to O(log N) plus a memmove; merge complexity is unchanged
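
Under this design the purge itself could look roughly like the
following sketch (hypothetical helper, uint64_t standing in for
FullTransactionId): a lower-bound binary search for xmin_full, then one
memmove of the surviving suffix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Remove all entries < fxmin from the sorted array fxip of length n.
 * A lower-bound binary search finds the first survivor; a single
 * memmove shifts the suffix down.  Returns the new length.
 */
static size_t
purge_older_fxids(uint64_t *fxip, size_t n, uint64_t fxmin)
{
	size_t		lo = 0;
	size_t		hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (fxip[mid] < fxmin)
			lo = mid + 1;
		else
			hi = mid;
	}

	if (lo > 0)
		memmove(fxip, fxip + lo, (n - lo) * sizeof(uint64_t));
	return n - lo;
}
```

For instance, purging with fxmin = 25 from {10, 20, 30, 40} leaves
{30, 40}: O(log N) to locate the cut point, then one O(survivors) move.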

Best,
Xuneng

Attachment Content-Type Size
profile_result.svg image/svg+xml 160.6 KB
