| From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
|---|---|
| To: | Nikolay Samokhvalov <nik(at)postgres(dot)ai> |
| Cc: | pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com> |
| Subject: | Re: xact_rollback spikes when logical walsender exits |
| Date: | 2026-04-20 16:35:34 |
| Message-ID: | CAHGQGwFDv6=Dcbf1YbGH5S7y-M4ar4-zC-c1GWco_Cn_SE0c7w@mail.gmail.com |
| Lists: | pgsql-hackers |
On Sat, Apr 18, 2026 at 12:15 AM Nikolay Samokhvalov <nik(at)postgres(dot)ai> wrote:
>
> Hi hackers,
>
> There is a bug on logical-replication publishers where every decoded
> committed transaction bumps pg_stat_database.xact_rollback.
> ReorderBufferProcessTXN() ends each decoded transaction with
> AbortCurrentTransaction() for catalog cleanup; in the walsender that
> is a top-level abort, so AtEOXact_PgStat_Database(isCommit=false)
> increments the backend-local pgStatXactRollback.
>
> The counts are flushed to shared stats on walsender exit, producing
> an acute spike. Result: for production systems with SREs on call and tight
> alerting on xact_rollback, this turns routine logical-replication operations
> (disabling a subscription, dropping a slot, walsender restart) into
> false-positive pages.
>
> Reported in [1]; also experienced at GitLab [2][3][4].
>
> Attaching a simple patch that adds a backend-local flag pgStatXactSkipCounters
> in pgstat_database.c that AtEOXact_PgStat_Database() honors to skip
> the counter bump.
>
> Added TAP test that fails on master with 5/0 and passes with the patch.
>
> If there is agreement on this shape, happy to send patches for all
> supported branches. Let me know what you think.

Thanks for the report and patch!

How to implement a solution depends on what xact_rollback in pg_stat_database
is intended to mean, so we should first consider which rollbacks it should
count. The documentation does not currently give an explicit definition.

At present, xact_rollback appears to count all rollbacks, explicit or implicit,
by any process connected to the database, including regular backends,
autovacuum workers, and logical walsenders. If that is the intended definition,
then rollbacks implicitly performed by logical walsenders during logical
replication should also be counted. Of course, even if we keep that definition,
the sudden increase in xact_rollback might still be a problem, so we might
need to call pgstat_report_stat() immediately after pgstat_flush_io() in
walsender, so the counters continue to be updated periodically during
logical replication.
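
To illustrate why periodic reporting would smooth out the spike, here is a toy model in C. It is not PostgreSQL code: ToyStats, toy_flush() and toy_walsender_run() are hypothetical names invented for this sketch, and it assumes that each decoded transaction bumps a backend-local counter exactly once and that only a flush makes the count visible.

```c
#include <assert.h>

/* Toy model only; none of these names exist in PostgreSQL. */
typedef struct ToyStats
{
    long pending_rollbacks; /* backend-local, not yet visible to monitoring */
    long shared_rollbacks;  /* stand-in for what xact_rollback would show */
    long largest_jump;      /* biggest single increase of shared_rollbacks */
} ToyStats;

/* Move the pending count into the shared counter, as a stats flush would. */
static void
toy_flush(ToyStats *s)
{
    if (s->pending_rollbacks > s->largest_jump)
        s->largest_jump = s->pending_rollbacks;
    s->shared_rollbacks += s->pending_rollbacks;
    s->pending_rollbacks = 0;
}

/*
 * Decode ntxns transactions, each ending in one abort that bumps the local
 * counter. flush_every == 0 means "flush only at exit"; otherwise flush
 * after every flush_every transactions, plus once more at exit.
 */
static ToyStats
toy_walsender_run(long ntxns, long flush_every)
{
    ToyStats s = {0, 0, 0};

    assert(ntxns >= 0 && flush_every >= 0);

    for (long i = 1; i <= ntxns; i++)
    {
        s.pending_rollbacks++;
        if (flush_every > 0 && i % flush_every == 0)
            toy_flush(&s);
    }
    toy_flush(&s); /* final flush at exit */
    return s;
}
```

In this model the total is the same either way; only the size of the largest single jump changes. Periodic flushing would therefore address the spike but not the question of whether these rollbacks should be counted at all.
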

On the other hand, your patch seems to assume a different definition: that
xact_rollback should count all explicit and implicit rollbacks, except those
performed by logical walsenders during logical replication. That would be
one possible approach, although it seems a bit odd to exclude only one subset
of rollbacks.

A third option would be to define xact_rollback more narrowly, counting only
rollbacks by regular backends, and excluding rollbacks by processes such as
autovacuum or walsender. At least in my view, xact_commit and xact_rollback
in pg_stat_database are typically used by DBAs to check whether
client transactions are committing or rolling back as expected. From
that perspective, it seems intuitive for xact_rollback to count only rollbacks
by regular backends. But others may reasonably see it differently.
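
The three candidate definitions discussed above can be compared side by side as a small predicate. This is only an illustrative sketch: the enums and toy_counts_rollback() are hypothetical names, not PostgreSQL's own backend-type machinery, and the second policy is simplified to "any rollback in a logical walsender" rather than only those performed during decoding.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical process kinds for this sketch only. */
typedef enum ToyBackendKind
{
    TOY_REGULAR_BACKEND,
    TOY_AUTOVACUUM_WORKER,
    TOY_LOGICAL_WALSENDER
} ToyBackendKind;

/* The three definitions of xact_rollback under discussion. */
typedef enum ToyPolicy
{
    TOY_COUNT_ALL,               /* apparent current behavior */
    TOY_SKIP_LOGICAL_WALSENDER,  /* what the proposed patch seems to assume */
    TOY_REGULAR_ONLY             /* the narrower third option */
} ToyPolicy;

/* Return true if a rollback in a process of this kind would be counted. */
static bool
toy_counts_rollback(ToyPolicy policy, ToyBackendKind kind)
{
    switch (policy)
    {
        case TOY_COUNT_ALL:
            return true;
        case TOY_SKIP_LOGICAL_WALSENDER:
            return kind != TOY_LOGICAL_WALSENDER;
        case TOY_REGULAR_ONLY:
            return kind == TOY_REGULAR_BACKEND;
    }
    assert(!"unknown policy");
    return false;
}
```

The second and third policies differ only in how they treat processes such as autovacuum workers, which is the point on which the definition would need to be settled.
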

Regards,
--
Fujii Masao