| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
| Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)kurilemu(dot)de>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, jian he <jian(dot)universality(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> |
| Subject: | Re: Implement waiting for wal lsn replay: reloaded |
| Date: | 2026-01-06 17:04:06 |
| Message-ID: | CABPTF7X39KknhC+xMbJgaJ1ydS9Ly5hWYF-Z5WtkcQuyPMNw-A@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
On Tue, Jan 6, 2026 at 11:58 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> Hi,
>
>
> On Tue, Jan 6, 2026 at 11:34 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> >
> > On Tue, Jan 6, 2026 at 3:12 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Jan 6, 2026 at 7:54 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > > >
> > > > On Tue, Jan 6, 2026 at 9:29 AM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> > > > > On Tue, Jan 6, 2026 at 1:43 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > > > > > Could this be causing the recent flapping failures on CI/macOS in
> > > > > > recovery/031_recovery_conflict? I didn't have time to dig personally
> > > > > > but f30848cb looks relevant:
> > > > > >
> > > > > > Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
> > > > > > error running SQL: 'psql:<stdin>:1: ERROR: canceling statement due to
> > > > > > conflict with recovery
> > > > > > DETAIL: User was or might have been using tablespace that must be dropped.'
> > > > > > while running 'psql --no-psqlrc --no-align --tuples-only --quiet
> > > > > > --dbname port=25195
> > > > > > host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
> > > > > > dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
> > > > > > FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
> > > > > > no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
> > > > > > line 2300.
> > > > > >
> > > > > > https://cirrus-ci.com/task/5771274900733952
> > > > > >
> > > > > > The master branch in time-descending order, macOS tasks only:
> > > > > >
> > > > > > task_id | substring | status
> > > > > > ------------------+-----------+-----------
> > > > > > 6460882231754752 | c970bdc0 | FAILED
> > > > > > 5771274900733952 | 6ca8506e | FAILED
> > > > > > 6217757068361728 | 63ed3bc7 | FAILED
> > > > > > 5980650261446656 | ae283736 | FAILED
> > > > > > 6585898394976256 | 5f13999a | COMPLETED
> > > > > > 4527474786172928 | 7f9acc9b | COMPLETED
> > > > > > 4826100842364928 | e8d4e94a | COMPLETED
> > > > > > 4540563027918848 | b9ee5f2d | FAILED
> > > > > > 6358528648019968 | c5af141c | FAILED
> > > > > > 5998005284765696 | e212a0f8 | COMPLETED
> > > > > > 6488580526178304 | b85d5dc0 | FAILED
> > > > > > 5034091344560128 | 7dc95cc3 | ABORTED
> > > > > > 5688692477526016 | bb048e31 | COMPLETED
> > > > > > 5481187977723904 | d351063e | COMPLETED
> > > > > > 5101831568752640 | f30848cb | COMPLETED <-- the change
> > > > > > 6395317408497664 | 3f33b63d | COMPLETED
> > > > > > 6741325208354816 | 877ae5db | COMPLETED
> > > > > > 4594007789010944 | de746e0d | COMPLETED
> > > > > > 6497208998035456 | 461b8cc9 | COMPLETED
> > > > >
> > > > > Thanks for raising this issue. I think it is related to f30848cb after
> > > > > some analysis. I'll prepare a follow-up patch to fix it.
> > > >
> > > > Sorry, I've mistakenly referenced this report from commit [1]. I
> > > > thought it was related, but it appears to be not. [1] is related to
> > > > the report I've got from Ruikai Peng off-list.
> > > >
> > > > Regarding the present failure, could it happen before ExecWaitStmt()
> > > > calls PopActiveSnapshot() and InvalidateCatalogSnapshot()? If so, we
> > > > should do preliminary efforts to release these snapshots.
> > > >
> > > > 1. https://git.postgresql.org/pg/commitdiff/bf308639bfcfa38541e24733e074184153a8ab7f
> > > >
> > >
> > > I agree that moving PopActiveSnapshot() and
> > > InvalidateCatalogSnapshot() to the very beginning of ExecWaitStmt()
> > > appears to be a sensible optimization. However, in this particular
> > > failure scenario, it may not address the issue.
> > >
> > > For tablespace conflicts, recovery conflict resolution uses
> > > GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid), which
> > > returns all active backends, regardless of their snapshot state. As a
> > > result, even if all snapshots are released at the start of
> > > ExecWaitStmt(), the session would still be canceled during replay of
> > > DROP TABLESPACE.
> >
> > GetConflictingVirtualXIDs() uses proc->xmin to detect the conflicts.
> > ExecWaitStmt() asserts MyProc->xmin == InvalidTransactionId after
> > releasing all the snapshots. I still think this happens because
> > conflict handling happens before ExecWaitStmt() manages to release the
> > snapshots.
> >
>
> I did not notice this message before. I'll look more closely at this case.

# VACUUM FREEZE, pruning those dead tuples
$node_primary->safe_psql($test_db, qq[VACUUM FREEZE $table1;]);

# Wait for attempted replay of PRUNE records
$node_primary->wait_for_replay_catchup($node_standby);

check_conflict_log(
    "User query might have needed to see row versions that must be removed");
$psql_standby->reconnect_and_clear();
check_conflict_stat("snapshot");

Yeah, this code path could be problematic for the conflict type
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT. As you suggested, I created a patch
that narrows the window in which a false conflict can be detected.
Please take a look at it too.
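
To make the two cases discussed above concrete, here is a minimal toy model of the selection rule, not the actual procarray.c code. It assumes the rule behaves as described in this thread: an invalid limit xmin (the tablespace case) matches every active backend, while a valid limit xmin (the snapshot case) matches only backends that still advertise an xmin at or before it. All names (ToyBackend, toy_backend_conflicts, toy_count_conflicts) are hypothetical, and the sketch deliberately leaves out the database filter and the wraparound-aware XID comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for this sketch only; not PostgreSQL types. */
typedef uint32_t ToyXid;
#define TOY_INVALID_XID ((ToyXid) 0)

typedef struct ToyBackend
{
    int     pid;    /* 0 means the slot is unused */
    ToyXid  xmin;   /* TOY_INVALID_XID when no snapshot is held */
} ToyBackend;

/*
 * Simplified selection rule (assumption: plain integer comparison
 * instead of a wraparound-aware one, and no database filter).
 */
static bool
toy_backend_conflicts(const ToyBackend *b, ToyXid limit_xmin)
{
    if (b->pid == 0)
        return false;           /* unused slot never conflicts */

    /* Invalid limit: every active backend is selected. */
    if (limit_xmin == TOY_INVALID_XID)
        return true;

    /* Valid limit: only backends holding an old-enough xmin. */
    return b->xmin != TOY_INVALID_XID && b->xmin <= limit_xmin;
}

/* Count how many backends in the array would be selected. */
static size_t
toy_count_conflicts(const ToyBackend *backends, size_t n, ToyXid limit_xmin)
{
    size_t  count = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (toy_backend_conflicts(&backends[i], limit_xmin))
            count++;
    }
    return count;
}
```

In this model, a backend that has already released all of its snapshots (xmin invalid) is never selected when the limit is valid, but it is still selected when the limit is invalid. That is why releasing snapshots earlier can only shrink the window for snapshot-type conflicts.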
--
Best,
Xuneng
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0001-Move-snapshot-release-to-the-beginning-of-ExecWai.patch | application/octet-stream | 4.5 KB |