BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions

From: marco(dot)nenciarini(at)2ndquadrant(dot)it
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions
Date: 2015-06-26 13:43:10
Message-ID: 20150626134310.3876.82768@wrigleys.postgresql.org
Lists: pgsql-bugs pgsql-hackers

The following bug has been logged on the website:

Bug reference: 13473
Logged by: Marco Nenciarini
Email address: marco(dot)nenciarini(at)2ndquadrant(dot)it
PostgreSQL version: 9.4.4
Operating system: all
Description:

= Symptoms

Consider a simple master -> standby setup with hot_standby_feedback
enabled. If a backend on the standby is holding the cluster xmin and the
master runs a VACUUM FREEZE in the same database that the standby's backend
is connected to, the vacuum generates a recovery conflict and the query
running on the standby is canceled.

= How to reproduce it

Run the following operations on an idle cluster.

1) connect to the standby and simulate a long running query:

select pg_sleep(3600);

2) connect to the master and run the following script:

create table t(id int primary key);
insert into t select generate_series(1, 10000);
vacuum freeze verbose t;
drop table t;

3) after 30 seconds the pg_sleep query on standby will be canceled.

= Expected output

The hot standby feedback should have prevented the query cancellation.

= Analysis

I've run postgres at DEBUG2 logging level, and I can confirm that the vacuum
correctly sees the OldestXmin propagated by the standby through the hot
standby feedback.
The issue is in the heap_xlog_freeze function, which calls
ResolveRecoveryConflictWithSnapshot as its first step, passing the
cutoff_xid value as the first argument.
The cutoff_xid is the OldestXmin that was active when the vacuum ran, so it
represents a running xid.
However, ResolveRecoveryConflictWithSnapshot expects its first argument to
be latestRemovedXid, which represents the highest xid that has actually
been removed, so there is an off-by-one error.
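
To make the off-by-one concrete, here is a minimal, self-contained model of
the conflict test. It is a sketch and not the PostgreSQL source: the names
Xid, xid_follows and snapshot_conflicts are hypothetical, and it assumes
that a standby snapshot conflicts whenever its xmin is not newer than the
xid passed as latestRemovedXid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's TransactionId. */
typedef uint32_t Xid;

/*
 * Modulo-2^32 comparison for normal xids: true if a is logically
 * newer than b. Special (reserved) xids are ignored in this sketch.
 */
bool xid_follows(Xid a, Xid b)
{
    return (int32_t) (a - b) > 0;
}

/*
 * Assumed conflict rule: a standby snapshot conflicts when its xmin
 * is not newer than the xid reported as latestRemovedXid.
 */
bool snapshot_conflicts(Xid snapshot_xmin, Xid latest_removed_xid)
{
    return !xid_follows(snapshot_xmin, latest_removed_xid);
}
```

With a standby backend whose xmin is 1000 and a freeze record carrying
cutoff_xid = 1000, snapshot_conflicts(1000, 1000) is true, so the session
is canceled even though only xids below 1000 were touched. Passing 999
instead gives snapshot_conflicts(1000, 999) == false.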

I've been able to reproduce this issue on every version of postgres since
9.0 (9.0, 9.1, 9.2, 9.3, 9.4 and current master).

= Proposed solution

In heap_xlog_freeze we need to subtract one from the value of cutoff_xid
before passing it to ResolveRecoveryConflictWithSnapshot.
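
A rough sketch of that subtraction, written as a standalone model rather
than as a patch: Xid, FIRST_NORMAL_XID, xid_retreat and
freeze_latest_removed_xid are hypothetical names. The one assumption beyond
the text above is that a plain "- 1" should skip the reserved xids 0..2
when the counter wraps around, so the result is always a normal xid.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's TransactionId. */
typedef uint32_t Xid;

/* Xids 0, 1 and 2 are reserved; 3 is the first normal xid. */
#define FIRST_NORMAL_XID ((Xid) 3)

/*
 * Step back to the previous normal xid, skipping the reserved
 * values when the 32-bit counter wraps around.
 */
Xid xid_retreat(Xid xid)
{
    assert(xid >= FIRST_NORMAL_XID);
    do {
        xid--;
    } while (xid < FIRST_NORMAL_XID);
    return xid;
}

/*
 * Value the freeze redo path would report as latestRemovedXid under
 * the proposed fix: the newest xid strictly older than cutoff_xid.
 */
Xid freeze_latest_removed_xid(Xid cutoff_xid)
{
    return xid_retreat(cutoff_xid);
}
```

For example, freeze_latest_removed_xid(1000) yields 999, and
xid_retreat(3) wraps to 4294967295 instead of landing on a reserved xid.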
