Re: [BUGS] BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions

From: Marco Nenciarini <marco(dot)nenciarini(at)2ndquadrant(dot)it>
To: pgsql-bugs(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions
Date: 2015-06-26 13:50:41
Message-ID: 558D58B1.70400@2ndquadrant.it
Lists: pgsql-bugs pgsql-hackers

On 26/06/15 15:43, marco(dot)nenciarini(at)2ndquadrant(dot)it wrote:
> The following bug has been logged on the website:
>
> Bug reference: 13473
> Logged by: Marco Nenciarini
> Email address: marco(dot)nenciarini(at)2ndquadrant(dot)it
> PostgreSQL version: 9.4.4
> Operating system: all
> Description:
>
> = Symptoms
>
> Take a simple master -> standby setup with hot_standby_feedback
> enabled. If a backend on the standby is holding the cluster xmin and
> the master runs a VACUUM FREEZE on the same database as the standby's
> backend, it will generate a conflict and the query running on the
> standby will be canceled.
>
> = How to reproduce it
>
> Run the following operations on an idle cluster.
>
> 1) Connect to the standby and simulate a long-running query:
>
> select pg_sleep(3600);
>
> 2) Connect to the master and run the following script:
>
> create table t(id int primary key);
> insert into t select generate_series(1, 10000);
> vacuum freeze verbose t;
> drop table t;
>
> 3) After 30 seconds, the pg_sleep query on the standby will be canceled.
>
> = Expected output
>
> Hot standby feedback should have prevented the query cancellation.
>
> = Analysis
>
> I've run postgres at the DEBUG2 logging level, and I can confirm that
> VACUUM correctly sees the OldestXmin propagated by the standby through
> hot standby feedback.
> The issue is in the heap_xlog_freeze function, which calls
> ResolveRecoveryConflictWithSnapshot as its first step, passing the
> cutoff_xid value as the first argument.
> The cutoff_xid is the OldestXmin in effect when the vacuum ran, so it
> represents an xid that is still considered running.
> However, ResolveRecoveryConflictWithSnapshot expects latestRemovedXid
> as its first argument, which represents the highest xid that has
> actually been removed, so there is an off-by-one error.
>
> I've been able to reproduce this issue for every version of postgres since
> 9.0 (9.0, 9.1, 9.2, 9.3, 9.4 and current master)
>
> = Proposed solution
>
> In heap_xlog_freeze, we need to subtract one from the value of
> cutoff_xid before passing it to ResolveRecoveryConflictWithSnapshot.
>
>
>

Attached a proposed patch that solves the issue.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco(dot)nenciarini(at)2ndQuadrant(dot)it | www.2ndQuadrant.it

Attachment: hs_freeze_offby1.v1.patch (text/plain, 651 bytes)
