BUG #14721: Assertion of synchronous replication

From: const_sunny(at)126(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #14721: Assertion of synchronous replication
Date: 2017-06-29 02:36:23
Message-ID: 20170629023623.1480.26508@wrigleys.postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 14721
Logged by: Const Zhang
Email address: const_sunny(at)126(dot)com
PostgreSQL version: 9.6.2
Operating system: CentOS7
Description:

Hi all!

I have found a bug in synchronous replication.
First, here is the stack trace from the core file.

(gdb) bt
#0 0x00007fe9aab2e1d7 in raise () from /lib64/libc.so.6
#1 0x00007fe9aab2f8c8 in abort () from /lib64/libc.so.6
#2 0x0000000000af0699 in ExceptionalCondition (conditionName=0xcdc111 "!(SHMQueueIsDetached(&(MyProc->syncRepLinks)))",
    errorType=0xb6c443 "FailedAssertion",
    fileName=0xcdc140 "/home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c", lineNumber=294)
    at /home/zl/workspace_pg962/postgres/src/backend/utils/error/assert.c:54
#3 0x00000000008c7e94 in SyncRepWaitForLSN (lsn=50435080, commit=1 '\001')
    at /home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c:294
#4 0x000000000056ed11 in RecordTransactionCommit ()
    at /home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:1343
#5 0x0000000000568c96 in CommitTransaction ()
    at /home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:2041
#6 0x0000000000568717 in CommitTransactionCommand ()
    at /home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:2768
#7 0x000000000092eb56 in finish_xact_command ()
    at /home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:2459
#8 0x000000000092cb37 in exec_simple_query (query_string=0x25d0de0 "insert into x values(1,3),(1,4),(1,5),(1,6);")
    at /home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:1132
#9 0x000000000092bdf0 in PostgresMain (argc=1, argv=0x257f308, dbname=0x257f168 "postgres", username=0x2550e10 "postgres")
    at /home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:4066
#10 0x0000000000879426 in BackendRun (port=0x2575650)
    at /home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:4317
#11 0x0000000000878a50 in BackendStartup (port=0x2575650)
    at /home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:3989
#12 0x000000000087509c in ServerLoop ()
    at /home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:1729
#13 0x0000000000872612 in PostmasterMain (argc=3, argv=0x254ec80)
    at /home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:1337
#14 0x0000000000795228 in main (argc=3, argv=0x254ec80)
    at /home/zl/workspace_pg962/postgres/src/backend/main/main.c:228

This assertion failure looks impossible to me, because when I print the field that the assertion checks, it is detached:

(gdb) p MyProc->syncRepLinks
$1 = {prev = 0x0, next = 0x0}

So what causes this assertion failure? To investigate, I added some debug
logging, guarded by the macro DEBUG_SUNNY, as shown below.

/*
 * Wait for synchronous replication, if requested by user.
 *
 * Initially backends start in state SYNC_REP_NOT_WAITING and then
 * change that state to SYNC_REP_WAITING before adding ourselves
 * to the wait queue. During SyncRepWakeQueue() a WALSender changes
 * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
 * This backend then resets its state to SYNC_REP_NOT_WAITING.
 *
 * 'lsn' represents the LSN to wait for. 'commit' indicates whether this LSN
 * represents a commit record. If it doesn't, then we wait only for the WAL
 * to be flushed if synchronous_commit is set to the higher level of
 * remote_apply, because only commit records provide apply feedback.
 */
void
SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
{
    char       *new_status = NULL;
    const char *old_status;
    int         mode;

    /* Cap the level for anything other than commit to remote flush only. */
    if (commit)
        mode = SyncRepWaitMode;
    else
        mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);

    /*
     * Fast exit if user has not requested sync replication, or there are no
     * sync replication standby names defined. Note that those standbys don't
     * need to be connected.
     */
    if (!SyncRepRequested() || !SyncStandbysDefined())
        return;

    Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
    Assert(WalSndCtl != NULL);

    LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
    Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);

    /*
     * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
     * set. See SyncRepUpdateSyncStandbysDefined.
     *
     * Also check that the standby hasn't already replied. Unlikely race
     * condition but we'll be fetching that cache line anyway so it's likely
     * to be a low cost check.
     */
    if (!WalSndCtl->sync_standbys_defined ||
        lsn <= WalSndCtl->lsn[mode])
    {
        LWLockRelease(SyncRepLock);
        return;
    }

    /*
     * Set our waitLSN so WALSender will know when to wake us, and add
     * ourselves to the queue.
     */
    MyProc->waitLSN = lsn;
    MyProc->syncRepState = SYNC_REP_WAITING;
    SyncRepQueueInsert(mode);
    Assert(SyncRepQueueIsOrderedByLSN(mode));
    LWLockRelease(SyncRepLock);

    /* Alter ps display to show waiting for sync rep. */
    if (update_process_title)
    {
        int         len;

        old_status = get_ps_display(&len);
        new_status = (char *) palloc(len + 32 + 1);
        memcpy(new_status, old_status, len);
        sprintf(new_status + len, " waiting for %X/%X",
                (uint32) (lsn >> 32), (uint32) lsn);
        set_ps_display(new_status, false);
        new_status[len] = '\0'; /* truncate off " waiting ..." */
    }

    /*
     * Wait for specified LSN to be confirmed.
     *
     * Each proc has its own wait latch, so we perform a normal latch
     * check/wait loop here.
     */
    for (;;)
    {
        /* Must reset the latch before testing state. */
        ResetLatch(MyLatch);

        /*
         * Acquiring the lock is not needed, the latch ensures proper
         * barriers. If it looks like we're done, we must really be done,
         * because once walsender changes the state to SYNC_REP_WAIT_COMPLETE,
         * it will never update it again, so we can't be seeing a stale value
         * in that case.
         */
        if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
            break;

        /*
         * If a wait for synchronous replication is pending, we can neither
         * acknowledge the commit nor raise ERROR or FATAL. The latter would
         * lead the client to believe that the transaction aborted, which is
         * not true: it's already committed locally. The former is no good
         * either: the client has requested synchronous replication, and is
         * entitled to assume that an acknowledged commit is also replicated,
         * which might not be true. So in this case we issue a WARNING (which
         * some clients may be able to interpret) and shut off further output.
         * We do NOT reset ProcDiePending, so that the process will die after
         * the commit is cleaned up.
         */
        if (ProcDiePending)
        {
            ereport(WARNING,
                    (errcode(ERRCODE_ADMIN_SHUTDOWN),
                     errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
                     errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
            whereToSendOutput = DestNone;
            SyncRepCancelWait();
            break;
        }

        /*
         * It's unclear what to do if a query cancel interrupt arrives. We
         * can't actually abort at this point, but ignoring the interrupt
         * altogether is not helpful, so we just terminate the wait with a
         * suitable warning.
         */
        if (QueryCancelPending)
        {
            QueryCancelPending = false;
            ereport(WARNING,
                    (errmsg("canceling wait for synchronous replication due to user request"),
                     errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
            SyncRepCancelWait();
            break;
        }

        /*
         * If the postmaster dies, we'll probably never get an
         * acknowledgement, because all the wal sender processes will exit. So
         * just bail out.
         */
        if (!PostmasterIsAlive())
        {
            ProcDiePending = true;
            whereToSendOutput = DestNone;
            SyncRepCancelWait();
            break;
        }

        /*
         * Wait on latch. Any condition that should wake us up will set the
         * latch, so no need for timeout.
         */
        WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1);
    }

#ifdef DEBUG_SUNNY
    if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
        ereport(LOG,
                (errmsg("[SUNNY] It is impossible. [lsn] %X/%X [prev] %p [next] %p",
                        (uint32) (MyProc->waitLSN >> 32), (uint32) MyProc->waitLSN,
                        MyProc->syncRepLinks.prev, MyProc->syncRepLinks.next)));
#endif

    /*
     * WalSender has checked our LSN and has removed us from queue. Clean up
     * state and leave. It's OK to reset these shared memory fields without
     * holding SyncRepLock, because any walsenders will ignore us anyway when
     * we're not on the queue.
     */
    Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
    MyProc->syncRepState = SYNC_REP_NOT_WAITING;
    MyProc->waitLSN = 0;

    if (new_status)
    {
        /* Reset ps display */
        set_ps_display(new_status, false);
        pfree(new_status);
    }
}

/*
 * Insert MyProc into the specified SyncRepQueue, maintaining sorted invariant.
 *
 * Usually we will go at tail of queue, though it's possible that we arrive
 * here out of order, so start at tail and work back to insertion point.
 */
static void
SyncRepQueueInsert(int mode)
{
    PGPROC     *proc;

    Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
    proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[mode]),
                                   &(WalSndCtl->SyncRepQueue[mode]),
                                   offsetof(PGPROC, syncRepLinks));

    while (proc)
    {
        /*
         * Stop at the queue element that we should after to ensure the queue
         * is ordered by LSN.
         */
        if (proc->waitLSN < MyProc->waitLSN)
            break;

        proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[mode]),
                                       &(proc->syncRepLinks),
                                       offsetof(PGPROC, syncRepLinks));
    }

    if (proc)
        SHMQueueInsertAfter(&(proc->syncRepLinks), &(MyProc->syncRepLinks));
    else
        SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue[mode]), &(MyProc->syncRepLinks));

#ifdef DEBUG_SUNNY
    ereport(LOG,
            (errmsg("[SUNNY] Insert [lsn] %X/%X [prev] %p [next] %p",
                    (uint32) (MyProc->waitLSN >> 32), (uint32) MyProc->waitLSN,
                    MyProc->syncRepLinks.prev, MyProc->syncRepLinks.next)));
#endif
}
230
/*
231
* Walk the specified queue from head. Set the state of any backends that
232
* need to be woken, remove them from the queue, and then wake them.
233
* Pass all = true to wake whole queue; otherwise, just wake up to
234
* the walsender's LSN.
235
*
236
* Must hold SyncRepLock.
237
*/
238
static int
239
SyncRepWakeQueue(bool all, int mode)
240
{
241
volatile WalSndCtlData *walsndctl = WalSndCtl;
242
PGPROC *proc = NULL;
243
PGPROC *thisproc = NULL;
244
int numprocs = 0;
245
246
Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
247
Assert(SyncRepQueueIsOrderedByLSN(mode));
248
249
proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
250
&(WalSndCtl->SyncRepQueue[mode]),
251
offsetof(PGPROC, syncRepLinks));
252
253
while (proc)
254
{
255
/*
256
* Assume the queue is ordered by LSN
257
*/
258
if (!all && walsndctl->lsn[mode] < proc->waitLSN)
259
return numprocs;
260
261
/*
262
* Move to next proc, so we can delete thisproc from the queue.
263
* thisproc is valid, proc may be NULL after this.
264
*/
265
thisproc = proc;
266
proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
267
&(proc->syncRepLinks),
268
offsetof(PGPROC, syncRepLinks));
269
270
/*
271
* Remove thisproc from queue.
272
*/
273
SHMQueueDelete(&(thisproc->syncRepLinks));
274
275
/*
276
* Set state to complete; see SyncRepWaitForLSN() for discussion
of
277
* the various states.
278
*/
279
thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
280
281
#ifdef DEBUG_SUNNY
282
ereport(LOG,
283
(errmsg("[SUNNY] Delete [lsn] %X/%X [prev] %p [next] %p",
284
(uint32) (thisproc->waitLSN >> 32), (uint32)
thisproc->waitLSN,
285
thisproc->syncRepLinks.prev,
thisproc->syncRepLinks.next)));
286
#endif
287
/*
288
* Wake only when we have set state and removed from queue.
289
*/
290
SetLatch(&(thisproc->procLatch));
291
292
numprocs++;
293
}
294
295
return numprocs;
296
}

Then I ran a stress test with a benchmark and got the following log output.

2017-06-28 02:51:33.868 CST,"benchmarksql","postgres",77171,"10.20.16.227:43879",595271c7.12d73,176627,"COMMIT PREPARED",2017-06-27 22:55:03 CST,769/7909,0,LOG,00000,"[SUNNY] Insert [lsn] 32/7131AB58 [prev] 0x2b614633d6a8 [next] 0x2b61446f1b18",,,,,,"COMMIT PREPARED 'T11860878'",,,"pgxc"

2017-06-28 02:51:33.868 CST,"benchmarksql","postgres",77171,"10.20.16.227:43879",595271c7.12d73,176628,"COMMIT PREPARED waiting for 32/7131AB58",2017-06-27 22:55:03 CST,769/7909,0,LOG,00000,"[SUNNY] It is impossible. [lsn] 32/7131AB58 [prev] 0x2b614633d6a8 [next] 0x2b61446f1b18",,,,,,"COMMIT PREPARED 'T11860878'",,,"pgxc"

2017-06-28 02:51:33.870 CST,"pgxcn","",52856,"10.20.16.214:43102",59522322.ce78,4758326,"streaming 32/7131AB58",2017-06-27 17:19:30 CST,1/0,0,LOG,00000,"[SUNNY] Delete [lsn] 32/7131AB58 [prev] (nil) [next] (nil)",,,,,,,,,"slave"

Note that the "Delete" log entry appears later than the "It is impossible"
entry. Under what conditions can this happen?
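
The backtrace above shows the LSN as a decimal integer (lsn=50435080), while the debug messages print it in the two-part hexadecimal form produced by the "%X/%X" format. A minimal standalone sketch of that conversion, useful for matching gdb values against log lines, follows. The names format_lsn and lsn_matches are hypothetical helpers for illustration only, not PostgreSQL functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Write lsn as "HIGH/LOW" in uppercase hex, mirroring the "%X/%X" format
 * used by the debug messages (high 32 bits, then low 32 bits).
 * Hypothetical helper, not part of PostgreSQL.
 */
void
format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    snprintf(buf, buflen, "%X/%X",
             (unsigned int) (lsn >> 32),
             (unsigned int) (lsn & 0xFFFFFFFFu));
}

/*
 * Return true if lsn, once formatted, equals the text copied from a log line.
 * Hypothetical helper, not part of PostgreSQL.
 */
bool
lsn_matches(uint64_t lsn, const char *text)
{
    char        buf[32];

    format_lsn(lsn, buf, sizeof(buf));
    return strcmp(buf, text) == 0;
}
```

For example, lsn_matches(50435080, "0/3019408") holds for the value shown in the first backtrace.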

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finally, I reproduced the bug with GDB by following these steps.
1. In the WAL sender process, set a breakpoint on the line
"SHMQueueDelete(&(thisproc->syncRepLinks));" in "SyncRepWakeQueue".

/*
 * Walk the specified queue from head. Set the state of any backends that
 * need to be woken, remove them from the queue, and then wake them.
 * Pass all = true to wake whole queue; otherwise, just wake up to
 * the walsender's LSN.
 *
 * Must hold SyncRepLock.
 */
static int
SyncRepWakeQueue(bool all, int mode)
{
    volatile WalSndCtlData *walsndctl = WalSndCtl;
    PGPROC     *proc = NULL;
    PGPROC     *thisproc = NULL;
    int         numprocs = 0;

    Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
    Assert(SyncRepQueueIsOrderedByLSN(mode));

    proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
                                   &(WalSndCtl->SyncRepQueue[mode]),
                                   offsetof(PGPROC, syncRepLinks));

    while (proc)
    {
        /*
         * Assume the queue is ordered by LSN
         */
        if (!all && walsndctl->lsn[mode] < proc->waitLSN)
            return numprocs;

        /*
         * Move to next proc, so we can delete thisproc from the queue.
         * thisproc is valid, proc may be NULL after this.
         */
        thisproc = proc;
        proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
                                       &(proc->syncRepLinks),
                                       offsetof(PGPROC, syncRepLinks));

        /*
         * Set state to complete; see SyncRepWaitForLSN() for discussion of
         * the various states.
         */
        thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;

        /*
         * Remove thisproc from queue.
         */
        SHMQueueDelete(&(thisproc->syncRepLinks));

#ifdef DEBUG_SUNNY
        ereport(LOG,
                (errmsg("[SUNNY] Delete [lsn] %X/%X [prev] %p [next] %p",
                        (uint32) (thisproc->waitLSN >> 32), (uint32) thisproc->waitLSN,
                        thisproc->syncRepLinks.prev, thisproc->syncRepLinks.next)));
#endif
        /*
         * Wake only when we have set state and removed from queue.
         */
        SetLatch(&(thisproc->procLatch));

        numprocs++;
    }

    return numprocs;
}

2. In the backend process, set a breakpoint on the line "if
(MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)" in "SyncRepWaitForLSN".

3. In psql, execute any SQL statement that generates transaction log.

[zl(at)INTEL175 ~/workspace_pg962/project]$ psql -p 8000 -U postgres
psql (9.6.2)
Type "help" for help.

postgres=# insert into x values(1,3),(1,4),(1,5),(1,6);

4. Keep the WAL sender process stopped at its breakpoint and step forward
("next") in the backend process. The backend then fails the assertion and
leaves a core file:

(gdb) bt
#0 0x00007fa769cb41d7 in raise () from /lib64/libc.so.6
#1 0x00007fa769cb58c8 in abort () from /lib64/libc.so.6
#2 0x0000000000af0699 in ExceptionalCondition (conditionName=0xcdc111 "!(SHMQueueIsDetached(&(MyProc->syncRepLinks)))",
    errorType=0xb6c443 "FailedAssertion",
    fileName=0xcdc140 "/home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c", lineNumber=294)
    at /home/zl/workspace_pg962/postgres/src/backend/utils/error/assert.c:54
#3 0x00000000008c7e94 in SyncRepWaitForLSN (lsn=50438264, commit=1 '\001')
    at /home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c:294
#4 0x000000000056ed11 in RecordTransactionCommit ()
    at /home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:1343
#5 0x0000000000568c96 in CommitTransaction ()
    at /home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:2041
#6 0x0000000000568717 in CommitTransactionCommand ()
    at /home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:2768
#7 0x000000000092eb56 in finish_xact_command ()
    at /home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:2459
#8 0x000000000092cb37 in exec_simple_query (query_string=0x1194de0 "insert into x values(1,3),(1,4),(1,5),(1,6);")
    at /home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:1132
#9 0x000000000092bdf0 in PostgresMain (argc=1, argv=0x1143308, dbname=0x1143168 "postgres", username=0x1114e10 "postgres")
    at /home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:4066
#10 0x0000000000879426 in BackendRun (port=0x1139650)
    at /home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:4317
#11 0x0000000000878a50 in BackendStartup (port=0x1139650)
    at /home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:3989
#12 0x000000000087509c in ServerLoop ()
    at /home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:1729
#13 0x0000000000872612 in PostmasterMain (argc=3, argv=0x1112c80)
    at /home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:1337
#14 0x0000000000795228 in main (argc=3, argv=0x1112c80)
    at /home/zl/workspace_pg962/postgres/src/backend/main/main.c:228
(gdb)
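
The interleaving forced by steps 1 to 4 can be replayed deterministically in a small standalone C sketch. It assumes, as in the SyncRepWakeQueue listing from step 1, that the waker sets the state before unlinking the entry, and it places the waiter's lock-free check between those two writes. All Demo* types and demo_* functions are simplified, hypothetical stand-ins, not the PostgreSQL SHM_QUEUE or PGPROC definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a doubly linked shared-memory queue link. */
typedef struct DemoQueue
{
    struct DemoQueue *prev;
    struct DemoQueue *next;
} DemoQueue;

typedef enum
{
    DEMO_NOT_WAITING,
    DEMO_WAITING,
    DEMO_WAIT_COMPLETE
} DemoState;

/* Simplified stand-in for the per-backend fields involved. */
typedef struct DemoProc
{
    DemoQueue   links;
    DemoState   state;
} DemoProc;

/* An empty circular queue points at itself. */
void demo_queue_init(DemoQueue *head) { head->prev = head->next = head; }

/* Assumed convention: a detached element has NULL links. */
bool demo_is_detached(const DemoQueue *elem) { return elem->prev == NULL; }

void demo_insert_after(DemoQueue *pos, DemoQueue *elem)
{
    elem->prev = pos;
    elem->next = pos->next;
    pos->next->prev = elem;
    pos->next = elem;
}

void demo_delete(DemoQueue *elem)
{
    assert(!demo_is_detached(elem));
    elem->prev->next = elem->next;
    elem->next->prev = elem->prev;
    elem->prev = elem->next = NULL;
}

/*
 * Replay the waker's two writes with the waiter's check in between.
 * state_first = true:  set state, (waiter looks), then unlink.
 * state_first = false: unlink, (waiter looks), then set state.
 * Returns true if the waiter saw WAIT_COMPLETE while still linked,
 * which is the situation the failed assertion reports.
 */
bool demo_waiter_sees_attached(bool state_first)
{
    DemoQueue   head;
    DemoProc    proc = {{NULL, NULL}, DEMO_NOT_WAITING};
    bool        saw_complete;
    bool        attached;

    demo_queue_init(&head);
    proc.state = DEMO_WAITING;
    demo_insert_after(&head, &proc.links);

    /* Waker, first write (the breakpoint in step 1 stops after this). */
    if (state_first)
        proc.state = DEMO_WAIT_COMPLETE;
    else
        demo_delete(&proc.links);

    /* Waiter checks without holding any lock (step 2 and step 4). */
    saw_complete = (proc.state == DEMO_WAIT_COMPLETE);
    attached = !demo_is_detached(&proc.links);

    /* Waker, second write. */
    if (state_first)
        demo_delete(&proc.links);
    else
        proc.state = DEMO_WAIT_COMPLETE;

    return saw_complete && attached;
}
```

Here demo_waiter_sees_attached(true) reports the inconsistent view (state complete, entry still linked), while the delete-first order (false) does not expose it in this single-threaded replay. In real concurrent code the two writes would additionally need to be ordered with a memory barrier, which this sketch does not model.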

Is this a bug? And how can it be solved? Looking forward to your reply.
Thanks!
Sorry about my poor English O(∩_∩)O

Const Sunny
2017-6-29
