From: | "K S, Sandhya (Nokia - IN/Bangalore)" <sandhya(dot)k_s(at)nokia(dot)com> |
---|---|
To: | 'Merlin Moncure' <mmoncure(at)gmail(dot)com> |
Cc: | "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>, "'tgl(at)sss(dot)pgh(dot)pa(dot)us'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'pgsql-bugs(at)postgresql(dot)org'" <pgsql-bugs(at)postgresql(dot)org>, "Itnal, Prakash (Nokia - IN/Bangalore)" <prakash(dot)itnal(at)nokia(dot)com>, "T, Rasna (Nokia - IN/Bangalore)" <rasna(dot)t(at)nokia(dot)com> |
Subject: | Re: Crash observed during the start of the Postgres process |
Date: | 2017-05-12 13:42:25 |
Message-ID: | AM5PR0701MB26425A2C55EA46164D5EDEDFD6E20@AM5PR0701MB2642.eurprd07.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs pgsql-hackers |
Hi Merlin,
I have filtered the logs based on PID (19825) to see if this helps to debug the issue further. Here are some observations:
1. Though the restoration is complete (0000001C000000000000012F.done file created), we see some error logs related to restoration failure as below:
1: [19825-58f473ff.4d71-1] 2017-04-17 07:51:27.919 GMT < > LOG: 00000: database system was interrupted while in recovery at log time 2017-04-16 20:09:38 GMT
2: [19825-58f473ff.4d71-2] 2017-04-17 07:51:27.919 GMT < > HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
3: [19825-58f473ff.4d71-3] 2017-04-17 07:51:27.919 GMT < > LOCATION: StartupXLOG, xlog.c:4988
4: [19825-58f473ff.4d71-4] 2017-04-17 07:51:28.111 GMT < > DEBUG: 00000: standby_mode = 'true'
5: [19825-58f473ff.4d71-5] 2017-04-17 07:51:28.111 GMT < > LOCATION: readRecoveryCommandFile, xlog.c:4312
6: [19825-58f473ff.4d71-6] 2017-04-17 07:51:28.111 GMT < > DEBUG: 00000: primary_conninfo = 'host=CFPU-1 port=5433 user=_qnrnwdbman'
7: [19825-58f473ff.4d71-7] 2017-04-17 07:51:28.111 GMT < > LOCATION: readRecoveryCommandFile, xlog.c:4319
8: [19825-58f473ff.4d71-8] 2017-04-17 07:51:28.112 GMT < > DEBUG: 00000: trigger_file = '/mnt/DB_name/db_data/trigger_file'
9: [19825-58f473ff.4d71-9] 2017-04-17 07:51:28.112 GMT < > LOCATION: readRecoveryCommandFile, xlog.c:4326
10: [19825-58f473ff.4d71-10] 2017-04-17 07:51:28.112 GMT < > DEBUG: 00000: restore_command = 'exit 1'
11: [19825-58f473ff.4d71-11] 2017-04-17 07:51:28.112 GMT < > LOCATION: readRecoveryCommandFile, xlog.c:4186
12: [19825-58f473ff.4d71-12] 2017-04-17 07:51:28.112 GMT < > LOG: 00000: entering standby mode
13: [19825-58f473ff.4d71-13] 2017-04-17 07:51:28.112 GMT < > LOCATION: StartupXLOG, xlog.c:5057
14: [19825-58f473ff.4d71-14] 2017-04-17 07:51:28.112 GMT < > DEBUG: 00000: executing restore command "exit 1"
15: [19825-58f473ff.4d71-15] 2017-04-17 07:51:28.112 GMT < > LOCATION: RestoreArchivedFile, xlogarchive.c:201
16: [19825-58f473ff.4d71-16] 2017-04-17 07:51:28.133 GMT < > DEBUG: 00000: could not restore file "0000001C.history" from archive: return code 256
17: [19825-58f473ff.4d71-17] 2017-04-17 07:51:28.133 GMT < > LOCATION: RestoreArchivedFile, xlogarchive.c:304
18: [19825-58f473ff.4d71-18] 2017-04-17 07:51:28.134 GMT < > DEBUG: 00000: executing restore command "exit 1"
19: [19825-58f473ff.4d71-19] 2017-04-17 07:51:28.134 GMT < > LOCATION: RestoreArchivedFile, xlogarchive.c:201
20: [19825-58f473ff.4d71-20] 2017-04-17 07:51:28.156 GMT < > DEBUG: 00000: could not restore file "0000001C000000000000012F" from archive: return code 256
21: [19825-58f473ff.4d71-21] 2017-04-17 07:51:28.156 GMT < > LOCATION: RestoreArchivedFile, xlogarchive.c:304
22: [19825-58f473ff.4d71-22] 2017-04-17 07:51:28.156 GMT < > DEBUG: 00000: checkpoint record is at 0/4BDE6A88
23: [19825-58f473ff.4d71-23] 2017-04-17 07:51:28.156 GMT < > LOCATION: StartupXLOG, xlog.c:5182
24: [19825-58f473ff.4d71-24] 2017-04-17 07:51:28.156 GMT < > DEBUG: 00000: redo record is at 0/4BC2A998; shutdown FALSE
25: [19825-58f473ff.4d71-25] 2017-04-17 07:51:28.156 GMT < > LOCATION: StartupXLOG, xlog.c:5259
26: [19825-58f473ff.4d71-26] 2017-04-17 07:51:28.156 GMT < > DEBUG: 00000: next transaction ID: 0/6252; next OID: 71601
27: [19825-58f473ff.4d71-27] 2017-04-17 07:51:28.156 GMT < > LOCATION: StartupXLOG, xlog.c:526
1. Also, we see the below highlighted logs repeated multiple (count=7984) times during the DB operation. Can you please explain what these logs mean from Postgres perspective?
46: [19825-58f473ff.4d71-46] 2017-04-17 07:51:28.189 GMT < > DEBUG: 00000: record known xact 6252 latestObservedXid 6251
47: [19825-58f473ff.4d71-47] 2017-04-17 07:51:28.189 GMT < > CONTEXT: xlog redo AccessExclusive locks: xid 6252 db 16386 rel 16899
48: [19825-58f473ff.4d71-48] 2017-04-17 07:51:28.189 GMT < > LOCATION: RecordKnownAssignedTransactionIds, procarray.c:2673
49: [19825-58f473ff.4d71-49] 2017-04-17 07:51:28.190 GMT < > DEBUG: 00000: record known xact 6252 latestObservedXid 6252
50: [19825-58f473ff.4d71-50] 2017-04-17 07:51:28.190 GMT < > CONTEXT: xlog redo AccessExclusive locks: xid 6252 db 16386 rel 16941
51: [19825-58f473ff.4d71-51] 2017-04-17 07:51:28.190 GMT < > LOCATION: RecordKnownAssignedTransactionIds, procarray.c:2673
1. After these repeated messages, crash is observed with the below log message:
24491: [19825-58f473ff.4d71-24036] 2017-04-17 07:51:30.196 GMT < > CONTEXT: xlog redo insert: rel 1663/16386/11691; tid 13/266
24492: [19825-58f473ff.4d71-24037] 2017-04-17 07:51:30.196 GMT < > LOCATION: RecordKnownAssignedTransactionIds, procarray.c:2673
24493: [19825-58f473ff.4d71-24038] 2017-04-17 07:51:30.196 GMT < > DEBUG: 00000: record known xact 6271 latestObservedXid 6271
24494: [19825-58f473ff.4d71-24039] 2017-04-17 07:51:30.196 GMT < > CONTEXT: xlog redo delete: index 1663/16386/11692; iblk 9, heap 1663/16386/11689;
24495: [19825-58f473ff.4d71-24040] 2017-04-17 07:51:30.196 GMT < > LOCATION: RecordKnownAssignedTransactionIds, procarray.c:2673
24496: [19825-58f473ff.4d71-24041] 2017-04-17 07:51:30.196 GMT < > PANIC: XX000: btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data
24497: [19825-58f473ff.4d71-24042] 2017-04-17 07:51:30.196 GMT < > CONTEXT: xlog redo delete: index 1663/16386/11692; iblk 9, heap 1663/16386/11689;
24498: [19825-58f473ff.4d71-24043] 2017-04-17 07:51:30.196 GMT < > LOCATION: btree_xlog_delete_get_latestRemovedXid, nbtxlog.c:634
24765: [19810-58f473ff.4d62-191] 2017-04-17 07:51:33.348 GMT < > LOG: 00000: startup process (PID 19825) was terminated by signal 6: Aborted
I have attached the logfiles related to PID 19825, which can provide more insight on the observations explained as above.
Regards,
Sandhya
-----Original Message-----
From: K S, Sandhya (Nokia - IN/Bangalore)
Sent: Tuesday, April 25, 2017 8:24 PM
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org; tgl(at)sss(dot)pgh(dot)pa(dot)us; pgsql-bugs(at)postgresql(dot)org; Itnal, Prakash (Nokia - IN/Bangalore) <prakash(dot)itnal(at)nokia(dot)com>; T, Rasna (Nokia - IN/Bangalore) <rasna(dot)t(at)nokia(dot)com>
Subject: RE: [BUGS] Crash observed during the start of the Postgres process
Hi Merlin,
Below is the log captured when the crash was encountered.
STATEMENT: select count(1) from pg_ls_dir(current_setting('data_directory')) where pg_ls_dir = 'backup_label'
LOG: 00000: duration: 4.083 ms
LOCATION: exec_simple_query, postgres.c:1145
DEBUG: 00000: shmem_exit(0): 7 callbacks to make
LOCATION: shmem_exit, ipc.c:212
DEBUG: 00000: proc_exit(0): 3 callbacks to make
LOCATION: proc_exit_prepare, ipc.c:184
DEBUG: 00000: exit(0)
LOCATION: proc_exit, ipc.c:135
DEBUG: 00000: shmem_exit(-1): 0 callbacks to make
LOCATION: shmem_exit, ipc.c:212
DEBUG: 00000: proc_exit(-1): 0 callbacks to make
LOCATION: proc_exit_prepare, ipc.c:184
DEBUG: 00000: reaping dead processes
LOCATION: reaper, postmaster.c:2669
DEBUG: 00000: server process (PID 11104) exited with exit code 0
LOCATION: LogChildExit, postmaster.c:3385
DEBUG: 00000: reaping dead processes
LOCATION: reaper, postmaster.c:2669
LOG: 00000: startup process (PID 10265) was terminated by signal 6: Aborted
LOCATION: LogChildExit, postmaster.c:3407
LOG: 00000: terminating any other active server processes
LOCATION: HandleChildCrash, postmaster.c:3134
DEBUG: 00000: sending SIGQUIT to process 10994
LOCATION: HandleChildCrash, postmaster.c:3233
DEBUG: 00000: sending SIGQUIT to process 10269
LOCATION: HandleChildCrash, postmaster.c:3263
DEBUG: 00000: sending SIGQUIT to process 10268
LOCATION: HandleChildCrash, postmaster.c:3275
WARNING: 57P02: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
Backtrace of the core generated:
(gdb) bt
#0 0x0000005563bcf9c0 in raise () from /lib64/libc.so.6
#1 0x0000005563bd42bc in abort () from /lib64/libc.so.6
#2 0x000000012039e228 in errfinish ()
#3 0x000000012039ef08 in elog_finish ()
#4 0x000000012009eb08 in btree_redo ()
#5 0x00000001200caff8 in StartupXLOG ()
#6 0x0000000120259958 in StartupProcessMain ()
#7 0x00000001200d590c in AuxiliaryProcessMain ()
#8 0x0000000120253434 in ?? ()
Let me know if any further clarification/information is needed.
Regards,
Sandhya
-----Original Message-----
From: Merlin Moncure [mailto:mmoncure(at)gmail(dot)com]
Sent: Tuesday, April 25, 2017 8:01 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya(dot)k_s(at)nokia(dot)com<mailto:sandhya(dot)k_s(at)nokia(dot)com>>
Cc: pgsql-hackers(at)postgresql(dot)org<mailto:pgsql-hackers(at)postgresql(dot)org>; tgl(at)sss(dot)pgh(dot)pa(dot)us<mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us>; pgsql-bugs(at)postgresql(dot)org<mailto:pgsql-bugs(at)postgresql(dot)org>; Itnal, Prakash (Nokia - IN/Bangalore) <prakash(dot)itnal(at)nokia(dot)com<mailto:prakash(dot)itnal(at)nokia(dot)com>>; T, Rasna (Nokia - IN/Bangalore) <rasna(dot)t(at)nokia(dot)com<mailto:rasna(dot)t(at)nokia(dot)com>>
Subject: Re: [BUGS] Crash observed during the start of the Postgres process
On Tue, Apr 25, 2017 at 8:44 AM, K S, Sandhya (Nokia - IN/Bangalore)
<sandhya(dot)k_s(at)nokia(dot)com<mailto:sandhya(dot)k_s(at)nokia(dot)com>> wrote:
> Hello,
>
> Did you get a chance to take a look into the issue?
>
> Please consider it with high priority. We will be awaiting your inputs.
This email is heavily cross posted, which is obnoxious. Can you paste
the relevant log snippet?
merlin
Attachment | Content-Type | Size |
---|---|---|
postgresql-2017-04-17_075127.log | application/octet-stream | 3.2 MB |
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2017-05-12 14:20:02 | Re: [HACKERS] Concurrent ALTER SEQUENCE RESTART Regression |
Previous Message | sebastian.calbaza | 2017-05-12 13:00:33 | BUG #14648: counts for queries using array unnesting is incorrect |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2017-05-12 13:44:12 | Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur |
Previous Message | Peter Eisentraut | 2017-05-12 13:19:27 | Re: Addition of pg_dump --no-publications |