Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

From: "Imseih (AWS), Sami" <simseih(at)amazon(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "alvherre(at)alvh(dot)no-ip(dot)org" <alvherre(at)alvh(dot)no-ip(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Date: 2022-08-08 15:51:30
Message-ID: 6A726564-E016-4305-97AB-F0E825B3B53C@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> The server seem to have started as a standby after crashing a
> primary. Is it correct?

Yes, that is correct.

2022-08-05 17:18:51 UTC::@:[359]:LOG: database system was interrupted; last known up at 2022-08-05 17:08:52 UTC
2022-08-05 17:18:51 UTC::@:[359]:LOG: creating missing WAL directory "pg_wal/archive_status"
recovering 00000002.history
2022-08-05 17:18:51 UTC::@:[359]:LOG: entering standby mode
2022-08-05 17:18:51 UTC::@:[359]:LOG: ### [S] @0/48A3400: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: checkpoint record is at 0/48A3388
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: redo record is at 0/48A3388; shutdown true
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: next transaction ID: 533; next OID: 16395
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: next MultiXactId: 1; next MultiXactOffset: 0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: oldest unfrozen transaction ID: 479, in database 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: oldest MultiXactId: 1, in database 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: commit timestamp Xid oldest/newest: 0/0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: transaction ID wrap limit is 2147484126, limited by database with OID 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: starting up replication slots
2022-08-05 17:18:51 UTC::@:[359]:LOG: database system was not properly shut down; automatic recovery in progress
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: resetting unlogged relations: cleanup 1 init 0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: initializing for hot standby
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: recovery snapshots are now enabled
2022-08-05 17:18:51 UTC::@:[359]:LOG: redo starts at 0/48A3400
2022-08-05 17:18:51 UTC::@:[359]:LOG: ### [F] @0/10000000: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG: reached end of WAL in pg_wal, entering archive recovery
2022-08-05 17:18:51 UTC::@:[359]:LOG: consistent recovery state reached at 0/10000000
2022-08-05 17:18:51 UTC::@:[357]:LOG: database system is ready to accept read only connections

> Archive recovery ended here. The server should have promoted that
> time.. Do you see some interesting log lines around this time?

The server did promote around that time

2022-08-05 18:38:14 UTC::@:[359]:LOG: ### [F] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=1, SbyModeReq=1
2022-08-05 18:38:14 UTC::@:[359]:LOG: received promote request
2022-08-05 18:38:14 UTC::@:[359]:LOG: redo done at 6/B6CB27A8
2022-08-05 18:38:14 UTC::@:[359]:LOG: last completed transaction was at log time 2022-08-05 18:38:00.832047+00
recovering 00000001000000060000002D
....
........
2022-08-05 18:38:14 UTC::@:[359]:LOG: ### [S] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
recovering 00000002.history
2022-08-05 18:38:14 UTC::@:[359]:LOG: selected new timeline ID: 2
2022-08-05 18:38:14 UTC::@:[359]:LOG: archive recovery complete
recovering 00000001.history
2022-08-05 18:38:14 UTC::@:[357]:LOG: database system is ready to accept connections
2022-08-05 18:38:15 UTC::@:[367]:LOG: restartpoint complete: wrote 21388 buffers (2.1%); 0 WAL file(s) added, 9 removed, 0 recycled; write=98.394 s, sync=0.041 s, total=98.586 s; sync files=46, longest=0.012 s, average=0.001 s; distance=1048565 kB, estimate=1048584 kB
2022-08-05 18:38:15 UTC::@:[367]:LOG: recovery restart point at 6/5C0003B0

> But, recovery continues in non-standby mode. I don't see how come it
> behaves that way.

But the server crashes sometime after which is why recovery starts.

022-08-05 18:50:13 UTC::@:[357]:LOG: listening on IPv4 address "0.0.0.0", port 5432
2022-08-05 18:50:13 UTC::@:[357]:LOG: listening on IPv6 address "::", port 5432
2022-08-05 18:50:13 UTC::@:[357]:LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-08-05 18:50:13 UTC::@:[359]:LOG: database system was interrupted; last known up at 2022-08-05 18:38:15 UTC
2022-08-05 18:50:13 UTC::@:[359]:LOG: creating missing WAL directory "pg_wal/archive_status"
recovering 00000003.history
2022-08-05 18:50:13 UTC::@:[359]:LOG: entering standby mode
recovering 00000002.history
2022-08-05 18:50:13 UTC::@:[359]:LOG: ### [S] @6/B8000198: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 18:50:13 UTC::@:[359]:LOG: database system was not properly shut down; automatic recovery in progress
2022-08-05 18:50:13 UTC::@:[359]:LOG: redo starts at 6/B80000E8

And a few hours later, is when we see a panic

Thanks

--
Sami Imseih
Amazon Web Services (AWS)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Stark 2022-08-08 15:53:19 Re: shared-memory based stats collector - v70
Previous Message Jonathan S. Katz 2022-08-08 15:50:16 2022-08-11 release announcement draft