Re: Standby corruption after master is restarted

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: emre(at)hasegeli(dot)com, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Cc: gurkan(dot)gur(at)innogames(dot)com, david(dot)pusch(at)innogames(dot)com, patrick(dot)schmidt(at)innogames(dot)com
Subject: Re: Standby corruption after master is restarted
Date: 2018-04-14 17:38:49
Message-ID: ce06163c-58ed-5dda-ea5c-138c86b62132@2ndquadrant.com
Lists: pgsql-bugs pgsql-hackers

Hi Emre,

On 03/28/2018 07:50 PM, Emre Hasegeli wrote:
> We experienced this issue again, this time on production. The primary
> instance was in a loop of being killed by Linux OOM-killer and being
> restarted in 1 minute intervals. The corruption only happened on one
> of the two standbys. The primary and the other standby have no
> problems. Only the primary was killed and restarted, the standbys were not.
> There weren't any unusual settings, "fsync" was not disabled. Here is
> the information I collected.
>

I've been trying to reproduce this by running a master with a couple of
replicas, and randomly restarting the master (while pgbench is running).
But so far no luck, so I guess something else is required to reproduce
the behavior ...
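
A loop of this kind might look roughly like the sketch below. It is not the script used for the attempt described above, just an illustration under stated assumptions: the master's data directory is passed in as `datadir`, `pg_ctl` is on the PATH, pgbench is started separately, and the restart is modelled as SIGKILL on the postmaster (the PID is read from the first line of `postmaster.pid`). The helper names and the wait parameters are hypothetical.

```python
import os
import random
import signal
import subprocess
import time
from pathlib import Path


def read_postmaster_pid(datadir):
    """Return the postmaster PID, which is the first line of postmaster.pid."""
    first_line = Path(datadir, "postmaster.pid").read_text().splitlines()[0]
    return int(first_line.strip())


def start_command(datadir):
    """Build a pg_ctl call that starts the server and waits until it is up."""
    return ["pg_ctl", "start", "-D", datadir, "-w"]


def crash_and_restart(datadir):
    """SIGKILL the postmaster (assumption: like 'kill -9'), then start it again."""
    os.kill(read_postmaster_pid(datadir), signal.SIGKILL)
    time.sleep(1)  # assumption: give the orphaned backends a moment to exit
    subprocess.run(start_command(datadir), check=True)


def run(datadir, rounds, min_wait, max_wait):
    """Crash the master 'rounds' times at random intervals (in seconds)."""
    for _ in range(rounds):
        time.sleep(random.uniform(min_wait, max_wait))
        crash_and_restart(datadir)
```

Calling `run("/path/to/master", 20, 30, 90)` would crash the master twenty times, 30 to 90 seconds apart. Replacing `crash_and_restart` with a `pg_ctl restart` call would model a clean restart instead.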

> The logs at the time standby broke:
>
>> 2018-03-28 14:00:30 UTC [3693-67] LOG: invalid resource manager ID 39 at 1DFB/D43BE688
>> 2018-03-28 14:00:30 UTC [25347-1] LOG: started streaming WAL from primary at 1DFB/D4000000 on timeline 5
>> 2018-03-28 14:00:59 UTC [3748-357177] LOG: restartpoint starting: time
>> 2018-03-28 14:01:23 UTC [25347-2] FATAL: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
>> 2018-03-28 14:01:24 UTC [3693-68] FATAL: invalid memory alloc request size 1916035072
>
> And from the next try:
>
>> 2018-03-28 14:02:15 UTC [26808-5] LOG: consistent recovery state reached at 1DFB/D6BDDFF8
>> 2018-03-28 14:02:15 UTC [26808-6] FATAL: invalid memory alloc request size 1916035072
>

In the initial report (from August 2017) you shared pg_xlogdump output,
showing that the corrupted WAL record is an FPI_FOR_HINT right after
CHECKPOINT_SHUTDOWN. Was it the same case this time?

BTW which versions are we talking about? I see the initial report
mentioned catversion 201608131, this one mentions 201510051, so I'm
guessing 9.6 and 9.5. Which minor versions?
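
The guess above amounts to a simple lookup from catalog version to major release. A minimal sketch, containing only the two catversion values mentioned in this thread (the table and the function name are illustrative, and the mapping says nothing about the minor version):

```python
# Catalog version numbers quoted in this thread, mapped to the major
# release guessed above. Only these two entries come from the thread;
# the table is deliberately incomplete.
CATVERSION_TO_MAJOR = {
    201608131: "9.6",
    201510051: "9.5",
}


def guess_major_version(catversion):
    """Return the major release for a known catversion, else 'unknown'."""
    return CATVERSION_TO_MAJOR.get(catversion, "unknown")
```

The catalog version does not change between minor releases of the same major version, so the minor version still has to come from the reporter (for example from `SELECT version()`).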

Is the master under load (accepting writes) before shutdown?

How was it restarted, actually? I see you're mentioning OOM killer, so I
guess "kill -9". What about the first report - was it the same case, or
was it restarted "nicely" using pg_ctl?

Could the replica receive the WAL in some other way - say, from a WAL
archive? What archive/restore commands do you use?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
