Re: [COMMITTERS] pgsql: Perform only one ReadControlFile() during startup.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [COMMITTERS] pgsql: Perform only one ReadControlFile() during startup.
Date: 2017-09-16 14:32:29
Message-ID: 14134.1505572349@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> Perform only one ReadControlFile() during startup.

This patch or something closely related to it has broken the postmaster's
ability to recover from a backend crash. For example, after exercising
the backend crash Andreas just reported:

regression=# select from information_schema.user_mapping_options;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q

attempting to reconnect fails, because the postmaster isn't there.
It left a core file behind though, in which I find

Program terminated with signal 11, Segmentation fault.
#0  0x000000000088b792 in GetMemoryChunkContext (pointer=0x7fdb36fc3f00)
    at ../../../../src/include/utils/memutils.h:124
124             AssertArg(MemoryContextIsValid(context));
(gdb) bt
#0  0x000000000088b792 in GetMemoryChunkContext (pointer=0x7fdb36fc3f00)
    at ../../../../src/include/utils/memutils.h:124
#1  pfree (pointer=0x7fdb36fc3f00) at mcxt.c:951
#2  0x0000000000512843 in XLOGShmemInit () at xlog.c:4897
#3  0x0000000000737fd9 in CreateSharedMemoryAndSemaphores (
    makePrivate=0 '\000', port=5440) at ipci.c:220
#4  0x00000000006e4a78 in reset_shared () at postmaster.c:2516
#5  PostmasterStateMachine () at postmaster.c:3832
#6  0x00000000006e541d in reaper (postgres_signal_arg=<value optimized out>)
    at postmaster.c:3081
#7  <signal handler called>
#8  0x0000003b78ae1603 in __select_nocancel ()
    at ../sysdeps/unix/syscall-template.S:82
#9  0x00000000008a432a in pg_usleep (microsec=<value optimized out>)
    at pgsleep.c:56
#10 0x00000000006e75d7 in ServerLoop (argc=<value optimized out>,
    argv=<value optimized out>) at postmaster.c:1705
#11 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
    at postmaster.c:1364

It's dying at "pfree(localControlFile)". localControlFile seems to
be pointing at a region of memory that's entirely zeroes; certainly
the data that it just moved into shared memory is all zeroes.
It looks like someone didn't think hard enough about when to reset
ControlFile to null.
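The diagnosis above is a stale-pointer problem: a pointer that is supposed to refer to a one-time local copy of the control file is still non-NULL when shared memory is rebuilt after a crash, so the reinitialization path frees something that is no longer a live heap copy. The C sketch below illustrates that hazard in isolation. All names in it (`DemoControlData`, `local_control`, `shared_control`, `read_control_file`, `shmem_init`) are hypothetical; it uses `malloc`/`free` and a `calloc`'d buffer in place of `palloc`/`pfree` and real shared memory, and it is not the actual xlog.c code or the committed fix.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the real control-file struct. */
typedef struct DemoControlData
{
    unsigned long system_identifier;
    int           state;
} DemoControlData;

/* Heap copy read once at startup; must be NULL once it has been consumed. */
static DemoControlData *local_control = NULL;

/* Pointer into the current simulated "shared memory" segment. */
static DemoControlData *shared_control = NULL;

/* Stand-in for reading the control file from disk exactly once. */
static void
read_control_file(void)
{
    local_control = malloc(sizeof(DemoControlData));
    if (local_control == NULL)
    {
        perror("malloc");
        exit(EXIT_FAILURE);
    }
    local_control->system_identifier = 6465;
    local_control->state = 1;
}

/*
 * Build a fresh simulated segment. On first start the heap copy is moved in
 * and freed; on a simulated crash-restart there is no heap copy, so this
 * sketch carries the previous segment's contents forward instead.
 */
static void
shmem_init(void)
{
    DemoControlData *old = shared_control;
    DemoControlData *seg = calloc(1, sizeof(DemoControlData));

    if (seg == NULL)
    {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

    if (local_control != NULL)
    {
        memcpy(seg, local_control, sizeof(DemoControlData));
        free(local_control);
        /*
         * Without this reset, the next call would copy from freed memory and
         * free it a second time: a stale-pointer free, analogous to the
         * pfree crash in the backtrace above.
         */
        local_control = NULL;
    }
    else if (old != NULL)
        memcpy(seg, old, sizeof(DemoControlData));

    free(old);                  /* discard the previous segment, if any */
    shared_control = seg;
    assert(local_control == NULL);
}
```

Calling read_control_file() once and then shmem_init() twice (initial start, then a simulated crash-restart) leaves shared_control->system_identifier at 6465 both times instead of freeing a dangling pointer. Carrying the old segment's contents forward is only this sketch's choice; how the postmaster actually repopulates the control data after a crash-restart is not shown here.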

regards, tom lane