Re: Cluster seems broken after pg_basebackup

From: Guillaume Drolet <droletguillaume(at)gmail(dot)com>
To: Guillaume Lelarge <guillaume(at)lelarge(dot)info>, pgsql-general(at)postgresql(dot)org
Subject: Re: Cluster seems broken after pg_basebackup
Date: 2015-02-09 17:40:15
Message-ID: CAOkiyv7CLC=HhUqoegxKn4H_4vnUhgOYz5PHZVLNV-C5zJaKhg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I tried starting the cluster again. Once again everything looked fine at
the start (first three lines of this log, in English this time):

2015-02-09 11:40:55 EST LOG: database system was shut down at 2015-02-06
09:50:21 EST
2015-02-09 11:40:55 EST LOG: database system is ready to accept connections
2015-02-09 11:40:55 EST LOG: autovacuum launcher started

Since it seemed to work, I opened the terminal and tried connecting to the
database:

C:\Users\admlocal>psql -U postgres -d mortalite

So far so good, I got a connection:

psql (9.3.5)
Attention : l'encodage console (850) diffère de l'encodage Windows (1252).
Les caractères 8 bits peuvent ne pas fonctionner correctement.
Voir la section « Notes aux utilisateurs de Windows » de la page
référence de psql pour les détails.
Saisissez « help » pour l'aide.

mortalite=#

I tried the help command and it worked:

mortalite=# help
Vous utilisez psql, l'interface en ligne de commande de PostgreSQL.
Saisissez:
\copyright pour les termes de distribution
\h pour l'aide-mémoire des commandes SQL
\? pour l'aide-mémoire des commandes psql
\g ou point-virgule en fin d'instruction pour exécuter la requête
\q pour quitter

But then when I tried to query the db, it crashed:

mortalite=# \dt
la connexion au serveur a été coupée de façon inattendue
Le serveur s'est peut-être arrêté anormalement avant ou durant le
traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
!>

And here's the rest of the log file after the crash:

2015-02-09 12:29:19 EST LOG: server process (PID 2240) was terminated by
exception 0xC0000005
2015-02-09 12:29:19 EST DETAIL: Failed process was running: SELECT
n.nspname as "Schema",
c.relname as "Name",
CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm'
THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence'
WHEN 's' THEN 'special' WHEN 'f' THEN 'foreign table' END as "Type",
pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','')
AND n.nspname <> 'pg_catalog'
AND n.nspname <> 'information_schema'
AND n.nspname !~ '^pg_toast'
AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1,2;
2015-02-09 12:29:19 EST HINT: See C include file "ntstatus.h" for a
description of the hexadecimal value.
2015-02-09 12:29:19 EST LOG: terminating any other active server processes
2015-02-09 12:29:19 EST WARNING: terminating connection because of crash
of another server process
2015-02-09 12:29:19 EST DETAIL: The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2015-02-09 12:29:19 EST HINT: In a moment you should be able to reconnect
to the database and repeat your command.
2015-02-09 12:29:19 EST LOG: archiver process (PID 4576) exited with exit
code 1
2015-02-09 12:29:19 EST LOG: all server processes terminated;
reinitializing
2015-02-09 12:29:29 EST FATAL: pre-existing shared memory block is still
in use
2015-02-09 12:29:29 EST HINT: Check if there are any old server processes
still running, and terminate them.

According to this page
<https://msdn.microsoft.com/en-ca/library/cc704588.aspx>, exception
0xC0000005 means STATUS_ACCESS_VIOLATION - The instruction at 0x%08lx
referenced memory at 0x%08lx. The memory could not be %s. This is not of
much help to me.

I hope these additional bits of information can help someone figuring out a
solution to get my cluster up and running again.

PS. I was thinking of reinstalling PGSQL over my current install but
keeping my PGDATA. I've done it in the past for fixing problems with
starting the service and it worked. What do you think?

Cheers,

Guillaume

2015-02-09 11:37 GMT-05:00 Guillaume Drolet <droletguillaume(at)gmail(dot)com>:

>
>
> 2015-02-07 1:24 GMT-05:00 Guillaume Lelarge <guillaume(at)lelarge(dot)info>:
>
>> Le 6 févr. 2015 17:31, "Adrian Klaver" <adrian(dot)klaver(at)aklaver(dot)com> a
>> écrit :
>> >
>> > On 02/06/2015 05:03 AM, Guillaume Drolet wrote:
>> >>
>> >> Hi,
>> >>
>> >> Yesterday I ran a pg_basebackup of my cluster. Since it has completed,
>> >> my cluster doesn't work properly. I tried restarting the computer (or
>> >> service) a few times but I always get the same messages in my logs
>> (it's
>> >> in French. If someone is willing to help me I can try to translate the
>> >> logs. Just ask):
>> >
>> >
>> > Enter Google Translate:)
>> >
>>
>> But first, Guillaume, do yourself and everyone else a favor: turn the dam
>> log into English. Set lc_messages to 'C' in postgresql.conf.
>>
> Thanks for this! I didn't know about this great feature.
>
>
>> > First some questions:
>> >
>> > 1) What Postgres version?
>> >
>> > 2) What OS(s)? I am assuming Windows from the log info below, but we
>> all know what assuming gets you.
>> >
>> > 3) Where were you backing up from and to?
>> >
>> > 4) Which cluster does not start, the master or the child you created
>> with pg_basebackup?
>> >
>> >
>> >>
>> >> 2015-02-06 07:11:38 EST LOG: le système de bases de données a été
>> >> interrompu ; dernier lancement connu à 2015-02-06 07:05:05 EST
>> >> 2015-02-06 07:11:38 EST LOG: le système de bases de données n'a pas
>> été
>> >> arrêté proprement ; restauration
>> >> automatique en cours
>> >> 2015-02-06 07:11:38 EST LOG: record with zero length at 24B/2C000160
>> >> 2015-02-06 07:11:38 EST LOG: la ré-exécution n'est pas nécessaire
>> >> 2015-02-06 07:11:38 EST LOG: le système de bases de données est prêt
>> >> pour accepter les connexions
>> >> 2015-02-06 07:11:38 EST LOG: lancement du processus autovacuum
>> >> 2015-02-06 07:11:38 EST FATAL: le rôle « 208375PT$ » n'existe pas
>> >
>> >
>> > So where is role 208375PT$ supposed to come from?
>> >
>> >
>> >>
>> >> Then if I start pgAdmin I get a series of pop-ups I have to click OK to
>> >> to continue:
>> >>
>> >> An error has ocurred: Column not found in pgSet: "datlastsysoid"
>> >> An error has ocurred: Column not found in pgSet: datlastsysoid
>> >> An error has ocurred: Column not found in pgSet: oid
>> >> An error has ocurred: Column not found in pgSet: encoding
>> >> An error has ocurred: Column not found in pgSet: Connection to database
>> >> broken
>> >
>> >
>> > Not sure about that this, someone more versed in pgAdmin will have to
>> answer.
>> >
>>
>> Usually you see these messages when you're using a pgadmin major release
>> older than a PostgreSQL make release. For a 9.3 release, that would mean a
>> pgadmin older than 1.18.
>>
>
> I'm running pgadmin 1.18.1
>
>> >
>> >>
>> >> And after that, I went back to the log file and there's new information
>> >> added:
>> >>
>> >> 2015-02-06 07:51:05 EST LOG: processus serveur (PID 184) a été arrêté
>> >> par l'exception 0x80000004
>> >> 2015-02-06 07:51:05 EST DÉTAIL: Le processus qui a échoué exécutait :
>> >> SELECT version();
>> >> 2015-02-06 07:51:05 EST ASTUCE : Voir le fichier d'en-tête C «
>> >> ntstatus.h » pour une description de la valeur
>> >> hexadécimale.
>> >
>> >
>> > Well according to here:
>> >
>> > https://msdn.microsoft.com/en-us/library/cc704588.aspx
>> >
>> > 0x80000004
>> > STATUS_SINGLE_STEP
>> >
>> >
>> > {EXCEPTION} Single Step A single step or trace operation has just been
>> completed.
>> >
>> > A developer is going to have explain what that means.
>> >
>> >
>> >
>> >> 2015-02-06 07:51:05 EST LOG: arrêt des autres processus serveur actifs
>> >> 2015-02-06 07:51:05 EST ATTENTION: arrêt de la connexion à cause de
>> >> l'arrêt brutal d'un autre processus serveur
>> >> 2015-02-06 07:51:05 EST DÉTAIL: Le postmaster a commandé à ce
>> processus
>> >> serveur d'annuler la transaction
>> >> courante et de quitter car un autre processus serveur a quitté
>> >> anormalement
>> >> et qu'il existe probablement de la mémoire partagée corrompue.
>> >> 2015-02-06 07:51:05 EST ASTUCE : Dans un moment, vous devriez être
>> >> capable de vous reconnecter à la base de
>> >> données et de relancer votre commande.
>> >> 2015-02-06 07:51:05 EST LOG: processus d'archivage (PID 692) quitte
>> >> avec le code de sortie 1
>> >> 2015-02-06 07:51:05 EST LOG: tous les processus serveur se sont
>> >> arrêtés, réinitialisation
>> >> 2015-02-06 07:51:15 EST FATAL: le bloc de mémoire partagé pré-existant
>> >> est toujours en cours d'utilisation
>> >> 2015-02-06 07:51:15 EST ASTUCE : Vérifier s'il n'y a pas de vieux
>> >> processus serveur en cours d'exécution. Si c'est le
>> >> cas, fermez-les.
>> >>
>> >> I was about to try restarting postgresql using the base backup I made
>> >> yesterday but since this means I'll have to copy my database again (700
>> >> GB takes a while...) I am looking for a better solution from more
>> >> experienced people.
>> >
>> >
>> >
>> > My suspicion is you copied at least partly over a running server.
>> >
>>
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Guillaume Lelarge 2015-02-09 17:51:33 Re: Cluster seems broken after pg_basebackup
Previous Message Brent Wood 2015-02-09 17:21:22 Re: Partioning with overlapping and non overlapping constraints