Re: Parallel pg_dump's error reporting doesn't work worth squat

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parallel pg_dump's error reporting doesn't work worth squat
Date: 2016-06-01 06:09:43
Message-ID: 20160601.150943.114823815.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Tue, 31 May 2016 12:29:50 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in <7445(dot)1464712190(at)sss(dot)pgh(dot)pa(dot)us>
> Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> > At Fri, 27 May 2016 13:20:20 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in <14603(dot)1464369620(at)sss(dot)pgh(dot)pa(dot)us>
> >> Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> >>> By the way, the reason of the "invalid snapshot identifier" is
> >>> that some worker threads try to use it after the connection on
> >>> the first worker closed.
>
> >> ... BTW, I don't quite see what the issue is there.
>
> > The master session died from lack of libz and the failure of
> > compressLevel's propagation already fixed. Some of the children
> > that started transactions after the master's death will get the
> > error.
>
> I don't think I believe that theory, because it would require the master
> to not notice the lack of libz before it launches worker processes, but
> instead while the workers are working.

The master actually *didn't* notice the lack of libz until it
launces worker processes before cae2bb1. So the current master
don't suffer the problem, but it is not desirable that sudden
death from any reason of a child causes this kind of behavior.

> But AFAICS, while there are worker
> processes open, the master does nothing except wait for workers and
> dispatch new jobs to them; it does no database work of its own. So the
> libz-isn't-there error has to have occurred in one of the workers.

Yes, the firstly-commanded worker dies from that then the master
disconencts its connection owning the snapshot before terminating
any other workers. It occurs with the current master (9ee56df)
minus cae2bb1, having --without-zlib at configure.

> > If we want prevent it perfectly, one solution could be that
> > non-master children explicitly wait the master to arrive at the
> > "safe" state before starting their transactions. But I suppose it
> > is not needed here.
>
> Actually, I believe the problem is in archive_close_connection, around
> line 295 in HEAD: once the master realizes that one child has failed,
> it first closes its own database connection and only second tries to kill
> the remaining children. So there's a race condition wherein remaining
> children have time to see the missing-snapshot error.

Agreed.

> In the patch I posted yesterday, I reversed the order of those two
> steps, which should fix this problem in most scenarios:
> https://www.postgresql.org/message-id/7005.1464657274@sss.pgh.pa.us

Yeah, just transposing DisconnectDatabase and ShutdownWorkersHard
in archive_close_connection fixed the problem.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2016-06-01 07:50:09 Re: Reviewing freeze map code
Previous Message Etsuro Fujita 2016-06-01 05:58:09 Re: foreign table batch inserts