Re: Parallel pg_dump's error reporting doesn't work worth squat

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parallel pg_dump's error reporting doesn't work worth squat
Date: 2016-05-31 16:29:50
Message-ID: 7445.1464712190@sss.pgh.pa.us
Lists: pgsql-hackers

Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> At Fri, 27 May 2016 13:20:20 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in <14603(dot)1464369620(at)sss(dot)pgh(dot)pa(dot)us>
>> Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> writes:
>>> By the way, the reason for the "invalid snapshot identifier" error
>>> is that some worker threads try to use the snapshot after the
>>> connection on the first worker has closed.

>> ... BTW, I don't quite see what the issue is there.

> The master session died from the lack of libz, and the failure of
> compressLevel's propagation is already fixed. Some of the children
> that started transactions after the master's death will get the
> error.

I don't think I believe that theory, because it would require the master
to not notice the lack of libz before it launches worker processes, but
instead while the workers are working. But AFAICS, while there are worker
processes open, the master does nothing except wait for workers and
dispatch new jobs to them; it does no database work of its own. So the
libz-isn't-there error has to have occurred in one of the workers.

> If we want to prevent it completely, one solution could be for
> non-master children to explicitly wait for the master to arrive at
> the "safe" state before starting their transactions. But I suppose
> it is not needed here.

Actually, I believe the problem is in archive_close_connection, around
line 295 in HEAD: once the master realizes that one child has failed,
it first closes its own database connection and only then tries to kill
the remaining children. So there's a race condition wherein the remaining
children have time to see the missing-snapshot error.

In the patch I posted yesterday, I reversed the order of those two
steps, which should fix this problem in most scenarios:
https://www.postgresql.org/message-id/7005.1464657274@sss.pgh.pa.us
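
To illustrate the ordering issue described above, here is a minimal,
deterministic sketch in Python. It is a toy model, not the pg_dump code:
the class and function names (Master, Worker, shutdown) are hypothetical,
and the assumption that every surviving worker tries to start a
transaction between the two shutdown steps is a deliberate worst case
standing in for the real, asynchronous race window. It relies only on the
fact that an exported snapshot stays importable only while the exporting
session's transaction is still open.

```python
class Master:
    """Toy stand-in for the master session that exported the snapshot."""

    def __init__(self):
        # The exported snapshot is usable only while this is True.
        self.connected = True


class Worker:
    """Toy stand-in for a worker that imports the master's snapshot."""

    def __init__(self, master):
        self.master = master
        self.alive = True
        self.errors = []

    def start_transaction(self):
        # A killed worker does nothing further.
        if not self.alive:
            return
        # A live worker importing a snapshot whose exporter is gone fails.
        if not self.master.connected:
            self.errors.append("invalid snapshot identifier")


def shutdown(master, workers, kill_workers_first):
    """Run the two shutdown steps in the chosen order and collect errors.

    After each step, every worker attempts to start a transaction; this
    models the window in which a surviving worker can observe the state
    left behind by the step that just ran.
    """

    def kill_workers():
        for w in workers:
            w.alive = False

    def close_master():
        master.connected = False

    if kill_workers_first:
        steps = [kill_workers, close_master]
    else:
        steps = [close_master, kill_workers]

    for step in steps:
        step()
        for w in workers:
            w.start_transaction()

    return [e for w in workers for e in w.errors]


def run(kill_workers_first, n_workers=3):
    """Build a fresh master and workers, then run one shutdown."""
    master = Master()
    workers = [Worker(master) for _ in range(n_workers)]
    return shutdown(master, workers, kill_workers_first)
```

In this model, run(kill_workers_first=False) reports the snapshot error
from each surviving worker, while run(kill_workers_first=True) reports
none. Real worker termination is not instantaneous, so the reordering
narrows the window rather than eliminating it in every case.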

regards, tom lane
