Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.

From: Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
Date: 2015-10-05 08:47:24
Message-ID: CAK-MWwS=MvYN=4VQFYbxoMiqUsnCXpc0JF9ChyP+opvhkcrD6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Mon, Oct 5, 2015 at 7:15 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:

> On Sun, Oct 4, 2015 at 4:26 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > On Thu, Oct 1, 2015 at 9:48 PM, Michael Paquier <
> michael(dot)paquier(at)gmail(dot)com>
> > wrote:
> >>
> >>
> >>
> >> On Thu, Oct 1, 2015 at 9:52 PM, Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
> wrote:
> >> >> So wal replay got AccessExclusiveLock on relation 17987 and waiting
> for
> >> >> something.
> >> >> And query waiting for AccessShareLock on the same relation.
> >> >>
> >> > gdb backtrace from stuck startup process:
> >> > (gdb) bt
> >> > #0 0x00007f04ad862633 in select () from
> /lib/x86_64-linux-gnu/libc.so.6
> >> > #1 0x00007f04af86488e in pg_usleep (microsec=<optimized out>) at
> >> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/port/pgsleep.c:53
> >> > #2 0x00007f04af7328ac in WaitExceedsMaxStandbyDelay () at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:171
> >> > #3 ResolveRecoveryConflictWithVirtualXIDs
> >> > (reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=0x7f04b13ba2f0)
> at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:232
> >> > #4 ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x7f04b13ba2f0,
> >> > reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:191
> >> > #5 0x00007f04af544445 in heap_xlog_clean (record=0x7f04b1395b80,
> >> > lsn=107351881751648) at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/heap/heapam.c:7329
> >>
> >> This backtrace is not indicating that this process is waiting on a
> >> relation lock, it is resolving a recovery conflict while removing
> tuples,
> >> killing the virtual transaction depending on if
> max_standby_streaming_delay
> >> or max_standby_archive_delay are set if the conflict gets longer. Did
> you
> >> change the default of those parameters, which is 30s, to -1? This would
> mean
> >> that the standby waits indefinitely.
> >
> > While setting it to -1 gives the startup process permission to wait
> > indefinitely for another back-end which is doing something, I don't think
> > that that means it should have permission to deadlock indefinitely. A
> > deadlock is a different kind of thing that "someone started a transaction
> > and then left on vacation for a month without closing it"
>
> Looking eat that more closely... We know that a startup process is the
> only process in a standby that can take an exclusive lock on a
> relation, per LockAcquireExtended in lock.c and that this lock is
> taken after replaying XLOG_STANDBY_LOCK. The deadlock reported by
> Maxim indicates that the deadlock is showing up in heap_xlog_clean,
> which may happen in the case of a VACUUM for example. It seems to me
> that we could have a deadlock if some WAL records had their order
> swapped. For example, the EXCLUSIVE lock is replayed first and the
> cleanup of relation pages happens before the lock has been released,
> freezing the other backend using pg_relation_size on a standby waiting
> for the access share lock indefinitely. This would indeed prevent
> recovery conflict to work properly, because we should not hold an
> exclusive lock on the relation while replaying heap_xlog_clean. Hence
> it seems to me that we may have an issue with 68a2e52 (XLogInsert
> multi-slot), and it has been introducd in 9.4.
>
> We could perhaps think about adding some safeguards in
> ResolveRecoveryConflictWithSnapshot.
>
> Maxim, are you using commit_delay? It may be possible that this is
> fixed by bab9599 that has been pushed after 9.4.4 was released. Could
> you also check if HEAD or the soon-to-be-released 9.4.5 fix your
> problem?
> Regards,
> --
> Michael
>

Hi Michael,

commit_delay isn't used on master server or replicas.
I would like to check 9.4.5 but I prefer wait for release and not build it
by hands, and from my experience building from git tree could be tricky as
I don't have root access to install additional packages on the server.
However, I will try it in next few days (if I get an permission).

--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/ <http://www.postgresql-consulting.com/>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message wjarl 2015-10-05 15:47:16 BUG #13663: uuid-ossp in multiple schemas same database
Previous Message Michael Paquier 2015-10-05 08:15:15 Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.