Re: Global Deadlock Information

From: Satoshi Nagayasu <satoshi(dot)nagayasu(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: pgsql-cluster-hackers(at)postgresql(dot)org
Subject: Re: Global Deadlock Information
Date: 2010-02-06 16:05:34
Message-ID: 4B6D934E.1000204@gmail.com
Lists: pgsql-cluster-hackers

Hi Markus,

I tried two ways of resolving global deadlocks
during the PostgresForest development.

(1) Use lock_timeout to avoid a global deadlock.

The lock_timeout feature is a very simple way to avoid
a global deadlock situation.

I also disagree with "statement_timeout is the way to avoid
global deadlocks", because statement_timeout also kills
healthy, long-running transactions when it expires.

Some developers (including me!) proposed the lock_timeout
GUC option.

http://archives.postgresql.org/pgsql-hackers/2004-06/msg00935.php
http://archives.postgresql.org/pgsql-hackers/2010-01/msg01167.php

I still believe a "lock timeout" feature could help
resolve global deadlocks in a cluster environment.
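
A minimal sketch of the idea, not PostgreSQL code: two Python
threads stand in for two transactions that take the same two
locks in opposite order. Bounding only the lock wait (here via
threading.Lock.acquire(timeout=...)) lets one side give up and
release what it holds, so the other can finish. All names
(run_transaction, row_a, row_b, T1, T2) and the timeout values
are assumptions made for illustration.

```python
import threading


def run_transaction(name, first, second, lock_timeout_s, both_hold, outcome):
    """Take `first`, then wait at most `lock_timeout_s` seconds for `second`."""
    with first:
        # Wait until both transactions hold their first lock, so the
        # opposite-order wait below is guaranteed to happen.
        both_hold.wait()
        if second.acquire(timeout=lock_timeout_s):
            second.release()
            outcome[name] = "committed"
        else:
            # The lock wait timed out: treat this as an abort. Leaving the
            # `with` block releases `first`, which unblocks the other side.
            outcome[name] = "aborted"


def demo():
    row_a, row_b = threading.Lock(), threading.Lock()
    both_hold = threading.Barrier(2)
    outcome = {}
    # T1 has a short lock timeout, T2 a long one, so T1 gives up first.
    t1 = threading.Thread(
        target=run_transaction,
        args=("T1", row_a, row_b, 0.2, both_hold, outcome),
    )
    t2 = threading.Thread(
        target=run_transaction,
        args=("T2", row_b, row_a, 5.0, both_hold, outcome),
    )
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return outcome


if __name__ == "__main__":
    print(demo())
```

Only the time spent waiting for a lock is bounded here; a
transaction that is busy but not blocked is never interrupted,
which is the difference from a statement-level timeout.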

(2) Use the global wait-for graph to detect a global deadlock.

I had an experimental implementation that used the global
wait-for graph to detect global deadlocks.

http://en.wikipedia.org/wiki/Deadlock#Distributed_deadlock

I used the node (server) identifiers and the pg_locks information
to build the global wait-for graph, and a kill signal
(or pg_cancel_backend()?) to abort the victim transaction
that caused the deadlock.
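
The approach above can be sketched roughly as follows. This is
not the PostgresForest code. It assumes each node has already
reduced its pg_locks contents to (waiter, holder) pairs of
global transaction identifiers; that reduction, the identifiers
(gx1, gx2), and the function names are all hypothetical. The
victim rule (highest identifier in the cycle) is likewise just
one deterministic choice, so that every node evaluating the same
graph would pick the same transaction.

```python
from collections import defaultdict


def build_wait_for_graph(edges_by_node):
    """Merge per-node (waiter, holder) pairs into one global graph."""
    graph = defaultdict(set)
    for edges in edges_by_node.values():
        for waiter, holder in edges:
            graph[waiter].add(holder)
    return graph


def find_cycle(graph):
    """Return the transactions on one wait-for cycle, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = defaultdict(int)
    path = []

    def visit(txn):
        color[txn] = GREY
        path.append(txn)
        for nxt in sorted(graph.get(txn, ())):
            if color[nxt] == GREY:
                # Reached a transaction already on the path: a cycle.
                return path[path.index(nxt):]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[txn] = BLACK
        return None

    for txn in sorted(graph):
        if color[txn] == WHITE:
            cycle = visit(txn)
            if cycle:
                return cycle
    return None


def choose_victim(cycle):
    """Deterministic rule (an assumption): abort the highest identifier."""
    return max(cycle)


if __name__ == "__main__":
    # Each node alone sees only one waiting transaction and no cycle.
    edges_by_node = {
        "node1": [("gx1", "gx2")],
        "node2": [("gx2", "gx1")],
    }
    cycle = find_cycle(build_wait_for_graph(edges_by_node))
    if cycle:
        print("deadlock:", cycle, "victim:", choose_victim(cycle))
```

In the example neither node has a local cycle, so the local
deadlock detector would stay silent; the cycle only shows up
once the edges from both nodes are merged.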

I don't think a callback function that replaces the current
deadlock resolution feature is needed,
but I agree we need a consensus on how to avoid
global deadlocks in the cluster.

Thanks,

On 2010/02/06 18:13, Markus Wanner wrote:
> Hi,
>
> I'd like to start a thread for discussion of the second item on the
> ClusterFeatures [1] list: Global Deadlock Information.
>
> IIRC there are two aspects to this item: a) the plain notification of a
> deadlock and b) some way to control or intercept deadlock resolution.
>
> The problem this item seems to address is the potential for deadlocks
> between transactions on different nodes. Or put another way: between a
> local transaction and one that's to be applied from a remote node (or
> even between two remote ones - similar issue, though). To ensure
> congruency between nodes, they must take the same measures to resolve
> the deadlock, i.e. abort the same transaction(s).
>
> I certainly disagree with the statement on the wiki that the
> "statement_timeout is the way to avoid global deadlocks", because I
> don't want to have to wait that long until a deadlock gets resolved.
> Further it doesn't even guarantee congruency, depending on the
> implementation of your clustering solution.
>
> I fail to see how a plain notification API would help much. After all,
> this could result in one node notifying having aborted transaction A to
> resolve a deadlock while another node notifies having aborted
> transaction B. You'd end up having to abort two (or more) transactions
> instead of just one to resolve a conflict.
>
> It could get more useful, if enabling such a notification would turn off
> the existing deadlock resolver and leave the resolution of the deadlock
> to the clustering solution. I'd call that an interception.
>
> Such an interception API should IMO provide a way to register a
> callback, which replaces the current deadlock resolver. Upon detection
> of a deadlock, the callback should get a list of transaction ids that
> are part of the lock cycle. It's then up to that callback to choose one
> and abort that to resolve the conflict.
>
> And now, Greg's List:
> > 1) What feature does this help add from a user perspective?
>
> Preventing cluster-wide deadlocks (while maintaining congruency of
> replicas).
>
> > 2) Which replication projects would be expected to see an improvement
> > from this addition?
>
> I suspect all multi-master solutions are affected, certainly Postgres-R
> would benefit. Single-master ones certainly don't need it.
>
> > 3) What makes it difficult to implement?
>
> I don't see any real stumbling block. Deciding on an API needs consensus.
>
> > 4) Are there any other items on the list this depends on, or that it
> > is expected to have a significant positive/negative interaction with?
>
> Not that I know of.
>
> > 5) What replication projects include a feature like this already, or a
> > prototype of a similar one, that might be used as a proof of concept
> > or example implementation?
>
> Old Postgres-R versions once had such an interception, but it currently
> lacks a solution for this problem. I don't know of any other project
> that's already solved this.
>
> > 6) Who is already working on it/planning to work on it/needs it for
> > their related project?
>
> I'm not currently working on it and don't plan to do so (at least) until
> PgCon 2010.
>
>
> Cluster hackers, is this a good summary which covers your needs as well?
> Something missing?
>
> Regards
>
> Markus Wanner
>
> [1]: feature wish list of cluster hackers:
> http://wiki.postgresql.org/wiki/ClusterFeatures
>
>

--
NAGAYASU Satoshi <satoshi(dot)nagayasu(at)gmail(dot)com>
